SOTA Paper Recommendation
Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
SOTA Review
An exploration of scaling in auto-encoders.
Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens
SOTA Review
Integrates textual information during the tokenizer decoding stage (i.e., de-tokenization), accelerating convergence and enhancing performance.
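A minimal sketch of the idea, assuming a generic transformer de-tokenizer that cross-attends to caption embeddings; all module names and shapes here are illustrative, not the paper's actual architecture:

```python
import torch
import torch.nn as nn

class TextAwareDetokenizer(nn.Module):
    """Sketch: decode compact 1D image tokens back into patch features
    while cross-attending to text, so captions inform de-tokenization."""
    def __init__(self, dim=512, heads=8, num_patches=256, patch_dim=768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_patches, dim))  # learned patch queries
        self.token_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.to_patch = nn.Linear(dim, patch_dim)

    def forward(self, img_tokens, text_emb):
        # img_tokens: (B, K, dim) compact 1D latents; text_emb: (B, T, dim) caption embeddings
        q = self.queries.unsqueeze(0).expand(img_tokens.size(0), -1, -1)
        q = q + self.token_attn(self.norm1(q), img_tokens, img_tokens)[0]  # read image latents
        q = q + self.text_attn(self.norm2(q), text_emb, text_emb)[0]      # inject text at decode time
        return self.to_patch(q)  # (B, num_patches, patch_dim)
```

The essential point is only the second cross-attention: text enters at de-tokenization time rather than only at encoding time.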
Diffusion Adversarial Post-Training for One-Step Video Generation
SOTA Review
Adversarial post-training turns the DiT into a GAN generator, achieving one-step video generation.
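A minimal sketch of such an adversarial post-training loop, assuming `generator` is a diffusion backbone repurposed to map noise to a sample in a single forward pass; the non-saturating GAN losses and all names here are illustrative, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def apt_step(generator, discriminator, real_latents, noise, g_opt, d_opt):
    """One optimization step of adversarial post-training (sketch)."""
    # --- discriminator update: real vs. one-step generated samples ---
    fake = generator(noise).detach()
    d_loss = (F.softplus(-discriminator(real_latents)).mean()
              + F.softplus(discriminator(fake)).mean())
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # --- generator update: fool the discriminator in one step ---
    fake = generator(noise)
    g_loss = F.softplus(-discriminator(fake)).mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```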
GameFactory: Creating New Games with Generative Interactive Videos
SOTA Review
By learning motion controls from a small-scale first-person Minecraft dataset, this framework can transfer these control capabilities to open-domain videos.
LEO: Boosting Mixture of Vision Encoders for Multimodal Large Language Models
SOTA Review
Feature fusion in MLLMs with a mixture of vision encoders; it should be scaled up to support stronger conclusions.
An Empirical Study of Autoregressive Pre-training from Videos
SOTA Review
Pre-training on video with a generative (autoregressive) method; the performance is not yet strong, but it is a worthwhile attempt.
Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction
SOTA Review
I think semantic-level token clustering is a good attempt in AR image generation. LCM has tried sentence-level generation in AR language generation; perhaps, instead of simple clustering, something similar could be tried for images.
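A minimal sketch of the coarse-to-fine idea, assuming a VQ codebook whose entries are grouped by k-means; the function and variable names are illustrative:

```python
import torch
from sklearn.cluster import KMeans

def build_cluster_maps(codebook, num_clusters=64):
    """Sketch: group VQ codebook entries into clusters so an AR model can
    factor p(token) = p(cluster) * p(token | cluster), coarse to fine.
    `codebook` is a (V, d) embedding matrix."""
    labels = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(codebook.numpy())
    token_to_cluster = torch.as_tensor(labels)                  # (V,) cluster id per token
    cluster_to_tokens = [torch.where(token_to_cluster == c)[0]
                         for c in range(num_clusters)]          # inverse (ragged) map
    return token_to_cluster, cluster_to_tokens
```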
Dual Diffusion for Unified Image Generation and Understanding
SOTA Review
A unified understanding-and-generation model, but one that uses a generative model (DiT) as the backbone.
MLLM-as-a-Judge for Image Safety without Human Labeling
SOTA Review
LLM-as-a-judge is already very common in the field of language models. There were earlier MLLM versions, but they were not solid enough. I think the bar for this kind of research is low, but it is hard to dig deeper; this paper manages to raise some interesting points, such as context bias in images.
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
SOTA Review
Good findings, bad methods.
1.58-bit FLUX
SOTA Review
MLSys matters! But the paper provides too few details to comment on, and it is not open source.
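Since the details are unavailable, here is a generic ternary (log2(3) ≈ 1.58-bit) weight-quantization sketch in the style of BitNet b1.58; the paper's actual scheme may well differ:

```python
import torch

def ternary_quantize(w, eps=1e-5):
    """Generic 1.58-bit (ternary) weight quantization sketch:
    map each weight to {-1, 0, +1} with a per-tensor scale."""
    scale = w.abs().mean().clamp(min=eps)   # per-tensor scale
    q = (w / scale).round().clamp(-1, 1)    # entries in {-1, 0, +1}
    return q, scale                         # dequantize: q * scale

w = torch.randn(4, 4)
q, s = ternary_quantize(w)
print(q)  # each entry carries log2(3) ≈ 1.58 bits of information
```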
DiFiC: Your Diffusion Model Holds the Secret to Fine-Grained Clustering
SOTA Review
Uses the latent discriminative ability of diffusion models for fine-grained clustering. I think this work is worth extending, especially since similar papers have explored the classification ability of DMs (albeit coarse-grained classification). It further illustrates the potential of using generative models for representation learning.
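A sketch of the general recipe (not necessarily DiFiC's exact one): run a noised latent through the denoiser once and pool an intermediate activation as the clustering feature. `unet`, its `mid_block` attribute, and the fixed `alpha_bar` are hypothetical placeholders:

```python
import torch

@torch.no_grad()
def diffusion_features(unet, latents, t=250, alpha_bar=0.7):
    """Sketch: treat a diffusion denoiser as a feature extractor.
    Features can then be fed to k-means for fine-grained clustering."""
    feats = []
    hook = unet.mid_block.register_forward_hook(
        lambda mod, inp, out: feats.append(out.mean(dim=(-2, -1))))  # spatial pooling
    noise = torch.randn_like(latents)
    noised = (alpha_bar ** 0.5) * latents + ((1 - alpha_bar) ** 0.5) * noise
    t_batch = torch.full((latents.size(0),), t, dtype=torch.long)
    unet(noised, t_batch)   # one denoising forward pass at step t
    hook.remove()
    return feats[0]         # (B, C) features, ready for clustering
```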
Generative Video Propagation
SOTA Review
A clever framework that uses synthetic data to advance related tasks, with a strong counterexample constraint serving as the learning objective. I think it has real potential for motion or more implicit control references.
CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up
SOTA Review
The locality prior of convolutions is clearly helpful to Transformers, and introducing it into DiTs is an inevitable trend. This paper is worth learning from because it is well written (clear motivation and method), and its code maintenance is also exemplary.
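A mask-based sketch of the conv-like locality idea: each query attends only to keys within a fixed 2D radius. CLEAR itself relies on efficient sparse kernels rather than a dense mask; this version is illustrative only:

```python
import torch

def local_attention(q, k, v, coords, radius=8):
    """Sketch: conv-like local attention over 2D token positions.
    q, k, v: (B, H, N, d); coords: (N, 2) token (row, col) positions."""
    dist = (coords[:, None, :] - coords[None, :, :]).float().norm(dim=-1)  # (N, N)
    mask = dist > radius                                  # True = outside the local window
    attn = (q @ k.transpose(-2, -1)) / q.size(-1) ** 0.5
    attn = attn.masked_fill(mask, float("-inf")).softmax(dim=-1)
    return attn @ v
```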
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
SOTA Review
This paper empirically shows that generation and understanding can promote each other. At the same time, I noticed that the ViT encoder's features are very powerful, and our existing fine-tuning methods have not fully exploited them.
DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment
SOTA Review
Simple but effective: a combination of language-supervised training and unsupervised SSL. I think testing on MMVP is a good idea.
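A minimal sketch of one plausible ingredient: aligning DINOv2 image features with text embeddings via a symmetric CLIP-style contrastive loss. The function below is illustrative; the paper's actual training recipe may differ:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_feats, txt_feats, temperature=0.07):
    """Sketch: symmetric contrastive alignment between (possibly frozen)
    DINOv2 image features and text embeddings, CLIP-style."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(img.size(0))         # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```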
VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks
SOTA Review
A general benchmark for VLA; not especially impressive, but it can be referenced in future benchmark design.