SOTA Paper Recommendation
Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
SOTA Review
An exploration of scaling in auto-encoders.
Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens
SOTA Review
Integrates textual information during the tokenizer decoding stage (i.e., de-tokenization), accelerating convergence and enhancing performance.
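A minimal sketch of the idea, assuming a generic transformer de-tokenizer that cross-attends to caption embeddings; all module names and shapes here are illustrative, not the paper's actual architecture:

```python
import torch
import torch.nn as nn

class TextAwareDetokenizer(nn.Module):
    """Sketch: decode compact 1D image tokens back into patch features
    while cross-attending to text, so captions inform de-tokenization."""
    def __init__(self, dim=512, heads=8, num_patches=256, patch_dim=768):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_patches, dim))  # learned patch queries
        self.token_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.to_patch = nn.Linear(dim, patch_dim)

    def forward(self, img_tokens, text_emb):
        # img_tokens: (B, K, dim) compact 1D latents; text_emb: (B, T, dim) caption embeddings
        q = self.queries.unsqueeze(0).expand(img_tokens.size(0), -1, -1)
        q = q + self.token_attn(self.norm1(q), img_tokens, img_tokens)[0]  # read image latents
        q = q + self.text_attn(self.norm2(q), text_emb, text_emb)[0]      # inject text at decode time
        return self.to_patch(q)  # (B, num_patches, patch_dim)
```

The essential point is only the second cross-attention: text enters at de-tokenization time rather than only at encoding time.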
Diffusion Adversarial Post-Training for One-Step Video Generation
SOTA Review
Adversarial post-training turns the DiT into a GAN generator, achieving one-step video generation.
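A minimal sketch of such an adversarial post-training loop, assuming `generator` is a diffusion backbone repurposed to map noise to a sample in a single forward pass; the non-saturating GAN losses and all names here are illustrative, not the paper's exact objective:

```python
import torch
import torch.nn.functional as F

def apt_step(generator, discriminator, real_latents, noise, g_opt, d_opt):
    """One optimization step of adversarial post-training (sketch)."""
    # --- discriminator update: real vs. one-step generated samples ---
    fake = generator(noise).detach()
    d_loss = (F.softplus(-discriminator(real_latents)).mean()
              + F.softplus(discriminator(fake)).mean())
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # --- generator update: fool the discriminator in one step ---
    fake = generator(noise)
    g_loss = F.softplus(-discriminator(fake)).mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```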
GameFactory: Creating New Games with Generative Interactive Videos
SOTA Review
By learning motion controls from a small-scale first-person Minecraft dataset, this framework can transfer these control capabilities to open-domain videos.
LEO: Boosting Mixture of Vision Encoders for Multimodal Large Language Models
SOTA Review
Feature fusion in MLLMs with a mixture of vision encoders; it should be scaled up to support stronger conclusions.
An Empirical Study of Autoregressive Pre-training from Videos
SOTA Review
Pre-training on video with a generative (autoregressive) method; the performance is not yet strong, but it is a worthwhile attempt.
Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction
SOTA Review
I think semantic-level token clustering is a good attempt in AR image generation. LCM has tried sentence-level generation in AR language generation; perhaps, instead of simple clustering, something similar could be tried for images.
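A minimal sketch of the coarse-to-fine idea, assuming a VQ codebook whose entries are grouped by k-means; the function and variable names are illustrative:

```python
import torch
from sklearn.cluster import KMeans

def build_cluster_maps(codebook, num_clusters=64):
    """Sketch: group VQ codebook entries into clusters so an AR model can
    factor p(token) = p(cluster) * p(token | cluster), coarse to fine.
    `codebook` is a (V, d) embedding matrix."""
    labels = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(codebook.numpy())
    token_to_cluster = torch.as_tensor(labels)                  # (V,) cluster id per token
    cluster_to_tokens = [torch.where(token_to_cluster == c)[0]
                         for c in range(num_clusters)]          # inverse (ragged) map
    return token_to_cluster, cluster_to_tokens
```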
Dual Diffusion for Unified Image Generation and Understanding
SOTA Review
A unified understanding-and-generation model, but one that uses a generative model (DiT) as the backbone.
MLLM-as-a-Judge for Image Safety without Human Labeling
SOTA Review
LLM-as-a-judge is already very common in the field of language models. There were earlier MLLM versions, but they were not solid enough. I think the bar for this kind of research is low, but it is hard to dig deeper; this paper manages to raise some interesting points, such as context bias in images.
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
SOTA Review
Good findings, bad methods.
1.58-bit FLUX
SOTA Review
MLSys matters! But the paper provides too few details to comment on, and it is not open source.
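Since the details are unavailable, here is a generic ternary (log2(3) ≈ 1.58-bit) weight-quantization sketch in the style of BitNet b1.58; the paper's actual scheme may well differ:

```python
import torch

def ternary_quantize(w, eps=1e-5):
    """Generic 1.58-bit (ternary) weight quantization sketch:
    map each weight to {-1, 0, +1} with a per-tensor scale."""
    scale = w.abs().mean().clamp(min=eps)   # per-tensor scale
    q = (w / scale).round().clamp(-1, 1)    # entries in {-1, 0, +1}
    return q, scale                         # dequantize: q * scale

w = torch.randn(4, 4)
q, s = ternary_quantize(w)
print(q)  # each entry carries log2(3) ≈ 1.58 bits of information
```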
DiFiC: Your Diffusion Model Holds the Secret to Fine-Grained Clustering
SOTA Review
Uses the latent discriminative ability of diffusion models for fine-grained clustering. I think this work is worth extending, especially since similar papers have explored the classification ability of DMs (albeit coarse-grained classification). It further illustrates the potential of using generative models for representation learning.
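A sketch of the general recipe (not necessarily DiFiC's exact one): run a noised latent through the denoiser once and pool an intermediate activation as the clustering feature. `unet`, its `mid_block` attribute, and the fixed `alpha_bar` are hypothetical placeholders:

```python
import torch

@torch.no_grad()
def diffusion_features(unet, latents, t=250, alpha_bar=0.7):
    """Sketch: treat a diffusion denoiser as a feature extractor.
    Features can then be fed to k-means for fine-grained clustering."""
    feats = []
    hook = unet.mid_block.register_forward_hook(
        lambda mod, inp, out: feats.append(out.mean(dim=(-2, -1))))  # spatial pooling
    noise = torch.randn_like(latents)
    noised = (alpha_bar ** 0.5) * latents + ((1 - alpha_bar) ** 0.5) * noise
    t_batch = torch.full((latents.size(0),), t, dtype=torch.long)
    unet(noised, t_batch)   # one denoising forward pass at step t
    hook.remove()
    return feats[0]         # (B, C) features, ready for clustering
```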
Generative Video Propagation
SOTA Review
A clever framework that uses synthetic data to advance related tasks, with a strong counterexample constraint serving as the learning objective. I think it has real potential for motion or more implicit control references.
CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up
SOTA Review
The locality prior of convolutions is clearly helpful to Transformers, and introducing it into DiTs is an inevitable trend. This paper is worth learning from because it is well written (clear motivation and method), and its code maintenance is also exemplary.
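A mask-based sketch of the conv-like locality idea: each query attends only to keys within a fixed 2D radius. CLEAR itself relies on efficient sparse kernels rather than a dense mask; this version is illustrative only:

```python
import torch

def local_attention(q, k, v, coords, radius=8):
    """Sketch: conv-like local attention over 2D token positions.
    q, k, v: (B, H, N, d); coords: (N, 2) token (row, col) positions."""
    dist = (coords[:, None, :] - coords[None, :, :]).float().norm(dim=-1)  # (N, N)
    mask = dist > radius                                  # True = outside the local window
    attn = (q @ k.transpose(-2, -1)) / q.size(-1) ** 0.5
    attn = attn.masked_fill(mask, float("-inf")).softmax(dim=-1)
    return attn @ v
```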
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
SOTA Review
This paper empirically shows that generation and understanding can promote each other. At the same time, I noticed that the ViT encoder's features are very powerful, and our existing fine-tuning methods have not fully exploited them.
DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment
SOTA Review
Simple but effective: a combination of language-supervised training and unsupervised SSL. I think testing on MMVP is a good idea.
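A minimal sketch of one plausible ingredient: aligning DINOv2 image features with text embeddings via a symmetric CLIP-style contrastive loss. The function below is illustrative; the paper's actual training recipe may differ:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(img_feats, txt_feats, temperature=0.07):
    """Sketch: symmetric contrastive alignment between (possibly frozen)
    DINOv2 image features and text embeddings, CLIP-style."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / temperature        # (B, B) similarity matrix
    targets = torch.arange(img.size(0))         # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```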
VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks
SOTA Review
A general benchmark for VLA; not especially impressive, but it can be referenced in future benchmark design.