SOTA Paper Recommendation

Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control
SOTA Review
For a world model with strong generalization ability, I think the key point is still simulation data.

QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
Purshow Review
Training a unified encoder by jointly reconstructing and aligning is a very important and interesting direction.
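A minimal sketch of what "jointly reconstructing and aligning" could look like as a training objective, assuming a shared visual encoder with a pixel decoder plus a text encoder; the module names, shapes, and loss weighting below are illustrative, not QLIP's actual implementation.

```python
import torch
import torch.nn.functional as F

def joint_tokenizer_loss(image_encoder, image_decoder, text_encoder,
                         images, texts, align_weight=1.0, temperature=0.07):
    """Hypothetical joint objective: reconstruct pixels AND align with text.

    image_encoder / image_decoder / text_encoder are placeholder modules,
    not QLIP's actual architecture.
    """
    z = image_encoder(images)                      # (B, D) pooled visual latent
    recon = image_decoder(z)                       # (B, C, H, W)

    # 1) Reconstruction term keeps the tokens pixel-faithful.
    recon_loss = F.mse_loss(recon, images)

    # 2) CLIP-style contrastive term keeps the tokens text-aligned.
    t = text_encoder(texts)                        # (B, D)
    z_n = F.normalize(z, dim=-1)
    t_n = F.normalize(t, dim=-1)
    logits = z_n @ t_n.T / temperature             # (B, B) similarity matrix
    labels = torch.arange(len(images), device=logits.device)
    align_loss = (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels)) / 2

    return recon_loss + align_weight * align_loss
```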

Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning
Purshow Review
One question I'm concerned about is whether using discrete thinking embeddings is detrimental to fine-grained reasoning; I think this should be validated on the currently challenging MATH dataset.

Training Large Language Models to Reason in a Continuous Latent Space
Purshow Review
A very interesting exploration; looking forward to further experiments on larger LLMs.

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding
SOTA Review
Yuwei was scooped by this paper, but I think the embedding-matrix design is quite reasonable for future modality interaction.
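A rough sketch of the kind of embedding-matrix connector meant here: vision features are mapped to a distribution over the LLM's text-embedding rows and replaced by the resulting convex combination, so they land "in distribution" for the LLM. The shapes and module names are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class EmbeddingMatrixConnector(nn.Module):
    """Hypothetical connector: project vision features into the convex hull
    of the LLM's text embedding matrix (shapes/names are illustrative)."""

    def __init__(self, vision_dim: int, text_embedding: nn.Embedding):
        super().__init__()
        vocab_size, hidden = text_embedding.weight.shape
        self.to_vocab_logits = nn.Linear(vision_dim, vocab_size)
        self.text_embedding = text_embedding   # the (frozen) LLM embedding matrix

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (B, N, vision_dim) patch features from the ViT.
        probs = self.to_vocab_logits(vision_feats).softmax(dim=-1)   # (B, N, V)
        # Convex combination of text embeddings -> outputs live in the same
        # space the LLM already understands.
        return probs @ self.text_embedding.weight                    # (B, N, hidden)
```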

Diffusion Autoencoders are Scalable Image Tokenizers
Purshow Review
A novel image tokenization method using diffusion models. Simple yet effective.

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Purshow Review
This paper effectively contrasts RL and SFT, demonstrating RL's superior generalization capabilities and highlighting SFT's tendency to memorize training data rather than generalize. I think it would be very interesting to explore more scenarios.

Can We Generate Images with CoT? Let’s Verify and Reinforce Image Generation Step by Step
Purshow Review
A first exploration of CoT image generation in masked AR. Some problems are worth further exploration: (1) adapting the method to more models, even standard AR models rather than masked AR; (2) suitable benchmarks that truly test the model's CoT benefits; (3) the potential for unification with CoT in AR understanding.

Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
Purshow Review
An exploration of scaling in auto-encoders.

Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens
SOTA Review
Integrates textual information during the tokenizer decoding stage (i.e., de-tokenization), accelerating convergence and enhancing performance.
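One plausible way to realize such text-aware de-tokenization is to let the compact 1D image tokens cross-attend to text embeddings inside the decoder; the block below is a sketch under that assumption, not the paper's actual decoder.

```python
import torch
import torch.nn as nn

class TextAwareDetokenizerBlock(nn.Module):
    """Illustrative decoder block: 1D image tokens cross-attend to text
    embeddings during de-tokenization (names and shapes are assumptions)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) compact 1D image tokens; text: (B, T, dim).
        h = self.norm1(tokens)
        x = tokens + self.self_attn(h, h, h)[0]
        # Text conditioning enters here, at de-tokenization time.
        x = x + self.cross_attn(self.norm2(x), text, text)[0]
        return x + self.mlp(self.norm3(x))
```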

Diffusion Adversarial Post-Training for One-Step Video Generation
SOTA Review
Using the DiT as a GAN achieves one-step video generation.

GameFactory: Creating New Games with Generative Interactive Videos
SOTA Review
By learning motion controls from a small-scale first-person Minecraft dataset, this framework can transfer these control capabilities to open-domain videos.

LEO: Boosting Mixture of Vision Encoders for Multimodal Large Language Models
SOTA Review
Feature fusion in MLLMs for a mixture of vision encoders; it should be scaled up to draw stronger conclusions.

An Empirical Study of Autoregressive Pre-training from Videos
SOTA Review
Pretraining on video with a generative method; the performance is not good enough yet, but it's a good try.

Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction
SOTA Review
I think semantic-level token clustering is a good attempt in AR image generation. LCM has tried sentence-level generation in AR language generation. Perhaps instead of simple clustering, people can try something similar for images?
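To make "semantic-level token clustering" concrete: one simple way (an assumption on my part, not the paper's exact recipe) is to k-means the VQ codebook embeddings into coarse groups and let the AR model predict a cluster id before the fine token.

```python
import torch
from sklearn.cluster import KMeans

def build_token_clusters(codebook: torch.Tensor, num_clusters: int = 512) -> torch.Tensor:
    """Cluster VQ codebook embeddings into coarse 'semantic' groups.

    codebook: (V, D) code embeddings on CPU; returns a (V,) tensor mapping
    each fine token id to a coarse cluster id. Purely illustrative.
    """
    kmeans = KMeans(n_clusters=num_clusters, n_init=10).fit(codebook.numpy())
    return torch.from_numpy(kmeans.labels_).long()

# During AR training one could then factorize prediction as
#   p(token) = p(cluster) * p(token | cluster),
# predicting the coarse cluster first and conditioning the fine head on it.
```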

Dual Diffusion for Unified Image Generation and Understanding
SOTA Review
A unified understanding-and-generation model, but using a generative model (DiT) as the backbone.

MLLM-as-a-Judge for Image Safety without Human Labeling
SOTA Review
LLM-as-judge is already very common in the language-model field. There were MLLM versions before, but they were not solid enough. I think the bar for this kind of research is very low, but it is difficult to dig deeper. This paper manages to raise some interesting points, such as context bias in images.

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
SOTA Review
Good findings, bad methods.

1.58-bit FLUX
SOTA Review
MLSys is important! But the details are too sparse to comment on, and the work is not open source.

DiFiC: Your Diffusion Model Holds the Secret to Fine-Grained Clustering
SOTA Review
Uses the implicit discriminative ability of DMs for fine-grained clustering. I think this work is worth expanding, especially since there are similar papers exploring the classification ability of DMs (although at a coarser granularity). It further illustrates the potential of using generative models for representation learning.

Generative Video Propagation
SOTA Review
A magical framework that uses synthetic data to advance related tasks, with a strong counterexample constraint used as the learning objective. I think there is some potential to apply it to motion or more implicit control references.

CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up
SOTA Review
Conv-based locality improvements are definitely helpful to transformers, and introducing them into DiTs is an inevitable trend. This paper is worth learning from because it is well written (clear motivation and method), and its code maintenance is also exemplary.
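To make the conv-like locality point concrete, here is a toy sketch of windowed attention, where each query only attends to keys within a fixed radius; this is a generic locality mask, not CLEAR's exact formulation, and for clarity it still materializes the full score matrix (a real linear-cost implementation would restrict computation to the window).

```python
import torch
import torch.nn.functional as F

def local_attention_1d(q, k, v, window: int = 8):
    """Toy local attention: token i attends only to tokens within `window`
    positions. q, k, v: (B, H, N, D)."""
    B, H, N, D = q.shape
    scores = q @ k.transpose(-2, -1) / D ** 0.5           # (B, H, N, N)
    idx = torch.arange(N, device=q.device)
    blocked = (idx[None, :] - idx[:, None]).abs() > window  # True = outside window
    scores = scores.masked_fill(blocked, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```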

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
SOTA Review
This paper empirically shows that generation and understanding can promote each other. At the same time, I noticed that the ViT encoder's features are very powerful, and existing fine-tuning methods have not fully utilized them.

DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment
SOTA Review
Simple but effective: a combination of language-supervised and unsupervised SSL. I think it is a good idea to test on MMVP.

VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks
SOTA Review
A general benchmark for VLA; not really impressive, but it can be referenced in future benchmark design.