SOTA Paper Recommendation

Paper 18 Thumbnail

Learnings from Scaling Visual Tokenizers for Reconstruction and Generation

Rating: ★★★★☆

SOTA Review

An exploration of scaling in auto-encoders.

Paper Link
Paper 17 Thumbnail

Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens

Rating: ★★★★☆

SOTA Review

Integrates textual information during the tokenizer decoding stage (i.e., de-tokenization), which accelerates convergence and enhances performance (a conceptual sketch follows this entry).

Paper Link
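
A minimal sketch of the idea described above, not the paper's implementation: the class name, dimensions, and the use of cross-attention to inject caption embeddings into the de-tokenizer are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class TextAwareDetokenizerBlock(nn.Module):
    """Hypothetical decoder block: compact 1D latent tokens cross-attend to
    text embeddings while being decoded back toward image features."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, latent_tokens: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # latent_tokens: (B, N, D) compact 1D tokens from the tokenizer encoder
        # text_emb:      (B, T, D) caption embeddings from a (frozen) text encoder
        x = latent_tokens
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # inject textual information during de-tokenization via cross-attention
        x = x + self.cross_attn(self.norm2(x), text_emb, text_emb, need_weights=False)[0]
        x = x + self.mlp(self.norm3(x))
        return x  # further layers would map these features back to pixels

# toy usage
block = TextAwareDetokenizerBlock()
tokens = torch.randn(2, 32, 256)   # 32 compact 1D tokens per image
caption = torch.randn(2, 77, 256)  # projected caption embeddings
out = block(tokens, caption)       # (2, 32, 256)
```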
Paper 16 Thumbnail

Diffusion Adversarial Post-Training for One-Step Video Generation

Rating: ★★★★☆

SOTA Review

Adversarially post-trains a DiT as a GAN generator to achieve one-step video generation (a conceptual sketch follows this entry).

Paper Link
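
A rough sketch of the general recipe implied by the title, adversarial post-training, with hypothetical `generator`/`discriminator` interfaces; this is not the paper's actual training code.

```python
import torch
import torch.nn.functional as F

def adversarial_post_training_step(generator, discriminator, g_opt, d_opt, real_videos):
    """One optimization step that fine-tunes a pretrained diffusion backbone
    (`generator`) to map noise to a video in a single forward pass, using a
    non-saturating GAN objective instead of iterative denoising."""
    noise = torch.randn_like(real_videos)

    # --- discriminator update: real videos vs. one-step generations ---
    with torch.no_grad():
        fake_videos = generator(noise)                     # single step, no sampling loop
    d_loss = (F.softplus(-discriminator(real_videos)).mean()
              + F.softplus(discriminator(fake_videos)).mean())
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # --- generator update: make one-step outputs fool the discriminator ---
    fake_videos = generator(noise)
    g_loss = F.softplus(-discriminator(fake_videos)).mean()
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```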
Paper 15 Thumbnail

GameFactory: Creating New Games with Generative Interactive Videos

Rating: ★★★★☆

SOTA Review

By learning motion controls from a small-scale first-person Minecraft dataset, this framework can transfer these control capabilities to open-domain videos.

Paper Link
Paper 14 Thumbnail

LEO: Boosting Mixture of Vision Encoders for Multimodal Large Language Models

Rating: ★★★☆☆

SOTA Review

Feature fusion for a mixture-of-experts vision encoder in MLLMs; the study should be scaled up to support stronger conclusions.

Paper Link
Paper 13 Thumbnail

An Empirical Study of Autoregressive Pre-training from Videos

Rating: ★★★★☆

SOTA Review

Pre-trains on video with a generative (autoregressive) objective; the performance is not yet compelling, but it is a worthwhile attempt.

Paper Link
Paper 12 Thumbnail

Gaussian Masked Autoencoders

Rating: ★★★★☆

SOTA Review

A 3D MAE built on Gaussian Splatting; very solid, Meta-style work.

Paper Link
Paper 11 Thumbnail

Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction

Rating: ★★★☆☆

SOTA Review

Semantic-level token clustering is a promising direction for AR image generation. LCM has already explored sentence-level generation in AR language modeling; perhaps, instead of simple clustering, something similar could be attempted for images (a toy sketch follows this entry).

Paper Link
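
A toy illustration of one way "cluster-oriented" token prediction could be set up, grouping a VQ codebook into semantic clusters and predicting coarse-to-fine; the codebook, cluster count, and two-stage target mapping are assumptions for illustration, not the paper's method.

```python
import torch
from sklearn.cluster import KMeans

# Stand-in for a real VQ codebook: (num_tokens, embed_dim).
codebook = torch.randn(8192, 256)

# Group codebook entries into semantic clusters.
kmeans = KMeans(n_clusters=128, n_init=10, random_state=0).fit(codebook.numpy())
token_to_cluster = torch.as_tensor(kmeans.labels_, dtype=torch.long)  # (8192,)

def coarse_to_fine_targets(token_ids: torch.Tensor):
    """Map ground-truth token ids to (cluster id, token id) pairs, so an AR
    model can first predict the coarse cluster, then the token within it."""
    return token_to_cluster[token_ids], token_ids

# toy usage
gt_tokens = torch.randint(0, 8192, (2, 16))  # (batch, sequence)
cluster_targets, token_targets = coarse_to_fine_targets(gt_tokens)
```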
Paper 10 Thumbnail

Dual Diffusion for Unified Image Generation and Understanding

Rating: ★★★★☆

SOTA Review

A unified understanding-and-generation model that uses a generative model (DiTs) as its backbone.

Paper Link
Paper 9 Thumbnail

MLLM-as-a-Judge for Image Safety without Human Labeling

Rating: ★★★★☆

SOTA Review

LLM-as-a-judge is already very common for language models. There was an earlier MLLM version, but it was not solid enough. The bar for this kind of research is low, yet it is hard to dig deeper; this paper manages to raise some interesting points, such as contextual bias in images.

Paper Link
Paper 8 Thumbnail

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

Rating: ★★★★☆ (for findings)

SOTA Review

Good findings, bad methods.

Paper Link
Paper 7 Thumbnail

1.58-bit FLUX

Rating: ★★★★☆ (if open-sourced)

SOTA Review

ML systems work is important! But the paper gives too few details to comment on, and it is not open source.

Paper Link
Paper 6 Thumbnail

DiFiC: Your Diffusion Model Holds the Secret to Fine-Grained Clustering

Rating: ★★★☆☆

SOTA Review

Uses the latent discriminative ability of diffusion models for fine-grained clustering. This work is worth expanding, especially since similar papers have explored the classification ability of DMs (although only coarse-grained). It further illustrates the potential of generative models for representation learning.

Paper Link
Paper 5 Thumbnail

Generative Video Propagation

Rating: ★★★☆☆

SOTA Review

A clever framework that uses synthetic data to advance related tasks, with a strong counterexample constraint serving as the learning objective. There is potential to apply it to motion or to more implicit control references.

Paper Link
Paper 4 Thumbnail

CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up

Rating: ★★☆☆☆

SOTA Review

Conv-based locality is clearly helpful for transformers, and its introduction into DiTs is an inevitable trend. The paper is worth learning from: it is well written (clear motivation and method), and its code maintenance is also exemplary.

Paper Link
Paper 3 Thumbnail

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

Rating: ★★★★☆

SOTA Review

This paper empirically shows that generation and understanding can promote each other. It also suggests that the features of the ViT encoder are very powerful and that existing fine-tuning methods have not fully exploited them.

Paper Link
Paper 2 Thumbnail

DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment

Rating: ★★★★☆

SOTA Review

Simple but effective: a combination of language-supervised and unsupervised self-supervised learning. Testing on MMVP is a good idea.

Paper Link
Paper 1 Thumbnail

VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks

Rating: ★☆☆☆☆

SOTA Review

A general benchmark for VLA; not particularly impressive, but useful as a reference for future benchmark design.

Paper Link