SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models

Introducing SRUM (Self-Rewarding for Unified Multimodal Models), a post-training framework that creates a cost-effective, self-iterative optimization loop. SRUM uses a model's own understanding component to guide and enhance its generation component, yielding better compositional, reasoning-informed, and knowledge-informed generation.

  • Self-Rewarding Loop: SRUM leverages a Unified Model's own understanding module to provide internal reward signals to its generative module, eliminating the need for costly external judges.
  • Fine-Grained Feedback: Our internal reward is decomposed into a global reward for overall compositional correctness and a local reward for attribute fidelity, enabling multi-scale refinement.
  • State-of-the-Art Improvements: SRUM significantly improves compositional and reasoning-informed generation, achieving SOTA results on T2I-CompBench (82.18 → 88.37) and T2I-ReasonBench (43.82 → 46.75).


To address the challenge of Unified Multimodal Models (UMMs) failing to generate complex images that match their powerful understanding, we introduce SRUM, a self-rewarding framework. It creates a self-improvement loop where the model's own understanding module acts as an internal "teacher," providing corrective rewards to its generation module without needing new human-labeled data. The core innovation is a two-part reward system focusing on both global composition and local object-level details. This method establishes a new state of the art, significantly boosting image accuracy on key benchmarks like T2I-CompBench (from 82.18 to 88.37) and T2I-ReasonBench (from 43.82 to 46.75). This page is structured around three key components:

  1. §Methodology: We detail the SRUM framework, including the rewarding process, the fine-grained judgment design (global and local rewards), and the weighted training objective.
  2. §Results: We showcase how SRUM significantly boosts performance on challenging compositional benchmarks like T2I-CompBench and demonstrate its strong generalization capabilities.
  3. §Analysis and Insights: We provide an empirical analysis of SRUM's components, revealing how different reward designs impact the model's inference process at various stages.


How SRUM Works: A Step-by-Step Pipeline

Figure 1: The SRUM pipeline, including reward generation, the design of fine-grained rewards, and their application during the reward-weighted training phase.

This section details the SRUM pipeline. Our process begins with generating high-quality image candidates using a Unified Multimodal Model (UMM). These candidates are then evaluated by a dual-level system that assesses both local fidelity and global composition. The resulting scores are transformed into a dense, spatially-aware reward map, which is integrated into a novel reward-weighted training objective. This allows for targeted, region-specific model refinement while preventing "reward hacking."
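As a rough mental model, one self-rewarding round can be summarized in a few lines of code. The following is a minimal Python sketch: every method on the `umm` object is a hypothetical placeholder for the step described in the corresponding subsection below, not part of any released API.

```python
# Minimal sketch of one SRUM self-improvement round; every method on `umm`
# is a hypothetical placeholder for a pipeline step described below.
def srum_round(umm, prompts, lambda_c=0.5):
    batch = []
    for prompt in prompts:
        image = umm.generate(prompt, think=True)                       # Step 1: candidate image ("think" / CoT mode)
        boxes = umm.propose_and_filter_boxes(image, prompt)            # Step 1: prompt-aligned bounding boxes
        local_scores, global_score = umm.judge(image, prompt, boxes)   # Step 2: dual-level self-judgment
        reward_map = umm.rasterize_rewards(local_scores, boxes, image) # Step 2: dense reward map
        batch.append((prompt, image, reward_map, global_score))
    umm.finetune(batch, lambda_c=lambda_c)                             # Step 3: reward-weighted training
    return umm
```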

1. Image & Bounding Box Generation

As shown in the figure, our pipeline begins by using a UMM to synthesize candidate images from input prompts, leveraging a "think" mode (Chain-of-Thought) for high-fidelity outputs. We then produce bounding box proposals for each image. To enable precise grounding for the reward modeling, the UMM's understanding module filters these proposals, retaining only those that are semantically aligned with the initial prompt.
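A minimal sketch of this filtering step is shown below. It assumes a hypothetical `judge` callable that wraps the understanding module as a yes/no alignment check on a cropped region; the names and signatures are illustrative, not the paper's implementation.

```python
# Illustrative proposal filtering: keep a bounding box only if the UMM's
# understanding module judges the cropped region to be semantically aligned
# with the prompt. `judge(region, prompt) -> bool` is an assumed wrapper.
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates


def filter_proposals(image, prompt: str, proposals: List[Box],
                     judge: Callable) -> List[Box]:
    kept = []
    for x1, y1, x2, y2 in proposals:
        region = image.crop((x1, y1, x2, y2))  # PIL-style crop of the proposal
        if judge(region, prompt):              # understanding module as the filter
            kept.append((x1, y1, x2, y2))
    return kept
```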

2. Self-Judgment and Reward Generation

Next, we devise a dual-level judgment mechanism to assess image quality and prompt alignment.

  • Local Judgment: Inspects specific object regions to assess their accuracy and surface visual errors. The model rates each region on a strict scale from -1.0 to 1.0 and must justify the score with explicit step-by-step reasoning, keeping the process transparent and trustworthy.
  • Global Judgment: Checks whether the overall layout of the image matches the user's prompt. If the prompt gives no explicit layout instructions, a neutral score is assigned so that a reasonable layout is not penalized unfairly.

Following the judgment, we leverage the UMM's grounding capabilities to generate fine-grained reward scores for all relevant image regions. These regional rewards are aggregated into a dense reward map for integration into our training objective.
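To make the aggregation concrete, the sketch below rasterizes per-region local scores into a dense reward map. The tensor layout and the simple rule that later boxes overwrite earlier ones on overlap are assumptions made for illustration, not the paper's exact recipe.

```python
# Illustrative construction of a dense, spatially-aware reward map R.
# Each local judgment score in [-1, 1] is painted into its bounding box;
# pixels covered by no box keep a neutral value of 0.
import torch


def build_reward_map(local_scores, boxes, height, width, neutral=0.0):
    """local_scores[i] is the judgment score for boxes[i] = (x1, y1, x2, y2)."""
    reward_map = torch.full((1, height, width), neutral)
    for score, (x1, y1, x2, y2) in zip(local_scores, boxes):
        reward_map[:, y1:y2, x1:x2] = score  # later boxes overwrite on overlap
    return reward_map
```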

3. Reward-Weighted Training

The core of our method is a novel training objective that uses the reward map to refine the model. It consists of two main components:

  1. A reward-driven term $\mathcal{L}_{\text{r}}$ that uses the global score $\alpha$ and the regional reward map $R$ to modulate the training process. This mechanism enables fine-grained control, encouraging preservation where feedback is positive ($\alpha \cdot R > 0$) and repulsion where it is negative ($\alpha \cdot R < 0$). $$\mathcal{L}_{\text{r}} = \mathbb{E} \left[ \alpha \cdot R \odot \left( v_\theta - (\epsilon - x_0^{\text{gt}}) \right)^2 \right]$$
  2. A constraint term $\mathcal{L}_{\text{ref}}$ that acts as a regularizer. It prevents the model from making drastic changes that distort the overall image structure, thereby preventing reward hacking. $$\mathcal{L}_{\text{ref}} = \mathbb{E} \left[ \left\| v_\theta - (\epsilon - x_0^{\text{gt}}) \right\|^2 \right]$$

The final training objective is a weighted sum of these two losses. This composite design enables targeted local refinement while maintaining global coherence and safeguarding the output from significant distortion.

$$\mathcal{L}_{\text{Total}} = \mathcal{L}_{\text{r}} + \lambda_{\text{c}} \cdot \mathcal{L}_{\text{ref}}$$
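The objective above translates almost directly into code. Below is a minimal PyTorch sketch assuming a flow-matching parameterization in which the model predicts a velocity $v_\theta$ toward the target $\epsilon - x_0^{\text{gt}}$; the tensor shapes and the plain mean reduction standing in for the expectation are illustrative assumptions.

```python
import torch


def srum_loss(v_pred, eps, x0_gt, reward_map, alpha, lambda_c=0.5):
    """Reward-weighted SRUM objective (illustrative sketch).

    v_pred     : model velocity prediction, shape (B, C, H, W)
    eps        : sampled noise, same shape
    x0_gt      : image treated as ground truth, same shape
    reward_map : dense regional rewards R, broadcastable to v_pred
    alpha      : scalar global reward
    lambda_c   : constraint weight (0.5 was the most effective value in our ablations)
    """
    target = eps - x0_gt
    sq_err = (v_pred - target) ** 2

    # Reward-driven term: alpha * R > 0 encourages preservation (pull toward
    # the target); alpha * R < 0 induces repulsion (push away from it).
    loss_r = (alpha * reward_map * sq_err).mean()

    # Reference constraint: plain regression to the target, guarding against
    # reward hacking and large structural distortion.
    loss_ref = sq_err.mean()

    return loss_r + lambda_c * loss_ref
```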

Experiments and Analysis

We validate our SRUM framework across a range of models and benchmarks, investigating several key aspects detailed below.

Experimental Setup

Models: We evaluate SRUM as a post-training phase on two powerful open-source UMMs: Bagel (our primary model for in-depth analysis, in both standard and Chain-of-Thought modes) and BLIP3o (to validate general effectiveness).

Benchmarks: Our training instruction data is sourced from T2I-CompBench, which also serves as our primary evaluation benchmark. To assess generalization, we test in-domain transferability on GenEval and WISE, and broader, out-of-domain reasoning on the challenging T2I-ReasonBench. For objective scoring, we use Qwen2.5-VL-72B as the designated multimodal evaluator.

Main Results on T2I-CompBench

SRUM yields substantial and consistent performance gains across nearly all compositional categories. On T2I-CompBench, Bagel+SRUM with CoT achieves the highest score among UMMs at 88.37, a 3.91-point improvement over its CoT baseline (84.46). The gains are most pronounced in categories requiring sophisticated reasoning, where we set new SOTA scores.

| Model | 3D Spatial | Color | Complex | Non-spatial | Numeracy | Shape | Spatial | Texture | Overall |
|---|---|---|---|---|---|---|---|---|---|
| *T2I Models* | | | | | | | | | |
| FLUX.1-dev | 76.39 | 90.63 | 83.51 | 87.47 | 75.30 | 80.20 | 84.23 | 87.07 | 83.10 |
| FLUX.1-schnell | 79.38 | 84.53 | 81.96 | 85.55 | 72.82 | 82.20 | 85.49 | 86.38 | 82.29 |
| SD-3-medium | 77.83 | 91.63 | 84.73 | 86.12 | 72.80 | 83.72 | 88.20 | 89.03 | 84.26 |
| SD-xl-base-1 | 72.25 | 77.75 | 75.00 | 85.28 | 57.14 | 72.18 | 77.08 | 78.38 | 74.38 |
| *Unified Multimodal Models* | | | | | | | | | |
| Janus-Pro | 76.17 | 84.25 | 80.28 | 80.47 | 56.43 | 65.14 | 79.67 | 69.67 | 74.01 |
| Show-o2 | **88.61** | 87.73 | 87.88 | 85.91 | 69.74 | 73.99 | 86.60 | 82.17 | 82.83 |
| OmniGen2 | 82.21 | 92.22 | 86.87 | 88.51 | 72.00 | 83.95 | 90.07 | **90.88** | 85.84 |
| BLIP3o | 81.73 | 89.92 | 85.55 | 84.78 | 71.67 | 83.75 | 92.47 | 87.45 | 84.66 |
| Bagel | 77.98 | 89.30 | 83.32 | 85.03 | 70.40 | 81.94 | 81.52 | 87.93 | 82.18 |
| Bagel (CoT) | 84.66 | 88.85 | 86.10 | 85.64 | 75.36 | 84.33 | 82.71 | 88.07 | 84.46 |
| BLIP3o+SRUM | 83.78 | 90.22 | 86.57 | 85.10 | 74.52 | **85.44** | **93.88** | 86.52 | 85.75 |
| Bagel+SRUM | 83.10 | **92.90** | 88.69 | 88.47 | 78.52 | 84.23 | 86.92 | 89.57 | 86.55 |
| Bagel+SRUM (CoT) | 88.60 | **92.90** | **91.31** | **90.48** | **80.12** | 84.47 | 89.93 | 89.15 | **88.37** |
Table 1: Comprehensive results on T2I-CompBench. Models with SRUM show significant improvements; bold marks the best score in each column, and the +SRUM rows are our models.

Empirical Study and Deeper Insights

For a deeper analysis, we compare three variants of the Bagel model: the Base Model, an SFT Model (fine-tuned on its own generated data), and our SRUM Model.

Ablation Studies: What Matters Most?

Our ablation studies confirm that each component of SRUM is critical. As shown in Figure 2 below, removing any component degrades performance. Omitting the global reward led to a notable performance drop, underscoring its importance for capturing holistic compositional structure. Similarly, a simple binarized reward signal was significantly less effective than our fine-grained rewards. The KL constraint proved crucial for training stability and preventing reward hacking, a finding consistent with methods like Direct Preference Optimization (DPO). We also analyze the effect of different constraint ratios on the experimental outcomes: across both Bagel configurations (with and without CoT), the results consistently indicate that $\lambda_{\text{c}} = 0.5$ is the most effective choice.

Figure 2: Results of the ablation study. $\Delta$ Acc. shows the degradation relative to the complete SRUM method; 0-1 Reward denotes the binarized reward.
Figure 3: Hyperparameter evaluation on T2I-CompBench. We report accuracy for different values of $\lambda_{\text{c}}$ under two inference modes: with and without CoT.

Analysis of the Inference Process

To get a more granular view, we scored the model's output at each step of the inference process for both "layout" and "detail" quality. The analysis revealed that the Chain-of-Thought ("think") mode and our global reward primarily refine the overall layout in the early stages of inference. In contrast, improvements in fine-grained details, driven by our local rewards, emerge in the later stages. This demonstrates that our dual-reward approach effectively teaches the model to optimize both structure and fidelity.

Figure 4: Layout scores (top) and Detail scores (bottom) at different inference steps. Layout improves early, while detail refinement occurs later.

Impact on Understanding Capabilities

SRUM enhances generative skills with minimal impact on the model's core understanding capabilities. Evaluations on common VLM benchmarks show only marginal fluctuations compared to the base model. Furthermore, by analyzing functional cluster activations, we found that SFT tends to specialize by suppressing irrelevant capabilities. In contrast, SRUM enhances the primary task-relevant functions while maintaining others, promoting more robust and generalizable representations.

Figure 5: Functional cluster activations. SRUM enhances and orchestrates clusters, unlike SFT which simply suppresses them.
| Benchmark | Base | SFT | SRUM |
|---|---|---|---|
| MME-Perception | 1687 | 1682 | 1673 |
| MME-Cognition | 701 | 683 | 677 |
| MMBench | 85.0 | 84.6 | 84.8 |
| MM-Vet | 67.2 | 66.5 | 67.0 |
| MMMU | 55.3 | 55.0 | 55.2 |
| MathVista | 73.1 | 72.8 | 73.0 |
| MMVP | 69.3 | 68.7 | 70.0 |
Table 2: Results on understanding benchmarks show SRUM preserves core VLM capabilities.

Generalization Capabilities

A key strength of SRUM is its robust generalization to new tasks and domains, proving it learns underlying reasoning skills rather than memorizing training data.

In-Domain Generalization: When trained only on T2I-CompBench, our model shows strong transferable skills on the GenEval benchmark. It achieves the highest scores in key compositional areas like Counting and Color Attribute Binding, outperforming both the Base and SFT models.

| Model | Single obj. | Two obj. | Counting | Colors | Position | Color attr. |
|---|---|---|---|---|---|---|
| Bagel | 0.99 | 0.94 | 0.81 | 0.88 | 0.64 | 0.82 |
| Bagel+SFT | 0.96 | 0.94 | 0.79 | 0.92 | 0.59 | 0.78 |
| Bagel+SRUM | 0.98 | 0.94 | 0.83 | 0.90 | 0.64 | 0.83 |
Table 3: In-domain generalization results on GenEval show SRUM's superior compositional skills.

Knowledge-based Generalization: On the WISE benchmark, we found that training SRUM on just one category of reasoning prompts universally enhances performance on the other unseen categories, demonstrating effective knowledge transfer.

Figure 6: Results on the WISE benchmark show strong knowledge transfer.

Out-of-Domain Generalization: To test generalization to truly unseen domains, we evaluated our model on T2I-ReasonBench. SRUM achieves a superior understanding of complex instructions compared to both the SFT and Base models, confirming that it improves generalization on complex problems from both a data and an algorithmic perspective.

| Model | Entity | Idiom | Scientific | Textual | Overall |
|---|---|---|---|---|---|
| Bagel | 49.70 | 34.46 | 47.52 | 43.59 | 43.82 |
| Bagel+SFT | 50.53 | 39.43 | 47.45 | 44.08 | 45.37 |
| Bagel+SRUM | 52.85 | 40.51 | 47.83 | 45.83 | 46.75 |
Table 4: Out-of-domain results on T2I-ReasonBench. SRUM consistently outperforms baselines in reasoning accuracy.

Discussion

SRUM is a preliminary exploration of self-rewarding for Unified Multimodal Models (UMMs). There is still room to improve the prompts used by the understanding module during the scoring phase, and we hope to scale the method to larger datasets. This work also uses some external prompts to improve performance for illustrative purposes; in principle, the understanding module could instead self-play questions and answers to build a fully closed-loop training system.

Conclusion

This paper introduces SRUM, a fine-grained post-training framework that enables a model's understanding module to reward its generative module. Additionally, SRUM decomposes the reward into local and global components, facilitating multi-scale alignment and refinement. Extensive experiments validate SRUM's effectiveness, setting new state-of-the-art results on complex compositional and reasoning benchmarks such as T2I-CompBench and T2I-ReasonBench. The framework demonstrates robust in-domain and out-of-domain generalization, and our empirical analysis confirms the efficacy of the fine-grained reward design. These findings illuminate the synergistic development of understanding and generation capabilities within a single model and establish the principle of self-reward as a promising direction for future research.

BibTeX

@article{jin2025srum,
  title={SRUM: Fine-Grained Self-Rewarding for Unified Multimodal Models},
  author={Jin, Weiyang and Niu, Yuwei and Liao, Jiaqi and Duan, Chengqi and Li, Aoxue and Gao, Shenghua and Liu, Xihui},
  journal={arXiv preprint arXiv:2510.12784},
  year={2025}
}