From Grounded Thinking to Agentic AI: A Technical Journey
by Starc Institute. Follow us on X (Twitter): @Starc_institute
Recent advances in artificial intelligence have shown that multimodal models—systems that can process both text and images—are becoming increasingly sophisticated. However, a fundamental question remains: Can these models truly "think" with images, or are they simply describing what they see?
This blog explores six groundbreaking research papers published in late 2025 that collectively address this question. These papers represent a paradigm shift from simple image understanding to genuine visual reasoning, where models actively interact with images, generate visual content, and use multiple tools to solve complex problems.
Title: GRIT: Teaching MLLMs to Think with Images
Authors: Yue Fan, Xuehai He, et al.
Traditional multimodal models primarily use images as input, generating only text responses. This raises a critical question: Are these models truly performing visual reasoning, or are they merely describing images verbally?
GRIT introduces grounded reasoning, where the model's thought process is anchored in visual evidence through explicit bounding boxes. Instead of purely textual reasoning, the model interleaves natural-language thinking steps with bounding-box coordinates that point to the image regions it is reasoning about, so each step of the chain can be checked against the referenced evidence.
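To make this concrete, here is a minimal sketch of how such a grounded trace could be parsed and its referenced regions cropped for inspection. The `<box>x1, y1, x2, y2</box>` tag format and pixel-coordinate convention are illustrative assumptions, not GRIT's exact output schema; Pillow is used only for the cropping step.

```python
import re
from PIL import Image

# Illustrative only: the tag format and (x1, y1, x2, y2) pixel convention
# are assumptions for this sketch, not GRIT's exact output schema.
BOX_PATTERN = re.compile(r"<box>\s*(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\s*</box>")

def parse_grounded_trace(trace: str):
    """Split a grounded reasoning trace into text and bounding boxes."""
    boxes = [tuple(map(int, m.groups())) for m in BOX_PATTERN.finditer(trace)]
    text_only = BOX_PATTERN.sub("[REGION]", trace)
    return text_only, boxes

def crop_evidence(image: Image.Image, boxes):
    """Crop each referenced region so the visual evidence can be inspected."""
    return [image.crop(box) for box in boxes]

trace = ("The sign <box>40, 12, 180, 90</box> shows a speed limit, "
         "and the car <box>210, 130, 420, 300</box> is already past it.")
text, boxes = parse_grounded_trace(trace)
print(text)   # reasoning with regions replaced by [REGION]
print(boxes)  # [(40, 12, 180, 90), (210, 130, 420, 300)]
```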
Rather than manual annotation, GRIT uses GPT-4V to generate training data by:
GRIT achieves strong performance with just 20K grounded reasoning examples, demonstrating that targeted, high-quality data can be more effective than massive datasets.
On spatial reasoning benchmarks, GRIT-enhanced models show substantial improvements:
GRIT establishes that multimodal reasoning should be grounded in visual evidence, not just textual descriptions. This principle becomes foundational for all subsequent work in this area.
Title: Video models are zero-shot learners and reasoners
Authors: Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, et al. (Google DeepMind)
This paper proposes a radical idea: video generation models (like Veo 3) can function as zero-shot visual reasoners without any task-specific training.
Instead of generating text descriptions, the authors use the model's ability to generate video sequences to demonstrate understanding. This allows evaluation across four capability dimensions: perception, modeling, manipulation, and reasoning.
Rather than answering "What color is the car?" with text, the model generates a video showing the car's color through visual transformation.
Given initial frames, the model generates plausible continuations, demonstrating understanding of physics and dynamics.
For problems requiring inference (e.g., "Which way will the object fall?"), the model generates video sequences showing the outcome.
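A rough sketch of this evaluation protocol is shown below: a question is posed as a video-generation prompt, and the outcome is read off the generated frames. `generate_video` is a hypothetical stand-in for the underlying video model API (Veo 3 in the paper), and how the answer is extracted from the frames is task-specific.

```python
from typing import Callable, List

# `generate_video` is a hypothetical placeholder wrapping a video generation
# model; it takes a prompt plus an initial frame and returns generated frames.
Frame = bytes
GenerateVideo = Callable[[str, Frame], List[Frame]]

def reason_with_video(generate_video: GenerateVideo,
                      initial_frame: Frame,
                      question: str) -> List[Frame]:
    """Zero-shot visual reasoning: pose the question as a generation prompt
    and let the model 'answer' by showing the outcome across frames."""
    prompt = (f"Starting from this scene, show what happens next. "
              f"Question to resolve visually: {question}")
    frames = generate_video(prompt, initial_frame)
    # The final frames typically carry the outcome; reading off the answer
    # (human judge, VQA model, simple classifier) depends on the task.
    return frames[-3:]
```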
Video models trained solely on generation naturally acquire reasoning abilities. This suggests video generation is a more general form of visual understanding than static image analysis.
Title: ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Authors: Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, et al.
How should text and images be combined during reasoning? Should images simply illustrate textual reasoning, or should they provide distinct, complementary information?
ThinkMorph establishes that effective multimodal reasoning requires:
ThinkMorph introduces a systematic approach to creating high-quality training data:
Human-designed prompts → GPT-4 generates text-only reasoning chains
GPT-4V identifies where visual reasoning would be beneficial and inserts image generation prompts at strategic points
DALL-E 3 generates images that complement textual reasoning
Models are trained incrementally, and data that improves performance is retained for the next iteration
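The retention step above amounts to a simple data-evolution loop, sketched below. The helper functions (`generate_interleaved_batch`, `train_model`, `evaluate`) are hypothetical placeholders for the actual generation pipeline, training run, and validation benchmark, not ThinkMorph's code.

```python
# A minimal sketch of iterative data evolution: candidate interleaved examples
# are kept only if adding them improves validation performance.
# All helpers are hypothetical placeholders for this illustration.

def evolve_dataset(seed_data, generate_interleaved_batch, train_model, evaluate,
                   rounds: int = 3):
    kept = list(seed_data)
    best_score = evaluate(train_model(kept))
    for _ in range(rounds):
        # text CoT -> insert visual steps -> render images
        candidates = generate_interleaved_batch(kept)
        trial = kept + candidates
        score = evaluate(train_model(trial))
        if score > best_score:          # retain new data only when it helps
            kept, best_score = trial, score
    return kept, best_score
```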
When trained on appropriately designed interleaved data, models exhibit:
ThinkMorph models show superior performance on:
Title: V-Thinker: Interactive Thinking with Images
Authors: Runqi Qiao, Qiuna Tan, Minghan Yang, et al.
Previous approaches generated images as part of reasoning chains, but couldn't interact with them. V-Thinker introduces fully interactive visual thinking through end-to-end reinforcement learning.
V-Thinker enables models to:
Unlike supervised learning, V-Thinker uses Direct Preference Optimization (DPO) where:
Supervised learning can teach when to generate images, but RL teaches how to use them effectively. The model learns through trial and error which visual representations lead to correct reasoning.
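As an illustration of what such a trial-and-error signal might look like, the reward below combines answer correctness with a small bonus for well-formed visual actions. The weights and the action format are assumptions for the sketch, not V-Thinker's published reward definition.

```python
# Illustrative reward shaping for RL over interactive visual reasoning.
# The weights and the notion of a "well-formed visual action" are assumptions
# for this sketch, not V-Thinker's actual recipe.

def reward(predicted_answer: str,
           gold_answer: str,
           visual_actions: list[dict]) -> float:
    correct = float(predicted_answer.strip() == gold_answer.strip())
    # Small bonus for syntactically valid visual actions, so early training
    # receives some signal even before answers become correct.
    well_formed = sum(1.0 for a in visual_actions
                      if {"action", "target"} <= a.keys())
    format_bonus = 0.1 * min(well_formed, 3) / 3
    return correct + format_bonus
```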
V-Thinker uses a carefully designed progression:
The paper introduces VSTaR (Visual Self-Teaching through Reinforcement), evaluating:
Title: Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
Authors: Jingqi Tong, Yurong Mou, Hangcheng Li, et al. (Fudan University, OpenMOSS)
This paper synthesizes previous insights into a single paradigm: video generation as the natural medium for multimodal reasoning.
Just as chain-of-thought breaks text reasoning into steps, chain-of-frames breaks visual reasoning into temporal sequences:
Example: "What will happen if this ball rolls down the ramp?"
Example: "Solve: 3x + 5 = 14"
Video generation provides a single paradigm that naturally handles both purely visual reasoning and text-heavy reasoning with visual aids. This eliminates the need for task-specific architectures.
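One way to picture the parallel is to treat each reasoning step as a frame description, as in the sketch below. The two sequences mirror the examples above; the frame descriptions are purely illustrative, and a real system would render them with a video generation model rather than store them as text.

```python
from dataclasses import dataclass
from typing import List

# Chain-of-frames as a data structure: each reasoning step becomes a frame.
# Frame descriptions here are illustrative placeholders for generated frames.

@dataclass
class Frame:
    step: int
    description: str

def chain_of_frames(problem: str, steps: List[str]) -> List[Frame]:
    return [Frame(i, s) for i, s in enumerate([problem] + steps)]

algebra = chain_of_frames(
    "Solve: 3x + 5 = 14",
    ["Subtract 5 from both sides: 3x = 9",
     "Divide both sides by 3: x = 3"],
)
physics = chain_of_frames(
    "A ball is released at the top of a ramp",
    ["The ball accelerates down the slope",
     "It leaves the ramp and comes to rest on the floor"],
)
```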
The paper introduces comprehensive benchmarks covering:
Title: DeepEyesV2: Toward Agentic Multimodal Model
Authors: Jack Hong, Chenxiao Zhao, ChengLin Zhu, et al. (Xiaohongshu Inc.)
DeepEyesV2 represents the culmination of this research trajectory: a system that autonomously integrates multiple tools within its reasoning process.
Unlike previous systems that use predefined reasoning patterns, DeepEyesV2:
How do you train a model to use multiple tools effectively without providing exhaustive examples of every possible combination?
The paper reveals that direct reinforcement learning fails for multi-tool systems. Models must first learn basic tool use through supervised fine-tuning (SFT) before RL can refine behaviors.
Why? Without initial guidance, RL explores randomly and never discovers effective tool-use patterns. The action space is too large, and meaningful rewards are too sparse.
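The resulting recipe is simply "SFT first, RL second," as in the schematic below. `sft_on_tool_demos` and `rl_refine` are hypothetical placeholders; the point is the ordering of the two stages, not any particular hyperparameters.

```python
# Two-stage training sketch: supervised cold-start before RL refinement.
# `sft_on_tool_demos` and `rl_refine` are hypothetical placeholders.

def train_agentic_model(base_model, tool_demos, rl_env,
                        sft_on_tool_demos, rl_refine):
    # Stage 1: supervised fine-tuning on curated tool-use trajectories gives
    # the model a usable action format, so RL exploration is not blind.
    cold_start = sft_on_tool_demos(base_model, tool_demos)
    # Stage 2: RL then refines when and how to invoke tools, where a sparse
    # end-task reward alone would have been insufficient from scratch.
    return rl_refine(cold_start, rl_env)
```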
RL enables spontaneous behaviors not present in training data, such as:
The model learns to selectively invoke tools based on problem context, reflecting autonomous, agentic reasoning.
DeepEyesV2 demonstrates that true agentic behavior—autonomous tool selection, complex combinations, and adaptive reasoning—can emerge from properly designed training that combines supervised cold-start with reinforcement learning.
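To make the reasoning loop concrete, here is a schematic agentic loop in which the model may emit a code block or a search query mid-reasoning and sees the result before continuing. The tag format and the `model`, `run_python`, and `web_search` callables are hypothetical stand-ins, not DeepEyesV2's actual interface.

```python
import re

# Schematic agentic loop: the model interleaves thinking with tool calls.
# Tag formats and the `model`, `run_python`, `web_search` callables are
# hypothetical stand-ins for illustration.
TOOL_TAG = re.compile(r"<(code|search)>(.*?)</\1>", re.DOTALL)

def agentic_answer(model, run_python, web_search, question, max_steps=6):
    transcript = question
    for _ in range(max_steps):
        step = model(transcript)          # model continues the reasoning
        transcript += step
        call = TOOL_TAG.search(step)
        if call is None:                  # no tool requested: treat as final
            return transcript
        tool, payload = call.group(1), call.group(2).strip()
        result = run_python(payload) if tool == "code" else web_search(payload)
        transcript += f"\n<result>{result}</result>\n"
    return transcript
```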
The paper introduces MA-Eval (Multi-tool Agentic Evaluation) measuring:
Foundation: Grounded Reasoning (GRIT)
GRIT establishes the fundamental principle that reasoning should be grounded in visual evidence through explicit bounding boxes, proving this can be learned efficiently with minimal data.
→ Extension: From Images to Video (Veo 3)
The video models paper (Veo 3) demonstrates that video generation naturally extends grounded reasoning by adding temporal dynamics, showing zero-shot capabilities across perception, modeling, manipulation, and reasoning.
→ Refinement: Complementary Modalities (ThinkMorph)
ThinkMorph clarifies that text and images should be complementary (not redundant), introduces systematic data evolution, and identifies emergent properties from interleaved reasoning.
→ Interaction: End-to-End Learning (V-Thinker)
V-Thinker enables fully interactive thinking through end-to-end RL, introduces progressive training curriculum, and creates specialized benchmarks for evaluation.
→ Unification: Video as Paradigm (Thinking with Video)
Thinking with Video establishes video generation as a unified framework that naturally handles both vision-centric and text-centric reasoning, demonstrating chain-of-frames as the parallel to chain-of-thought.
→ Integration: Agentic Capabilities (DeepEyesV2)
DeepEyesV2 integrates multiple tools (code execution + web search) within reasoning loops, reveals the necessity of cold-start training, and demonstrates emergent agentic behaviors.
Recommended pipeline:
All papers demonstrate success with relatively small datasets (20-40K samples), emphasizing:
When designing multimodal systems, ensure text and visual reasoning provide different, complementary information rather than describing the same content in different modalities.
Traditional benchmarks measuring single capabilities are insufficient. New systems require:
This technical blog summarizes recently published research papers on multimodal reasoning. The blog is published by Starc Institute. For the latest research discussions, follow us on X (Twitter) at @Starc_institute.
All references, figures, and technical details are drawn directly from the cited papers. For complete technical specifications, experimental details, and additional results, please refer to the original papers.
Starc Institute
Last updated: November 2025