Evolution of Multimodal Reasoning

From Grounded Thinking to Agentic AI: A Technical Journey

by Starc Institute. Follow us on X (Twitter): @Starc_institute

Introduction

Recent advances in artificial intelligence have shown that multimodal models—systems that can process both text and images—are becoming increasingly sophisticated. However, a fundamental question remains: Can these models truly "think" with images, or are they simply describing what they see?

This blog explores six groundbreaking research papers published in 2025 that collectively address this question. These papers represent a paradigm shift from simple image understanding to genuine visual reasoning, where models actively interact with images, generate visual content, and use multiple tools to solve complex problems.

Key Evolution: The field has progressed from static image description to dynamic visual reasoning, culminating in agentic systems that autonomously select and combine multiple tools to solve complex multimodal problems.

1. GRIT: Teaching Models to Think with Images

Paper Information

Title: GRIT: Teaching MLLMs to Think with Images

Authors: Yue Fan, Xuehai He, et al.

Link: https://grounded-reasoning.github.io

The Core Problem

Traditional multimodal models primarily use images as input, generating only text responses. This raises a critical question: Are these models truly performing visual reasoning, or are they merely describing images verbally?

Key Innovation: Grounded Reasoning

GRIT introduces grounded reasoning, where the model's thought process is anchored in visual evidence through explicit bounding boxes. Instead of purely textual reasoning, the model:

  • Identifies relevant visual regions with bounding boxes
  • Performs reasoning steps that reference these regions
  • Produces verifiable, evidence-based conclusions
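To make the idea concrete, here is a minimal sketch of how a grounded reasoning trace might be represented and parsed. The inline <box>x1,y1,x2,y2</box> tag format and the parse_grounded_trace helper are our own illustration, not the paper's actual interface.

```python
import re
from dataclasses import dataclass

@dataclass
class GroundedStep:
    text: str    # one reasoning sentence
    boxes: list  # [(x1, y1, x2, y2), ...] referenced by that sentence

# Hypothetical trace format: reasoning text with inline <box>x1,y1,x2,y2</box> tags.
BOX_TAG = re.compile(r"<box>(\d+),(\d+),(\d+),(\d+)</box>")

def parse_grounded_trace(trace: str, image_w: int, image_h: int) -> list:
    """Split a grounded reasoning trace into steps and validate each referenced box."""
    steps = []
    for sentence in trace.split("\n"):
        boxes = []
        for match in BOX_TAG.finditer(sentence):
            x1, y1, x2, y2 = map(int, match.groups())
            # Keep only well-formed boxes that fall inside the image.
            if 0 <= x1 < x2 <= image_w and 0 <= y1 < y2 <= image_h:
                boxes.append((x1, y1, x2, y2))
        text = BOX_TAG.sub("", sentence).strip()
        if text:
            steps.append(GroundedStep(text=text, boxes=boxes))
    return steps

trace = ("The sign is mounted above the door <box>120,40,260,110</box>.\n"
         "Its arrow points left, so the exit is to the left <box>150,60,200,95</box>.")
for step in parse_grounded_trace(trace, image_w=640, image_h=480):
    print(step.text, step.boxes)
```

Because every reasoning step carries explicit coordinates, a verifier (or a human) can check whether the cited region actually supports the claim, which is what makes the conclusions evidence-based.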

Technical Approach

1. Three-Stage Training Pipeline

  • Stage 1 - Grounding: Learn to identify visual evidence (40K examples)
  • Stage 2 - Thinking: Develop reasoning with grounded evidence (20K examples)
  • Stage 3 - Integration: Combine both capabilities

2. Data Synthesis Strategy

Rather than manual annotation, GRIT uses GPT-4V to generate training data by:

  1. Solving problems independently
  2. Analyzing its own reasoning process
  3. Identifying which visual regions support each reasoning step
  4. Converting image coordinates to bounding boxes
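A rough sketch of such a synthesis loop is shown below. The query_vlm helper and the JSON-lines contract for returned regions are assumptions made for illustration; the paper's actual prompting pipeline may differ.

```python
import json

def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for a GPT-4V-style API call; returns the model's text response."""
    raise NotImplementedError("wire this to your VLM provider")

def synthesize_grounded_example(image_path: str, question: str) -> dict:
    # 1. Solve the problem independently.
    answer = query_vlm(image_path, f"Answer concisely: {question}")
    # 2-3. Ask the model to explain its reasoning and name the supporting regions,
    #      returning one JSON object per step: {"step": ..., "region": [x1, y1, x2, y2]}.
    raw = query_vlm(
        image_path,
        "Explain step by step how you reached the answer "
        f"'{answer}' to '{question}'. For each step, output one JSON line with the "
        "pixel bounding box of the region that supports it."
    )
    steps = [json.loads(line) for line in raw.splitlines() if line.strip().startswith("{")]
    # 4. Package everything as a grounded-reasoning training example.
    return {"image": image_path, "question": question, "answer": answer, "steps": steps}
```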

Critical Insight: Data Efficiency

GRIT achieves strong performance with just 20K grounded reasoning examples, demonstrating that targeted, high-quality data can be more effective than massive datasets.

Results and Validation

On spatial reasoning benchmarks, GRIT-enhanced models show substantial improvements:

  • VSR: 62.5% → 69.4% accuracy
  • BLINK: 47.3% → 52.4% accuracy
  • CV-Bench: Consistent gains across perception tasks

Why This Matters

GRIT establishes that multimodal reasoning should be grounded in visual evidence, not just textual descriptions. This principle becomes foundational for all subsequent work in this area.

2. Video Models as Zero-Shot Learners

Paper Information

Title: Video models are zero-shot learners and reasoners

Authors: Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, et al. (Google DeepMind)

Link: https://video-zero-shot.github.io

The Paradigm Shift

This paper proposes a radical idea: video generation models (like Veo 3) can function as zero-shot visual reasoners without any task-specific training.

Core Insight: Video as Output

Instead of generating text descriptions, use the model's ability to generate video sequences to demonstrate understanding. This allows evaluation across four capability dimensions:

  • Perception: Recognizing objects, actions, and relationships
  • Modeling: Predicting future states and dynamics
  • Manipulation: Transforming visual content
  • Reasoning: Drawing conclusions from visual evidence

Key Experiments

1. Visual Question Answering

Rather than answering "What color is the car?" with text, the model generates a video showing the car's color through visual transformation.

2. Future Prediction

Given initial frames, the model generates plausible continuations, demonstrating understanding of physics and dynamics.

3. Visual Reasoning

For problems requiring inference (e.g., "Which way will the object fall?"), the model generates video sequences showing the outcome.

Emergent Capabilities

Video models trained solely on generation naturally acquire reasoning abilities. This suggests video generation is a more general form of visual understanding than static image analysis.

Limitations and Challenges

  • Computational Cost: Video generation is significantly more expensive than text generation
  • Evaluation Ambiguity: Multiple valid videos can answer the same question
  • Fine-Grained Control: Difficult to specify exact reasoning paths

3. ThinkMorph: Emergent Properties in Interleaved Reasoning

Paper Information

Title: ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning

Authors: Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, et al.

Link: https://thinkmorph.github.io

The Key Question

How should text and images be combined during reasoning? Should images simply illustrate textual reasoning, or should they provide distinct, complementary information?

Core Principle: Complementarity

ThinkMorph establishes that effective multimodal reasoning requires:

  • Text: High-level conceptual reasoning and symbolic manipulation
  • Images: Spatial relationships, visual patterns, and intermediate visual states
  • Interleaving: Each modality addresses aspects the other cannot easily express

Data Evolution Process

ThinkMorph introduces a systematic approach to creating high-quality training data:

Stage 1: Seed Generation

Human-designed prompts → GPT-4 generates text-only reasoning chains

Stage 2: Visual Injection

GPT-4V identifies where visual reasoning would be beneficial and inserts image generation prompts at strategic points

Stage 3: Image Generation

DALL-E 3 generates images that complement textual reasoning

Stage 4: Filtering

Models are trained incrementally, and data that improves performance is retained for the next iteration
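The Stage 4 filtering step can be pictured as a greedy keep-what-helps loop. The sketch below is our reading of that idea, with the finetune and evaluate callables left abstract rather than taken from the paper.

```python
def evolve_dataset(candidate_batches, base_model, finetune, evaluate, val_set):
    """Greedy data-evolution loop (a sketch of the 'keep what helps' idea):
    fine-tune on each candidate batch and retain it only if validation accuracy improves."""
    kept, model = [], base_model
    best_acc = evaluate(model, val_set)
    for batch in candidate_batches:          # each batch: interleaved text+image reasoning traces
        trial_model = finetune(model, kept + [batch])
        acc = evaluate(trial_model, val_set)
        if acc > best_acc:                   # the batch helped: keep it and the updated model
            kept.append(batch)
            model, best_acc = trial_model, acc
    return kept, model
```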

Emergent Properties

When trained on appropriately designed interleaved data, models exhibit:

  • Adaptive Modality Selection: Choosing the right modality for each reasoning step
  • Visual Abstraction: Using images to represent abstract concepts geometrically
  • Error Correction: Using visual evidence to verify or correct textual reasoning

Experimental Results

ThinkMorph models show superior performance on:

  • Geometry: 12% improvement over text-only reasoning
  • Graph Theory: 18% improvement through visual graph representations
  • Complex Problem Solving: Sustained accuracy on multi-step problems

4. V-Thinker: Interactive Visual Thinking

Paper Information

Title: V-Thinker: Interactive Thinking with Images

Authors: Runqi Qiao, Qiuna Tan, Minghan Yang, et al.

Link: https://github.com/We-Math/V-Thinker

The Critical Distinction

Previous approaches generated images as part of reasoning chains, but couldn't interact with them. V-Thinker introduces fully interactive visual thinking through end-to-end reinforcement learning.

Interactive Visual Thinking

V-Thinker enables models to:

  • Generate images as intermediate reasoning steps
  • Observe and analyze the generated images
  • Adjust subsequent reasoning based on visual feedback
  • Iterate through multiple cycles of generation and analysis

Technical Architecture

1. Two-Module System

  • Thought Generator: Language model decides when and what images to generate
  • Image Generator: Stable Diffusion XL creates visual content from text prompts

2. End-to-End Reinforcement Learning

Unlike supervised learning, V-Thinker uses Direct Preference Optimization (DPO), where:

  • Reward: Correctness of final answer
  • Positive Examples: Trajectories leading to correct answers
  • Negative Examples: Trajectories leading to incorrect answers
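Assuming DPO-style training here means optimizing over preference pairs, a minimal sketch of how correct and incorrect trajectories could be paired per problem might look like this (the trajectory schema is ours, not V-Thinker's):

```python
import random

def build_preference_pairs(trajectories):
    """Group sampled reasoning trajectories by problem and pair correct ones (chosen)
    with incorrect ones (rejected), the input format DPO-style training expects.
    Each trajectory: {"problem_id": ..., "trace": ..., "correct": bool}."""
    by_problem = {}
    for t in trajectories:
        by_problem.setdefault(t["problem_id"], []).append(t)

    pairs = []
    for pid, group in by_problem.items():
        good = [t for t in group if t["correct"]]
        bad = [t for t in group if not t["correct"]]
        for chosen in good:
            if bad:
                pairs.append({
                    "problem_id": pid,
                    "chosen": chosen["trace"],                 # led to the right answer
                    "rejected": random.choice(bad)["trace"],   # led to a wrong answer
                })
    return pairs
```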

Why Reinforcement Learning?

Supervised learning can teach when to generate images, but RL teaches how to use them effectively. The model learns through trial and error which visual representations lead to correct reasoning.

Progressive Training Curriculum

V-Thinker uses a carefully designed progression:

  1. Stage 1: Simple visual reasoning (geometry basics)
  2. Stage 2: Multi-step problems (spatial transformations)
  3. Stage 3: Complex integration (combining multiple visual concepts)
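A minimal sketch of how such a curriculum might be scheduled during training; the stage boundaries and dataset names are illustrative, not taken from the paper:

```python
import random

def curriculum_stage(step: int, total_steps: int) -> str:
    """Map a global training step to a curriculum stage (boundaries are illustrative)."""
    frac = step / max(total_steps, 1)
    if frac < 0.3:
        return "stage1_basic_geometry"
    if frac < 0.7:
        return "stage2_spatial_transforms"
    return "stage3_multi_concept"

def sample_batch(datasets: dict, step: int, total_steps: int, batch_size: int = 16) -> list:
    """Draw a batch from the dataset pool matching the current curriculum stage."""
    pool = datasets[curriculum_stage(step, total_steps)]
    return random.sample(pool, k=min(batch_size, len(pool)))
```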

Benchmark: VSTaR

The paper introduces VSTaR (Visual Self-Teaching through Reinforcement), evaluating:

  • Visual Necessity: Can the problem be solved without images?
  • Reasoning Depth: Number of visual steps required
  • Feedback Utilization: Does the model adjust based on generated images?

Results

  • Geometry: 23% improvement over text-only models
  • Multi-step Reasoning: 35% improvement on complex problems
  • Efficiency: Fewer reasoning steps to reach correct answers

5. Thinking with Video: A Unified Paradigm

Paper Information

Title: Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Authors: Jingqi Tong, Yurong Mou, Hangcheng Li, et al. (Fudan University, OpenMOSS)

Link: https://thinking-with-video.github.io

The Unifying Framework

This paper synthesizes previous insights into a single paradigm: video generation as the natural medium for multimodal reasoning.

Chain-of-Frames ≈ Chain-of-Thought

Just as chain-of-thought breaks text reasoning into steps, chain-of-frames breaks visual reasoning into temporal sequences:

  • Each frame represents an intermediate reasoning state
  • Frame transitions show reasoning progress
  • Final frames represent conclusions
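As a small illustration of the analogy, both paradigms can be viewed as an ordered sequence of intermediate states; the data structures below are our own shorthand, not an interface from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class ChainOfThought:
    steps: list = field(default_factory=list)   # textual intermediate reasoning steps
    answer: str = ""                            # final textual conclusion

@dataclass
class ChainOfFrames:
    frames: list = field(default_factory=list)  # image arrays: intermediate visual states
    answer_frame: int = -1                      # index of the frame encoding the conclusion

def to_reasoning_trace(chain) -> list:
    """Both paradigms reduce to an ordered sequence of intermediate states."""
    if isinstance(chain, ChainOfThought):
        return chain.steps + [chain.answer]
    return list(chain.frames)
```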

Vision-Centric vs. Text-Centric Reasoning

Vision-Centric Tasks

Example: "What will happen if this ball rolls down the ramp?"

  • Generate video showing the physical simulation
  • Visual representation is the reasoning process
  • Answer is implicit in the video content

Text-Centric Tasks

Example: "Solve: 3x + 5 = 14"

  • Generate video showing step-by-step algebraic manipulation
  • Visual representation supplements textual reasoning
  • Each frame shows mathematical transformations

Key Advantage: Universal Interface

Video generation provides a single paradigm that naturally handles both purely visual reasoning and text-heavy reasoning with visual aids. This eliminates the need for task-specific architectures.

Technical Implementation

1. Training Data Construction

  • 40K video sequences with reasoning annotations
  • Mix of vision-centric (physics, spatial) and text-centric (math, logic) tasks
  • Verifiable ground truth for objective evaluation

2. Model Architecture

  • Base: Diffusion-based video generator (CogVideoX)
  • Enhancement: Added reasoning-aware conditioning
  • Output: 4-8 frame sequences showing reasoning progression
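For intuition, a verifiable chain-of-frames training record might look roughly like the following; the field names and values are hypothetical, not the paper's schema:

```python
# Hypothetical schema for one chain-of-frames training record.
record = {
    "task_type": "physics",              # vision-centric: physics/spatial; text-centric: math/logic
    "prompt": "A ball is released at the top of a 30-degree ramp. Show what happens.",
    "conditioning_frames": ["frame_000.png"],                       # initial state given to the generator
    "target_frames": [f"frame_{i:03d}.png" for i in range(1, 8)],   # 4-8 frame reasoning progression
    "ground_truth": {"final_position": "bottom_of_ramp"},           # verifiable answer for scoring
}

def is_verifiable(rec: dict) -> bool:
    """A record supports objective evaluation only if it carries a checkable answer."""
    return bool(rec.get("ground_truth")) and len(rec.get("target_frames", [])) >= 4
```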

Experimental Validation

The paper introduces comprehensive benchmarks covering:

  • Physics Reasoning: Predicting motion and collisions
  • Spatial Reasoning: Object relationships and transformations
  • Mathematical Reasoning: Visual representation of equations
  • Logical Reasoning: Diagram-based problem solving

Results Across Modalities

  • Vision-Centric: 45% improvement over text-only approaches
  • Text-Centric: 12% improvement through visual scaffolding
  • Cross-Domain: Single model handles diverse reasoning types

6. DeepEyesV2: Toward Agentic Multimodal Intelligence

Paper Information

Title: DeepEyesV2: Toward Agentic Multimodal Model

Authors: Jack Hong, Chenxiao Zhao, ChengLin Zhu, et al. (Xiaohongshu Inc.)

The Final Frontier: Agentic AI

DeepEyesV2 represents the culmination of this research trajectory: a system that autonomously integrates multiple tools within its reasoning process.

What Makes It "Agentic"?

Unlike previous systems that use predefined reasoning patterns, DeepEyesV2:

  • Autonomously decides when to use which tools
  • Combines tools in novel, unscripted ways
  • Adapts strategies based on intermediate results
  • Learns from experience which tool combinations work

Multi-Tool Integration

Available Tools

  1. Code Execution: Python interpreter for calculations, data analysis
  2. Web Search: Real-time information retrieval
  3. Image Operations: Cropping, filtering, transformations
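A minimal sketch of what such a tool layer could look like on the implementation side, with a registry that the model's tool calls dispatch into. The function names, the tool-call shape, the Pillow dependency, and the assumption of a python executable on PATH are all ours for illustration, not DeepEyesV2's actual interface.

```python
import subprocess

def run_python(code: str) -> str:
    """Execute a short Python snippet in a subprocess and return its output."""
    result = subprocess.run(["python", "-c", code], capture_output=True, text=True, timeout=30)
    return result.stdout or result.stderr

def web_search(query: str) -> str:
    raise NotImplementedError("plug in a search API here")

def crop_image(path: str, box: tuple) -> str:
    """Crop an image region and return the path to the cropped file (requires Pillow)."""
    from PIL import Image
    x1, y1, x2, y2 = box
    out = path.replace(".png", "_crop.png")
    Image.open(path).crop((x1, y1, x2, y2)).save(out)
    return out

TOOLS = {"code": run_python, "search": web_search, "crop": crop_image}

def dispatch(tool_call: dict) -> str:
    """tool_call as emitted by the model, e.g. {"tool": "code", "args": ["print(2 + 2)"]}."""
    return TOOLS[tool_call["tool"]](*tool_call["args"])
```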

Integration Challenge

How do you train a model to use multiple tools effectively without providing exhaustive examples of every possible combination?

Critical Discovery: Cold-Start Necessity

The paper reveals that direct reinforcement learning fails for multi-tool systems. Models must first learn basic tool use through supervised fine-tuning (SFT) before RL can refine behaviors.

Why? Without initial guidance, RL explores randomly and never discovers effective tool-use patterns. The action space is too large, and meaningful rewards are too sparse.

Two-Phase Training

Phase 1: Cold-Start SFT

  • Data: 40K expert trajectories showing correct tool usage
  • Goal: Teach the model that tools exist and how to invoke them
  • Coverage: Basic patterns for each tool individually

Phase 2: Reinforcement Learning

  • Reward: Simple correctness signal (binary: right/wrong answer)
  • Discovery: Model learns when and which tools to use
  • Emergence: Novel tool combinations not in training data
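Putting the two phases together, a simplified training driver with the binary correctness reward might look like this; the sft_trainer and rl_trainer interfaces are placeholders, not the paper's code:

```python
def correctness_reward(predicted_answer: str, gold_answer: str) -> float:
    """Binary reward used in the RL phase: 1.0 for a correct final answer, else 0.0."""
    return 1.0 if predicted_answer.strip().lower() == gold_answer.strip().lower() else 0.0

def train_agent(base_model, sft_trainer, rl_trainer, expert_trajectories, rl_tasks):
    # Phase 1: cold-start SFT on expert tool-use trajectories, so the model learns
    # that tools exist and how to invoke them before any RL exploration.
    model = sft_trainer(base_model, expert_trajectories)
    # Phase 2: RL with the sparse correctness reward; tool-selection strategies
    # and novel tool combinations are left for the policy to discover.
    model = rl_trainer(model, rl_tasks, reward_fn=correctness_reward)
    return model
```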

Emergent Agentic Behaviors

1. Task-Adaptive Tool Invocation

  • Perception tasks → Image operations (e.g., cropping)
  • Reasoning tasks → Code execution (e.g., numerical analysis)

2. Complex Tool Combinations

RL enables spontaneous behaviors not present in training data, such as:

  • Using web search to find current information
  • Executing code to analyze the retrieved data
  • Generating visualizations of the analysis

3. Context-Aware Decision Making

The model learns to selectively invoke tools based on problem context, reflecting autonomous, agentic reasoning.

Agentic Intelligence

DeepEyesV2 demonstrates that true agentic behavior—autonomous tool selection, complex combinations, and adaptive reasoning—can emerge from properly designed training that combines supervised cold-start with reinforcement learning.

Benchmark: MA-Eval

The paper introduces MA-Eval (Multi-tool Agentic Evaluation) measuring:

  • Tool Selection Accuracy: Does the model choose appropriate tools?
  • Combination Efficiency: Are tool combinations minimal and effective?
  • Error Recovery: Can the model adapt when initial tool use fails?
  • Final Performance: Overall task success rate

Results

  • Multi-Step Reasoning: 34% improvement over single-tool models
  • Complex Integration: Successfully combines 3+ tools in 68% of appropriate cases
  • Emergent Behaviors: 23% of successful strategies were not in training data

How These Papers Connect

The Research Trajectory

Foundation: Grounded Reasoning (GRIT)

GRIT establishes the fundamental principle that reasoning should be grounded in visual evidence through explicit bounding boxes, proving this can be learned efficiently with minimal data.

→ Extension: From Images to Video (Veo 3)

The video-models paper demonstrates that video generation naturally extends grounded reasoning by adding temporal dynamics, showing zero-shot capabilities across perception, modeling, manipulation, and reasoning.

→ Refinement: Complementary Modalities (ThinkMorph)

ThinkMorph clarifies that text and images should be complementary (not redundant), introduces systematic data evolution, and identifies emergent properties from interleaved reasoning.

→ Interaction: End-to-End Learning (V-Thinker)

V-Thinker enables fully interactive thinking through end-to-end RL, introduces progressive training curriculum, and creates specialized benchmarks for evaluation.

→ Unification: Video as Paradigm (Thinking with Video)

Establishes video generation as a unified framework that naturally handles both vision-centric and text-centric reasoning, demonstrating chain-of-frames as parallel to chain-of-thought.

→ Integration: Agentic Capabilities (DeepEyesV2)

DeepEyesV2 integrates multiple tools (code execution + web search) within reasoning loops, reveals the necessity of cold-start training, and demonstrates emergent agentic behaviors.

Common Themes

  • Data Efficiency: All approaches emphasize learning from limited high-quality data
  • Reinforcement Learning: Most use RL to refine tool use and reasoning behaviors
  • Emergent Properties: Complex behaviors emerge from relatively simple training setups
  • Tool Integration: Progressive movement toward integrated, multi-tool systems
  • Evaluation Innovation: Each introduces new benchmarks to measure novel capabilities

Key Takeaways for Practitioners

1. Choose the Right Paradigm

  • For static visual reasoning: Start with grounded reasoning (GRIT-style)
  • For dynamic processes: Consider video generation approaches
  • For complex tool integration: Build agentic systems (DeepEyesV2-style)

2. Training Strategy Matters

Critical Insight: Don't skip the cold-start phase! Direct RL without supervised initialization fails to produce reliable tool use.

Recommended pipeline:

  1. Cold-start SFT: Establish basic tool-use patterns with high-quality trajectories
  2. Reinforcement Learning: Refine and enhance behaviors with simple rewards

3. Data Quality Over Quantity

All papers demonstrate success with relatively small datasets (20-40K samples), emphasizing:

  • Appropriate difficulty (not too easy, not impossible)
  • Diversity across tasks and visual distributions
  • Verifiable formats for objective evaluation
  • Evidence that tool use improves performance

4. Complementary, Not Redundant

When designing multimodal systems, ensure text and visual reasoning provide different, complementary information rather than describing the same content in different modalities.

5. Evaluation Must Evolve

Traditional benchmarks measuring single capabilities are insufficient. New systems require:

  • Cross-capability integration tests
  • Real-world scenarios combining perception, search, and reasoning
  • Evaluation of tool-use effectiveness, not just final answers

Future Directions

Open Challenges

  • Generalization: Models trained with limited data still struggle with out-of-distribution scenarios
  • Tool Reliability: Preventing reward hacking and ensuring consistent, meaningful tool use
  • Computational Cost: Video generation and multi-tool reasoning are expensive
  • Safety and Alignment: Agentic models with tool access raise new safety concerns

Promising Research Directions

  • Better Rewards: Designing reward functions that encourage genuine reasoning without hacking
  • Efficiency: Reducing computational requirements for video-based reasoning
  • Tool Expansion: Integrating more diverse tools (e.g., 3D rendering, simulation)
  • Test-Time Scaling: Exploring self-consistency and ensemble methods in multimodal settings
  • Human-AI Collaboration: Designing interfaces for interactive multimodal reasoning

References

  1. GRIT: Teaching MLLMs to Think with Images
    Yue Fan, Xuehai He, et al.
    arXiv:2505.15879v1 [cs.CV] 21 May 2025
    https://grounded-reasoning.github.io
  2. Video models are zero-shot learners and reasoners
    Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, et al. (Google DeepMind)
    arXiv:2509.20328v2 [cs.LG] 29 Sep 2025
    https://video-zero-shot.github.io
  3. ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
    Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, et al.
    arXiv:2510.27492v2 [cs.CV] 4 Nov 2025
    https://thinkmorph.github.io
  4. V-Thinker: Interactive Thinking with Images
    Runqi Qiao, Qiuna Tan, Minghan Yang, et al.
    arXiv:2511.04460v1 [cs.CV] 6 Nov 2025
    https://github.com/We-Math/V-Thinker
  5. Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
    Jingqi Tong, Yurong Mou, Hangcheng Li, et al. (Fudan University, OpenMOSS)
    arXiv:2511.04570v1 [cs.CV] 6 Nov 2025
    https://thinking-with-video.github.io
  6. DeepEyesV2: Toward Agentic Multimodal Model
    Jack Hong, Chenxiao Zhao, ChengLin Zhu, et al. (Xiaohongshu Inc.)
    arXiv:2511.05271v2 [cs.CV] 10 Nov 2025

About This Blog

This technical blog summarizes recent research papers on multimodal reasoning and is published by Starc Institute. For the latest research discussions, follow us on X (Twitter) at @Starc_institute.

All references, figures, and technical details are drawn directly from the cited papers. For complete technical specifications, experimental details, and additional results, please refer to the original papers.

Starc Institute
Last updated: November 2025