From Grounded Thinking to Agentic AI: A Technical Journey
by Starc Institute. Follow us on X (Twitter): @Starc_institute
Recent advances in artificial intelligence have shown that multimodal models—systems that can process both text and images—are becoming increasingly sophisticated. However, a fundamental question remains: Can these models truly "think" with images, or are they simply describing what they see?
This blog explores six groundbreaking research papers published in late 2025 that collectively address this question. These papers represent a paradigm shift from simple image understanding to genuine visual reasoning, where models actively interact with images, generate visual content, and use multiple tools to solve complex problems.
Title: GRIT: Teaching MLLMs to Think with Images
Authors: Yue Fan, Xuehai He, et al.
Traditional multimodal models primarily use images as input, generating only text responses. This raises a critical question: Are these models truly performing visual reasoning, or are they merely describing images verbally?
GRIT introduces grounded reasoning, where the model's thought process is anchored in visual evidence through explicit bounding boxes. Instead of purely textual reasoning, the model interleaves natural-language thinking steps with bounding-box coordinates that point to the image regions it is reasoning about, so each step of the chain can be checked against the referenced evidence.
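To make this concrete, here is a minimal sketch of how such a grounded trace could be parsed and its referenced regions cropped for inspection. The `<box>x1, y1, x2, y2</box>` tag format and pixel-coordinate convention are illustrative assumptions, not GRIT's exact output schema; Pillow is used only for the cropping step.

```python
import re
from PIL import Image

# Illustrative only: the tag format and (x1, y1, x2, y2) pixel convention
# are assumptions for this sketch, not GRIT's exact output schema.
BOX_PATTERN = re.compile(r"<box>\s*(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\s*</box>")

def parse_grounded_trace(trace: str):
    """Split a grounded reasoning trace into text and bounding boxes."""
    boxes = [tuple(map(int, m.groups())) for m in BOX_PATTERN.finditer(trace)]
    text_only = BOX_PATTERN.sub("[REGION]", trace)
    return text_only, boxes

def crop_evidence(image: Image.Image, boxes):
    """Crop each referenced region so the visual evidence can be inspected."""
    return [image.crop(box) for box in boxes]

trace = ("The sign <box>40, 12, 180, 90</box> shows a speed limit, "
         "and the car <box>210, 130, 420, 300</box> is already past it.")
text, boxes = parse_grounded_trace(trace)
print(text)   # reasoning with regions replaced by [REGION]
print(boxes)  # [(40, 12, 180, 90), (210, 130, 420, 300)]
```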
Rather than manual annotation, GRIT uses GPT-4V to generate training data by:
GRIT achieves strong performance with just 20K grounded reasoning examples, demonstrating that targeted, high-quality data can be more effective than massive datasets.
On spatial reasoning benchmarks, GRIT-enhanced models show substantial improvements:
GRIT establishes that multimodal reasoning should be grounded in visual evidence, not just textual descriptions. This principle becomes foundational for all subsequent work in this area.
Title: Video models are zero-shot learners and reasoners
Authors: Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, et al. (Google DeepMind)
This paper proposes a radical idea: video generation models (like Veo 3) can function as zero-shot visual reasoners without any task-specific training.
Instead of generating text descriptions, the authors use the model's ability to generate video sequences to demonstrate understanding. This allows evaluation across four capability dimensions: perception, modeling, manipulation, and reasoning.
Rather than answering "What color is the car?" with text, the model generates a video showing the car's color through visual transformation.
Given initial frames, the model generates plausible continuations, demonstrating understanding of physics and dynamics.
For problems requiring inference (e.g., "Which way will the object fall?"), the model generates video sequences showing the outcome.
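A rough sketch of this evaluation protocol is shown below: a question is posed as a video-generation prompt, and the outcome is read off the generated frames. `generate_video` is a hypothetical stand-in for the underlying video model API (Veo 3 in the paper), and how the answer is extracted from the frames is task-specific.

```python
from typing import Callable, List

# `generate_video` is a hypothetical placeholder wrapping a video generation
# model; it takes a prompt plus an initial frame and returns generated frames.
Frame = bytes
GenerateVideo = Callable[[str, Frame], List[Frame]]

def reason_with_video(generate_video: GenerateVideo,
                      initial_frame: Frame,
                      question: str) -> List[Frame]:
    """Zero-shot visual reasoning: pose the question as a generation prompt
    and let the model 'answer' by showing the outcome across frames."""
    prompt = (f"Starting from this scene, show what happens next. "
              f"Question to resolve visually: {question}")
    frames = generate_video(prompt, initial_frame)
    # The final frames typically carry the outcome; reading off the answer
    # (human judge, VQA model, simple classifier) depends on the task.
    return frames[-3:]
```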
Video models trained solely on generation naturally acquire reasoning abilities. This suggests video generation is a more general form of visual understanding than static image analysis.
Title: ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Authors: Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, et al.
How should text and images be combined during reasoning? Should images simply illustrate textual reasoning, or should they provide distinct, complementary information?
ThinkMorph establishes that effective multimodal reasoning requires:
ThinkMorph introduces a systematic approach to creating high-quality training data:
Human-designed prompts → GPT-4 generates text-only reasoning chains
GPT-4V identifies where visual reasoning would be beneficial and inserts image generation prompts at strategic points
DALL-E 3 generates images that complement textual reasoning
Models are trained incrementally, and data that improves performance is retained for the next iteration
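The retention step above amounts to a simple data-evolution loop, sketched below. The helper functions (`generate_interleaved_batch`, `train_model`, `evaluate`) are hypothetical placeholders for the actual generation pipeline, training run, and validation benchmark, not ThinkMorph's code.

```python
# A minimal sketch of iterative data evolution: candidate interleaved examples
# are kept only if adding them improves validation performance.
# All helpers are hypothetical placeholders for this illustration.

def evolve_dataset(seed_data, generate_interleaved_batch, train_model, evaluate,
                   rounds: int = 3):
    kept = list(seed_data)
    best_score = evaluate(train_model(kept))
    for _ in range(rounds):
        # text CoT -> insert visual steps -> render images
        candidates = generate_interleaved_batch(kept)
        trial = kept + candidates
        score = evaluate(train_model(trial))
        if score > best_score:          # retain new data only when it helps
            kept, best_score = trial, score
    return kept, best_score
```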
When trained on appropriately designed interleaved data, models exhibit:
ThinkMorph models show superior performance on:
Title: V-Thinker: Interactive Thinking with Images
Authors: Runqi Qiao, Qiuna Tan, Minghan Yang, et al.
Previous approaches generated images as part of reasoning chains, but couldn't interact with them. V-Thinker introduces fully interactive visual thinking through end-to-end reinforcement learning.
V-Thinker enables models to:
Unlike supervised learning, V-Thinker uses Direct Preference Optimization (DPO) where:
Supervised learning can teach when to generate images, but RL teaches how to use them effectively. The model learns through trial and error which visual representations lead to correct reasoning.
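As an illustration of what such a trial-and-error signal might look like, the reward below combines answer correctness with a small bonus for well-formed visual actions. The weights and the action format are assumptions for the sketch, not V-Thinker's published reward definition.

```python
# Illustrative reward shaping for RL over interactive visual reasoning.
# The weights and the notion of a "well-formed visual action" are assumptions
# for this sketch, not V-Thinker's actual recipe.

def reward(predicted_answer: str,
           gold_answer: str,
           visual_actions: list[dict]) -> float:
    correct = float(predicted_answer.strip() == gold_answer.strip())
    # Small bonus for syntactically valid visual actions, so early training
    # receives some signal even before answers become correct.
    well_formed = sum(1.0 for a in visual_actions
                      if {"action", "target"} <= a.keys())
    format_bonus = 0.1 * min(well_formed, 3) / 3
    return correct + format_bonus
```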
V-Thinker uses a carefully designed progression:
The paper introduces VSTaR (Visual Self-Teaching through Reinforcement), evaluating:
Title: Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
Authors: Jingqi Tong, Yurong Mou, Hangcheng Li, et al. (Fudan University, OpenMOSS)
This paper synthesizes previous insights into a single paradigm: video generation as the natural medium for multimodal reasoning.
Just as chain-of-thought breaks text reasoning into steps, chain-of-frames breaks visual reasoning into temporal sequences:
Example: "What will happen if this ball rolls down the ramp?"
Example: "Solve: 3x + 5 = 14"
Video generation provides a single paradigm that naturally handles both purely visual reasoning and text-heavy reasoning with visual aids. This eliminates the need for task-specific architectures.
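One way to picture the parallel is to treat each reasoning step as a frame description, as in the sketch below. The two sequences mirror the examples above; the frame descriptions are purely illustrative, and a real system would render them with a video generation model rather than store them as text.

```python
from dataclasses import dataclass
from typing import List

# Chain-of-frames as a data structure: each reasoning step becomes a frame.
# Frame descriptions here are illustrative placeholders for generated frames.

@dataclass
class Frame:
    step: int
    description: str

def chain_of_frames(problem: str, steps: List[str]) -> List[Frame]:
    return [Frame(i, s) for i, s in enumerate([problem] + steps)]

algebra = chain_of_frames(
    "Solve: 3x + 5 = 14",
    ["Subtract 5 from both sides: 3x = 9",
     "Divide both sides by 3: x = 3"],
)
physics = chain_of_frames(
    "A ball is released at the top of a ramp",
    ["The ball accelerates down the slope",
     "It leaves the ramp and comes to rest on the floor"],
)
```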
The paper introduces comprehensive benchmarks covering:
Title: DeepEyesV2: Toward Agentic Multimodal Model
Authors: Jack Hong, Chenxiao Zhao, ChengLin Zhu, et al. (Xiaohongshu Inc.)
DeepEyesV2 represents the culmination of this research trajectory: a system that autonomously integrates multiple tools within its reasoning process.
Unlike previous systems that use predefined reasoning patterns, DeepEyesV2:
How do you train a model to use multiple tools effectively without providing exhaustive examples of every possible combination?
The paper reveals that direct reinforcement learning fails for multi-tool systems. Models must first learn basic tool use through supervised fine-tuning (SFT) before RL can refine behaviors.
Why? Without initial guidance, RL explores randomly and never discovers effective tool-use patterns. The action space is too large, and meaningful rewards are too sparse.
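The resulting recipe is simply "SFT first, RL second," as in the schematic below. `sft_on_tool_demos` and `rl_refine` are hypothetical placeholders; the point is the ordering of the two stages, not any particular hyperparameters.

```python
# Two-stage training sketch: supervised cold-start before RL refinement.
# `sft_on_tool_demos` and `rl_refine` are hypothetical placeholders.

def train_agentic_model(base_model, tool_demos, rl_env,
                        sft_on_tool_demos, rl_refine):
    # Stage 1: supervised fine-tuning on curated tool-use trajectories gives
    # the model a usable action format, so RL exploration is not blind.
    cold_start = sft_on_tool_demos(base_model, tool_demos)
    # Stage 2: RL then refines when and how to invoke tools, where a sparse
    # end-task reward alone would have been insufficient from scratch.
    return rl_refine(cold_start, rl_env)
```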
RL enables spontaneous behaviors not present in training data, such as:
The model learns to selectively invoke tools based on problem context, reflecting autonomous, agentic reasoning.
DeepEyesV2 demonstrates that true agentic behavior—autonomous tool selection, complex combinations, and adaptive reasoning—can emerge from properly designed training that combines supervised cold-start with reinforcement learning.
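To make the reasoning loop concrete, here is a schematic agentic loop in which the model may emit a code block or a search query mid-reasoning and sees the result before continuing. The tag format and the `model`, `run_python`, and `web_search` callables are hypothetical stand-ins, not DeepEyesV2's actual interface.

```python
import re

# Schematic agentic loop: the model interleaves thinking with tool calls.
# Tag formats and the `model`, `run_python`, `web_search` callables are
# hypothetical stand-ins for illustration.
TOOL_TAG = re.compile(r"<(code|search)>(.*?)</\1>", re.DOTALL)

def agentic_answer(model, run_python, web_search, question, max_steps=6):
    transcript = question
    for _ in range(max_steps):
        step = model(transcript)          # model continues the reasoning
        transcript += step
        call = TOOL_TAG.search(step)
        if call is None:                  # no tool requested: treat as final
            return transcript
        tool, payload = call.group(1), call.group(2).strip()
        result = run_python(payload) if tool == "code" else web_search(payload)
        transcript += f"\n<result>{result}</result>\n"
    return transcript
```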
The paper introduces MA-Eval (Multi-tool Agentic Evaluation) measuring:
Foundation: Grounded Reasoning (GRIT)
GRIT establishes the fundamental principle that reasoning should be grounded in visual evidence through explicit bounding boxes, proving this can be learned efficiently with minimal data.
→ Extension: From Images to Video (Veo 3)
The video models paper (Veo 3) demonstrates that video generation naturally extends grounded reasoning by adding temporal dynamics, showing zero-shot capabilities across perception, modeling, manipulation, and reasoning.
→ Refinement: Complementary Modalities (ThinkMorph)
ThinkMorph clarifies that text and images should be complementary (not redundant), introduces systematic data evolution, and identifies emergent properties from interleaved reasoning.
→ Interaction: End-to-End Learning (V-Thinker)
V-Thinker enables fully interactive thinking through end-to-end RL, introduces progressive training curriculum, and creates specialized benchmarks for evaluation.
→ Unification: Video as Paradigm (Thinking with Video)
Thinking with Video establishes video generation as a unified framework that naturally handles both vision-centric and text-centric reasoning, demonstrating chain-of-frames as the parallel to chain-of-thought.
→ Integration: Agentic Capabilities (DeepEyesV2)
DeepEyesV2 integrates multiple tools (code execution + web search) within reasoning loops, reveals the necessity of cold-start training, and demonstrates emergent agentic behaviors.
Recommended pipeline:
All papers demonstrate success with relatively small datasets (20-40K samples), emphasizing:
When designing multimodal systems, ensure text and visual reasoning provide different, complementary information rather than describing the same content in different modalities.
Traditional benchmarks measuring single capabilities are insufficient. New systems require:
This technical blog summarizes recently published research papers on multimodal reasoning. The blog is published by Starc Institute. For the latest research discussions, follow us on X (Twitter) at @Starc_institute.
All references, figures, and technical details are drawn directly from the cited papers. For complete technical specifications, experimental details, and additional results, please refer to the original papers.
Starc Institute
Last updated: November 2025