| AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning | - | - | Audio / Video | 2025-11 |
| Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding | - | - | Video | 2025-11 |
| Video Spatial Reasoning with Object-Centric 3D Rollout | - | - | Video | 2025-11 |
| ViSS-R1: Self-Supervised Reinforcement Video Reasoning | - | - | Text / Video | 2025-11 |
| Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning | github | HuggingFace | Text / Video | 2025-10 |
| Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence | github | HuggingFace | Text / Video | 2025-10 |
| VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception | - | - | Text / Video | 2025-09 |
| MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning | - | - | Text / Video | 2025-09 |
| ChronoForge-RL: Chronological Forging through Reinforcement Learning for Enhanced Video Understanding | - | - | Text / Video | 2025-09 |
| Kwai Keye-VL 1.5 Technical Report | - | - | Text / Video | 2025-09 |
| Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data | - | - | Text / Video | 2025-09 |
| Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding | - | - | Text / Video | 2025-08 |
| Ovis2.5 Technical Report | - | - | Text / Video | 2025-08 |
| ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking | - | - | Text / Video | 2025-08 |
| TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding | - | - | Text / Video | 2025-08 |
| Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning | - | - | Text / Video | 2025-08 |
| AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video | - | - | Audio / Video / Text | 2025-08 |
| ReasonAct: Progressive Training for Fine-Grained Video Reasoning in Small Models | - | - | Text / Video | 2025-08 |
| VideoForest: Person-Anchored Hierarchical Reasoning for Cross-Video Question Answering | - | - | Text / Video | 2025-08 |
| ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts | - | - | Text / Video | 2025-07 |
| METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark | - | - | Text / Video | 2025-07 |
| CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks | - | - | Text / Video | 2025-07 |
| EmbRACE-3K: Embodied Reasoning and Action in Complex Environments | - | - | Text / Video | 2025-07 |
| Scaling RL to Long Videos | - | HuggingFace | Text / Video | 2025-07 |
| Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models | - | - | Text / Video | 2025-07 |
| Kwai Keye-VL Technical Report | - | - | Text / Video | 2025-07 |
| EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent | - | - | Text / Video | 2025-07 |
| Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding | - | - | Text / Video | 2025-07 |
| ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models | - | - | Text / Video | 2025-07 |
| Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning | - | - | Text / Video | 2025-07 |
| Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames | - | - | Text / Video | 2025-07 |
| VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning | - | - | Text / Video | 2025-06 |
| Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning | Ego-R1 | - | Text / Video | 2025-06 |
| DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning | - | - | Text / Video | 2025-06 |
| VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks | - | HuggingFace | Text / Video | 2025-06 |
| MiMo-VL Technical Report | - | - | Text / Video | 2025-06 |
| Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning | - | - | Text / Video | 2025-06 |
| EgoVLM: Policy Optimization for Egocentric Video Understanding | - | HuggingFace | Text / Video | 2025-06 |
| Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency | - | - | Text / Video | 2025-06 |
| VideoCap-R1: Enhancing MLLMs for Video Captioning via Structured Thinking | - | - | Text / Video | 2025-06 |
| ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding | - | - | Text / Video | 2025-06 |
| ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding | - | - | Text / Video | 2025-06 |
| DIVE: Deep-search Iterative Video Exploration | - | - | Text / Video | 2025-06 |
| How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? | - | - | Text / Video | 2025-06 |
| VideoDeepResearch: Long Video Understanding With Agentic Tool Using | - | - | Text / Video | 2025-06 |
| Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency | - | - | Text / Video | 2025-06 |
| DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO | github | - | Text / Video | 2025-06 |
| Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought | - | Project Page | Text / Video | 2025-06 |
| VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning | - | - | Text / Video | 2025-06 |
| Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning | - | - | Text / Video | 2025-06 |
| Reinforcing Video Reasoning with Focused Thinking | - | - | Text / Video | 2025-05 |
| A2Seek: Towards Reasoning-Centric Benchmark for Aerial Anomaly Understanding | - | - | Text / Video | 2025-05 |
| Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration | - | - | Text / Video | 2025-05 |
| Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought | - | - | Text / Video | 2025-05 |
| VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization | - | - | Text / Video | 2025-05 |
| Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning | - | - | Text / Video | 2025-05 |
| Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning | - | - | Text / Video | 2025-05 |
| UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning | - | HuggingFace | Text / Video | 2025-05 |
| VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning | - | HuggingFace | Text / Video | 2025-05 |
| Seed1.5-VL Technical Report | - | - | Text / Video | 2025-05 |
| TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action | - | - | Text / Video | 2025-05 |
| Fostering Video Reasoning via Next-Event Prediction | - | HuggingFace | Text / Video | 2025-05 |
| SiLVR: A Simple Language-based Video Reasoning Framework | - | - | Text / Video | 2025-05 |
| VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? | - | HuggingFace | Text / Video | 2025-05 |
| RVTBench: A Benchmark for Visual Reasoning Tasks | - | HuggingFace | Text / Video | 2025-05 |
| CoT-Vid: Dynamic Chain-of-Thought Routing with Self-Verification for Training-Free Video Reasoning | - | - | Text / Video | 2025-05 |
| VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models | - | - | Text / Video | 2025-05 |
| Empowering Agentic Video Analytics Systems with Video Language Models | - | - | Text / Video | 2025-05 |
| TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning | - | - | Text / Video | 2025-04 |
| VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning | - | - | Text / Video | 2025-04 |
| Spatial-R1: Enhancing MLLMs in Video Spatial Reasoning | - | HuggingFace | Text / Video | 2025-04 |
| Improved Visual-Spatial Reasoning via R1-Zero-Like Training | - | - | Text / Video | 2025-04 |
| Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | - | - | Text / Video | 2025-04 |
| LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding | - | - | Text / Video | 2025-04 |
| From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models | - | - | Text / Video | 2025-04 |
| MR. Video: "MapReduce" is the Principle for Long Video Understanding | - | - | Text / Video | 2025-04 |
| Multimodal Long Video Modeling Based on Temporal Dynamic Context | - | - | Text / Video | 2025-04 |
| WikiVideo: Article Generation from Multiple Videos | - | - | Text / Video | 2025-04 |
| VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning | - | HuggingFace | Text / Video | 2025-04 |
| Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 | - | HuggingFace | Text / Video | 2025-03 |
| Video-R1: Reinforcing Video Reasoning in MLLMs | - | HuggingFace | Text / Video | 2025-03 |
| TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM | - | HuggingFace | Text / Video | 2025-03 |
| ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos | - | - | Text / Video | 2025-03 |
| VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning | - | HuggingFace | Text / Video | 2025-03 |
| Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs | - | - | Audio / Video / Text | 2025-03 |
| video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model | - | - | Audio / Video / Text | 2025-02 |
| TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding | - | HuggingFace | Text / Video | 2025-02 |
| CoS: Chain-of-Shot Prompting for Long Video Understanding | - | - | Text / Video | 2025-02 |
| Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy | - | - | Text / Video | 2025-02 |
| Temporal Preference Optimization for Long-Form Video Understanding | - | - | Text / Video | 2025-01 |
| InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model | - | - | Text / Video | 2025-01 |
| MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning | - | - | Text / Video | 2025-01 |
| Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs | - | - | Text / Video | 2025-01 |
| Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition (ICML 2024 Oral) | - | - | Text / Video | 2025-01 |
| Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces | - | - | Text / Video | 2024-12 |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | - | HuggingFace | Text / Video | 2024-12 |
| STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training | - | - | Text / Video | 2024-12 |
| PruneVid: Visual Token Pruning for Efficient Video Large Language Models | - | - | Text / Video | 2024-12 |
| VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection | VideoEspresso | HuggingFace | Text / Video | 2024-11 |
| Adaptive Video Understanding Agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning | - | - | Text / Video | 2024-10 |
| VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs | - | - | Text / Video | 2024-09 |
| MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning | - | HuggingFace | Text / Video | 2024-09 |
| Veason-R1: Reinforcing Video Reasoning Segmentation to Think Before It Segments | - | - | Text / Video | 2024-07 |