
🔥 Omni large models and datasets for understanding and generating multi-modalities.

Awesome-Video-Reasoning-Landscape

The Landscape of Video Reasoning: Tasks, Paradigms and Benchmarks — An Open-Source Survey

Overview

This Awesome list systematically curates and tracks the latest progress in Video Reasoning, covering diverse modalities, tasks, and modeling paradigms. Rather than focusing on a single line of research, we organize the landscape from multiple complementary perspectives. Following the emerging taxonomy of the field, current works are grouped into four major paradigms:

  • πŸ—’οΈ CoT-based Video Reasoning β€” language-centric, chain-of-thought reasoning with Video-LMMs
  • πŸ•ΉοΈ CoF-based Video Reasoning β€” vision-centric reasoning grounded in world models or video generation
  • 🌈 Interleaved Video Reasoning β€” unified models that integrate multimodal interaction and iterative inference
  • πŸ” Streaming Video Reasoning β€” continuous, low-latency reasoning over long or unbounded video streams with online perception and incremental state updates.

We additionally maintain a dedicated Benchmark section that summarizes datasets, evaluation settings, and standardized tasks to support fair comparison across paradigms.

This repository aims to provide a structured, up-to-date, and open-source overview of the evolving landscape of video reasoning.
Contributions and PRs are warmly welcome; please add new entries in reverse chronological order (newest first) to keep the list fresh and easy to browse. A small helper sketch for checking this ordering follows.
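
If you want to sanity-check that a table stays newest-first before opening a PR, the small helper below is one way to do it. It is a hypothetical sketch, not part of this repository's tooling, and assumes one entry per line with the date written as the last YYYY-MM token.

```python
# Hypothetical maintenance helper: flags rows whose "YYYY-MM" date is
# newer than the row above them (i.e. breaks newest-first ordering).
import re

DATE = re.compile(r"\b(20\d{2})-(0[1-9]|1[0-2])\b")

def check_reverse_chronological(rows: list[str]) -> list[int]:
    """Return indices of rows that are newer than the preceding dated row."""
    out_of_order = []
    prev = None
    for i, row in enumerate(rows):
        matches = DATE.findall(row)
        if not matches:
            continue  # header or undated row; skip
        date = tuple(map(int, matches[-1]))  # last YYYY-MM token on the line
        if prev is not None and date > prev:
            out_of_order.append(i)
        prev = date
    return out_of_order

if __name__ == "__main__":
    rows = [
        "Paper A - - Text / Video 2025-11 arXiv",
        "Paper B - - Text / Video 2025-09 arXiv",
        "Paper C - - Text / Video 2025-10 arXiv",  # out of order
    ]
    print(check_reverse_chronological(rows))  # -> [2]
```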

Table of Contents

  • Awesome-Video-Reasoning-Landscape
    • 📑 Task Definition
    • 😎 Paradigms
      • 🗒️ CoT-based Video Reasoning
      • 🕹️ CoF-based Video Reasoning
      • 🌈 Interleaved Video Reasoning
      • 🔁 Streaming Video Reasoning
    • ✨️ Benchmarks
    • ✈ Related Survey
    • 🌟 Star History
    • ♥️ Contributors

📑 Task Definition

TBD

😎 Paradigms

πŸ•ΉοΈ CoT-based Video Reasoning

Title Model & Code Checkpoint Input Modalities Time Venue
AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning - - Audio / Video 2025-11 arXiv
Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding - - Video 2025-11 arXiv
Video Spatial Reasoning with Object-Centric 3D Rollout - - Video 2025-11 arXiv
ViSS-R1: Self-Supervised Reinforcement Video Reasoning - - Text / Video 2025-11 arXiv
Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning github HuggingFace Text / Video 2025-10 arXiv
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence github HuggingFace Text / Video 2025-10 arXiv
VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception - - Text / Video 2025-09 arXiv
MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning - - Text / Video 2025-09 arXiv
ChronoForge-RL: Chronological Forging through Reinforcement Learning for Enhanced Video Understanding - - Text / Video 2025-09 arXiv
Kwai Keye-VL 1.5 Technical Report - - Text / Video 2025-09 arXiv
Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data - - Text / Video 2025-09 arXiv
Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding - - Text / Video 2025-08 arXiv
Ovis2.5 Technical Report - - Text / Video 2025-08 arXiv
ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking - - Text / Video 2025-08 arXiv
TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding - - Text / Video 2025-08 arXiv
Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning - - Text / Video 2025-08 arXiv
AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video - - Audio / Video / Text 2025-08 arXiv
ReasonAct: Progressive Training for Fine-Grained Video Reasoning in Small Models - - Text / Video 2025-08 arXiv
VideoForest: Person-Anchored Hierarchical Reasoning for Cross-Video Question Answering - - Text / Video 2025-08 arXiv
ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts - - Text / Video 2025-07 arXiv
METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark - - Text / Video 2025-07 arXiv
CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks - - Text / Video 2025-07 arXiv
EmbRACE-3K: Embodied Reasoning and Action in Complex Environments - - Text / Video 2025-07 arXiv
Scaling RL to Long Videos - HuggingFace Text / Video 2025-07
Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models - - Text / Video 2025-07 arXiv
Kwai Keye-VL Technical Report - - Text / Video 2025-07 arXiv
EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent - - Text / Video 2025-07 arXiv
Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding - - Text / Video 2025-07 arXiv
ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models - - Text / Video 2025-07 arXiv
Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning - - Text / Video 2025-07
Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames - - Text / Video 2025-07 arXiv
VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning - - Text / Video 2025-06 arXiv
Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning Ego-R1 - Text / Video 2025-06 arXiv
DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning - - Text / Video 2025-06 arXiv
VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks - HuggingFace Text / Video 2025-06 arXiv
MiMo-VL Technical Report - - Text / Video 2025-06 arXiv
Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning - - Text / Video 2025-06
EgoVLM: Policy Optimization for Egocentric Video Understanding - HuggingFace Text / Video 2025-06 arXiv
Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency - - Text / Video 2025-06 arXiv
VideoCap-R1: Enhancing MLLMs for Video Captioning via Structured Thinking - - Text / Video 2025-06 arXiv
ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding - - Text / Video 2025-06
ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding - - Text / Video 2025-06 arXiv
DIVE: Deep-search Iterative Video Exploration - - Text / Video 2025-06 arXiv
How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? - - Text / Video 2025-06 arXiv
VideoDeepResearch: Long Video Understanding With Agentic Tool Using - - Text / Video 2025-06 arXiv
Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency - - Text / Video 2025-06 arXiv
DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO Github - Text / Video 2025-06 arXiv
Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought - Project Page Text / Video 2025-06
VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning - - Text / Video 2025-06 arXiv
Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning - - Text / Video 2025-06 arXiv
Reinforcing Video Reasoning with Focused Thinking - - Text / Video 2025-05 arXiv
A2Seek: Towards Reasoning-Centric Benchmark for Aerial Anomaly Understanding - - Text / Video 2025-05 arXiv
Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration - - Text / Video 2025-05 arXiv
Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought - - Text / Video 2025-05 arXiv
VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization - - Text / Video 2025-05 arXiv
Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning - - Text / Video 2025-05 arXiv
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning - - Text / Video 2025-05 arXiv
UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning - HuggingFace Text / Video 2025-05 arXiv
VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning - HuggingFace Text / Video 2025-05
Seed1.5-VL Technical Report - - Text / Video 2025-05 arXiv
TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action - - Text / Video 2025-05 arXiv
Fostering Video Reasoning via Next-Event Prediction - HuggingFace Text / Video 2025-05 arXiv
SiLVR: A Simple Language-based Video Reasoning Framework - - Text / Video 2025-05 arXiv
VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? - HuggingFace Text / Video 2025-05 arXiv
RVTBench: A Benchmark for Visual Reasoning Tasks - HuggingFace Text / Video 2025-05 arXiv
CoT-Vid: Dynamic Chain-of-Thought Routing with Self-Verification for Training-Free Video Reasoning - - Text / Video 2025-05
VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models - - Text / Video 2025-05
Empowering Agentic Video Analytics Systems with Video Language Models - - Text / Video 2025-05 arXiv
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning - - Text / Video 2025-04 arXiv
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning - - Text / Video 2025-04 arXiv
Spatial-R1: Enhancing MLLMs in Video Spatial Reasoning - HuggingFace Text / Video 2025-04 arXiv
Improved Visual-Spatial Reasoning via R1-Zero-Like Training - - Text / Video 2025-04 arXiv
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models - - Text / Video 2025-04 arXiv
LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding - - Text / Video 2025-04 arXiv
From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models - - Text / Video 2025-04 arXiv
MR. Video: "MapReduce" is the Principle for Long Video Understanding - - Text / Video 2025-04 arXiv
Multimodal Long Video Modeling Based on Temporal Dynamic Context - - Text / Video 2025-04 arXiv
WikiVideo: Article Generation from Multiple Videos - - Text / Video 2025-04 arXiv
VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning - HuggingFace Text / Video 2025-04
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 - HuggingFace Text / Video 2025-03 arXiv
Video-R1: Reinforcing Video Reasoning in MLLMs - HuggingFace Text / Video 2025-03 arXiv
TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM - HuggingFace Text / Video 2025-03 arXiv
ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos - - Text / Video 2025-03 arXiv
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning - HuggingFace Text / Video 2025-03 arXiv
Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs - - Audio / Video / Text 2025-03 arXiv
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model - - Audio / Video / Text 2025-02 arXiv
TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding - HuggingFace Text / Video 2025-02
CoS: Chain-of-Shot Prompting for Long Video Understanding - - Text / Video 2025-02 arXiv
Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy - - Text / Video 2025-02 arXiv
Temporal Preference Optimization for Long-Form Video Understanding - - Text / Video 2025-01 arXiv
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model - - Text / Video 2025-01
MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning - - Text / Video 2025-01 arXiv
Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs - - Text / Video 2025-01 arXiv
Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition - - Text / Video 2025-01 ICML 2024 Oral
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces - - Text / Video 2024-12 arXiv
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling - HuggingFace Text / Video 2024-12 arXiv
STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training - - Text / Video 2024-12 arXiv
PruneVid: Visual Token Pruning for Efficient Video Large Language Models - - Text / Video 2024-12 arXiv
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection VideoEspresso HuggingFace Text / Video 2024-11
Adaptive Video Understanding Agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning - - Text / Video 2024-10 arXiv
VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs - - Text / Video 2024-09 arXiv
MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning - HuggingFace Text / Video 2024-09
Veason-R1: Reinforcing Video Reasoning Segmentation to Think Before It Segments - - Text / Video 2024-07 arXiv

🕹️ CoF-based Video Reasoning

Title Code Checkpoint Time Venue
Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO GitHub Hugging_Face 2025-11
Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks GitHub Hugging_Face 2025-11 arXiv
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm GitHub - 2025-11
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark GitHub Hugging_Face 2025-10
Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning GitHub - 2025-06

🌈 Interleaved Video Reasoning

Title Code Checkpoint Time Venue
Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution - - 2025-11 arXiv
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation - - 2024-09 arXiv

πŸ” Streaming Video Reasoning

Title Code Checkpoint Time Venue
StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling - - 2025-07 arXiv
Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge - - 2025-01

✨️ Benchmarks

Name Paper Link Task Time Venue
V-ReasonBench V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models - CoF-based 2025-11
VR-Bench Reasoning via Video: The First Evaluation of Video Models' Reasoning Abilities through Maze-Solving Tasks GitHub, Hugging_Face CoF-based 2025-11 arXiv
Gen-ViRe Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark GitHub CoF-based 2025-11
TiViBench TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models - CoF-based 2025-11
VideoThinkBench Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm GitHub CoF-based 2025-11
MME-CoF Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark Hugging_Face CoF-based 2025-10
SciVideoBench SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models CoT-based 2025-10
ReasoningTrack ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking - CoT-based 2025-08 -
Long-RL Scaling RL to Long Videos Hugging_Face CoT-based 2025-07 -
METER METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark - CoT-based 2025-07 -
Video-TT Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding - CoT-based 2025-07 -
ImplicitQA ImplicitQA: Going beyond frames towards Implicit Video Reasoning Hugging_Face CoT-based 2025-06 -
Video-CoT Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought - CoT-based 2025-06 -
Implicit-VideoQA Looking Beyond Visible Cues: Implicit Video Question Answering via Dual-Clue Reasoning GitHub CoT-based 2025-06 -
MORSE-500 MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning Hugging_Face CoT-based 2025-06 -
SpookyBench Time Blindness: Why Video-Language Models Can't See What Humans Can - CoT-based 2025-05 -
Video-Holmes Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning? GitHub CoT-based 2025-05 -
VideoEval-Pro Paper - - 2025-05 -
Breaking Down Video LLM Benchmarks Paper - - 2025-05 -
RTV-Bench Paper GitHub, Hugging_Face Streaming 2025-05 -
MINERVA Paper GitHub - 2025-05 -
VCR-Bench Paper GitHub, Hugging_Face CoT-based 2025-04 -
SEED-Bench-R1 Paper GitHub, Hugging_Face CoT-based 2025-03 -
H2VU-Benchmark Paper - - 2025-03 -
OmniMMI Paper GitHub, Hugging_Face Streaming 2025-03 -
HAVEN Paper GitHub, Hugging_Face - 2025-03 -
V-STaR Paper GitHub, Hugging_Face - 2025-03 -
Reasoning is All You Need for Video Generalization Paper - - 2025-03 -
Towards Fine-Grained Video Question Answering Paper - - 2025-03 -
SVBench Paper - Streaming 2025-02 -
MMVU Paper GitHub, Hugging_Face CoT-based 2025-01 -
OVO-Bench Paper GitHub, Hugging_Face Streaming 2025-01 -
HLV-1K Paper - CoT-based 2025-01 -
Thinking in Space Paper - CoT-based 2024-12 -
3DSRBench Paper - CoT-based 2024-12 -
Black Swan Paper GitHub, Hugging_Face - 2024-12 -
TOMATO Paper - CoT-based 2024-10 -
OmnixR OmnixR: Evaluating Omni-modality Language Models on Reasoning across Modalities CoT-based 2024-10
TemporalBench Paper - CoT-based 2024-10 -
OmniBench OmniBench: Towards the Future of Universal Omni-Language Models CoT-based 2024-09
VideoVista VideoVista: A Versatile Benchmark for Video Understanding and Reasoning CoT-based 2024-06
SOK-Bench SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge CoT-based 2024-05
CVRR-ES How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs CoT-based 2024-05

✈ Related Survey

In addition, several recent and concurrent surveys have discussed multimodal or video reasoning. The works listed below offer complementary perspectives to ours, reflecting the field's rapid and parallel development:

🌟 Star History

Star History Chart

β™₯️ Contributors

Contributors for Awesome Video Reasoning Landscape