| AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning | - | - | Audio / Video | 2025-11 |
| Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding | - | - | Video | 2025-11 |
| Video Spatial Reasoning with Object-Centric 3D Rollout | - | - | Video | 2025-11 |
| ViSS-R1: Self-Supervised Reinforcement Video Reasoning | - | - | Text / Video | 2025-11 |
| Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning | github | HuggingFace | Text / Video | 2025-10 |
| Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence | github | HuggingFace | Text / Video | 2025-10 |
| VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception | - | - | Text / Video | 2025-09 |
| MOSS-ChatV: Reinforcement Learning with Process Reasoning Reward for Video Temporal Reasoning | - | - | Text / Video | 2025-09 |
| ChronoForge-RL: Chronological Forging through Reinforcement Learning for Enhanced Video Understanding | - | - | Text / Video | 2025-09 |
| Kwai Keye-VL 1.5 Technical Report | - | - | Text / Video | 2025-09 |
| Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data | - | - | Text / Video | 2025-09 |
| Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding | - | - | Text / Video | 2025-08 |
| Ovis2.5 Technical Report | - | - | Text / Video | 2025-08 |
| ReasoningTrack: Chain-of-Thought Reasoning for Long-term Vision-Language Tracking | - | - | Text / Video | 2025-08 |
| TAR-TVG: Enhancing VLMs with Timestamp Anchor-Constrained Reasoning for Temporal Video Grounding | - | - | Text / Video | 2025-08 |
| Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning | - | - | Text / Video | 2025-08 |
| AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video | - | - | Audio / Video / Text | 2025-08 |
| ReasonAct: Progressive Training for Fine-Grained Video Reasoning in Small Models | - | - | Text / Video | 2025-08 |
| VideoForest: Person-Anchored Hierarchical Reasoning for Cross-Video Question Answering | - | - | Text / Video | 2025-08 |
| ARC-Hunyuan-Video-7B: Structured Video Comprehension of Real-World Shorts | - | - | Text / Video | 2025-07 |
| METER: Multi-modal Evidence-based Thinking and Explainable Reasoning -- Algorithm and Benchmark | - | - | Text / Video | 2025-07 |
| CoTasks: Chain-of-Thought based Video Instruction Tuning Tasks | - | - | Text / Video | 2025-07 |
| EmbRACE-3K: Embodied Reasoning and Action in Complex Environments | - | - | Text / Video | 2025-07 |
| Scaling RL to Long Videos | - | HuggingFace | Text / Video | 2025-07 |
| Video Event Reasoning and Prediction by Fusing World Knowledge from LLMs with Vision Foundation Models | - | - | Text / Video | 2025-07 |
| Kwai Keye-VL Technical Report | - | - | Text / Video | 2025-07 |
| EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent | - | - | Text / Video | 2025-07 |
| Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding | - | - | Text / Video | 2025-07 |
| ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models | - | - | Text / Video | 2025-07 |
| Video-RTS: Rethinking Reinforcement Learning and Test-Time Scaling for Efficient and Enhanced Video Reasoning | - | - | Text / Video | 2025-07 |
| Temporal Chain of Thought: Long-Video Understanding by Thinking in Frames | - | - | Text / Video | 2025-07 |
| VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning | - | - | Text / Video | 2025-06 |
| Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning | Ego-R1 | - | Text / Video | 2025-06 |
| DAVID-XR1: Detecting AI-Generated Videos with Explainable Reasoning | - | - | Text / Video | 2025-06 |
| VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks | - | HuggingFace | Text / Video | 2025-06 |
| MiMo-VL Technical Report | - | - | Text / Video | 2025-06 |
| Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning | - | - | Text / Video | 2025-06 |
| EgoVLM: Policy Optimization for Egocentric Video Understanding | - | HuggingFace | Text / Video | 2025-06 |
| Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency | - | - | Text / Video | 2025-06 |
| VideoCap-R1: Enhancing MLLMs for Video Captioning via Structured Thinking | - | - | Text / Video | 2025-06 |
| ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding | - | - | Text / Video | 2025-06 |
| ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding | - | - | Text / Video | 2025-06 |
| DIVE: Deep-search Iterative Video Exploration | - | - | Text / Video | 2025-06 |
| How Far Can Off-the-Shelf Multimodal Large Language Models Go in Online Episodic Memory Question Answering? | - | - | Text / Video | 2025-06 |
| VideoDeepResearch: Long Video Understanding With Agentic Tool Using | - | - | Text / Video | 2025-06 |
| Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency | - | - | Text / Video | 2025-06 |
| DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO | github | - | Text / Video | 2025-06 |
| Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought | - | Project Page | Text / Video | 2025-06 |
| VideoChat-A1: Thinking with Long Videos by Chain-of-Shot Reasoning | - | - | Text / Video | 2025-06 |
| Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning | - | - | Text / Video | 2025-06 |
| Reinforcing Video Reasoning with Focused Thinking | - | - | Text / Video | 2025-05 |
| A2Seek: Towards Reasoning-Centric Benchmark for Aerial Anomaly Understanding | - | - | Text / Video | 2025-05 |
| Omni-R1: Reinforcement Learning for Omnimodal Reasoning via Two-System Collaboration | - | - | Text / Video | 2025-05 |
| Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought | - | - | Text / Video | 2025-05 |
| VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization | - | - | Text / Video | 2025-05 |
| Fact-R1: Towards Explainable Video Misinformation Detection with Deep Reasoning | - | - | Text / Video | 2025-05 |
| Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning | - | - | Text / Video | 2025-05 |
| UniVG-R1: Reasoning Guided Universal Visual Grounding with Reinforcement Learning | - | HuggingFace | Text / Video | 2025-05 |
| VideoRFT: Incentivizing Video Reasoning Capability in MLLMs via Reinforced Fine-Tuning | - | HuggingFace | Text / Video | 2025-05 |
| Seed1.5-VL Technical Report | - | - | Text / Video | 2025-05 |
| TEMPURA: Temporal Event Masked Prediction and Understanding for Reasoning in Action | - | - | Text / Video | 2025-05 |
| Fostering Video Reasoning via Next-Event Prediction | - | HuggingFace | Text / Video | 2025-05 |
| SiLVR: A Simple Language-based Video Reasoning Framework | - | - | Text / Video | 2025-05 |
| VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? | - | HuggingFace | Text / Video | 2025-05 |
| RVTBench: A Benchmark for Visual Reasoning Tasks | - | HuggingFace | Text / Video | 2025-05 |
| CoT-Vid: Dynamic Chain-of-Thought Routing with Self-Verification for Training-Free Video Reasoning | - | - | Text / Video | 2025-05 |
| VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models | - | - | Text / Video | 2025-05 |
| Empowering Agentic Video Analytics Systems with Video Language Models | - | - | Text / Video | 2025-05 |
| TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning | - | - | Text / Video | 2025-04 |
| VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning | - | - | Text / Video | 2025-04 |
| Spatial-R1: Enhancing MLLMs in Video Spatial Reasoning | - | HuggingFace | Text / Video | 2025-04 |
| Improved Visual-Spatial Reasoning via R1-Zero-Like Training | - | - | Text / Video | 2025-04 |
| Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | - | - | Text / Video | 2025-04 |
| LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding | - | - | Text / Video | 2025-04 |
| From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models | - | - | Text / Video | 2025-04 |
| MR. Video: "MapReduce" is the Principle for Long Video Understanding | - | - | Text / Video | 2025-04 |
| Multimodal Long Video Modeling Based on Temporal Dynamic Context | - | - | Text / Video | 2025-04 |
| WikiVideo: Article Generation from Multiple Videos | - | - | Text / Video | 2025-04 |
| VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning | - | HuggingFace | Text / Video | 2025-04 |
| Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 | - | HuggingFace | Text / Video | 2025-03 |
| Video-R1: Reinforcing Video Reasoning in MLLMs | - | HuggingFace | Text / Video | 2025-03 |
| TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM | - | HuggingFace | Text / Video | 2025-03 |
| ST-Think: How Multimodal Large Language Models Reason About 4D Worlds from Ego-Centric Videos | - | - | Text / Video | 2025-03 |
| VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning | - | HuggingFace | Text / Video | 2025-03 |
| Aurelia: Test-time Reasoning Distillation in Audio-Visual LLMs | - | - | Audio / Video / Text | 2025-03 |
| video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model | - | - | Audio / Video / Text | 2025-02 |
| TheoremExplainAgent: Towards Multimodal Explanations for LLM Theorem Understanding | - | HuggingFace | Text / Video | 2025-02 |
| CoS: Chain-of-Shot Prompting for Long Video Understanding | - | - | Text / Video | 2025-02 |
| Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy | - | - | Text / Video | 2025-02 |
| Temporal Preference Optimization for Long-Form Video Understanding | - | - | Text / Video | 2025-01 |
| InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model | - | - | Text / Video | 2025-01 |
| MECD+: Unlocking Event-Level Causal Graph Discovery for Video Reasoning | - | - | Text / Video | 2025-01 |
| Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs | - | - | Text / Video | 2025-01 |
| Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition (ICML 2024 Oral) | - | - | Text / Video | 2025-01 |
| Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces | - | - | Text / Video | 2024-12 |
| Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling | - | HuggingFace | Text / Video | 2024-12 |
| STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training | - | - | Text / Video | 2024-12 |
| PruneVid: Visual Token Pruning for Efficient Video Large Language Models | - | - | Text / Video | 2024-12 |
| VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection | VideoEspresso | HuggingFace | Text / Video | 2024-11 |
| Adaptive Video Understanding Agent: Enhancing efficiency with dynamic frame sampling and feedback-driven reasoning | - | - | Text / Video | 2024-10 |
| VideoINSTA: Zero-shot Long Video Understanding via Informative Spatial-Temporal Reasoning with LLMs | - | - | Text / Video | 2024-09 |
| MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning | - | HuggingFace | Text / Video | 2024-09 |
| Veason-R1: Reinforcing Video Reasoning Segmentation to Think Before It Segments | - | - | Text / Video | 2024-07 |