Awesome Visual Spatial Reasoning
Songsong Yu*1,2,
Yuxin Chen†*2,
Hao Ju*3,
Lianjie Jia*4,
Fuxi Zhang4,
Shaofei Huang3,
Yuhan Wu4,
Rundi Cui4,
Binghao Ran4,
Zaibin Zhang4,
Zhedong Zheng3,
Zhipeng Zhang1,
Yifan Wang4,
Lin Song2,
Lijun Wang4,
Yanwei Li✉5,
Ying Shan2,
Huchuan Lu4,
1SJTU,
2ARC Lab, Tencent PCG,
3UM,
4DLUT,
5CUHK
\* Equal Contributions
† Project Lead
✉ Corresponding Author
Dataset   |   Leaderboard   |   Survey   |   Code   |   arXiv
News and Updates
- [x] 25.9.23 - Released a preprint of our survey on visual spatial reasoning tasks.
- [x] 25.9.23 - Released comprehensive evaluation results for mainstream models on visual spatial reasoning.
- [x] 25.9.15 - Open-sourced the evaluation data for visual spatial reasoning tasks.
- [x] 25.9.15 - Open-sourced the evaluation toolkit.
- [x] 25.6.28 - Compiled the "Datasets" section.
- [x] 25.6.16 - The "Awesome Visual Spatial Reasoning" project is now live!
- [x] 25.6.12 - Surveyed the field and collected 100 relevant works.
- [x] 25.6.10 - We launched a survey project on visual spatial reasoning.
Open-source evaluation toolkit

Evaluation of SOTA Models on 23 Visual Spatial Reasoning Tasks.
Code Usage:
- `git clone https://github.com/song2yu/SIBench-VSR.git`
- Refer to the README.md for more details.
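A minimal end-to-end sketch of a typical run, assuming a standard Python environment; the dependency file and evaluation entry point below are placeholders, so follow the toolkit's README.md for the actual setup and commands.

```bash
# Clone the SIBench-VSR evaluation toolkit
git clone https://github.com/song2yu/SIBench-VSR.git
cd SIBench-VSR

# Install dependencies (hypothetical file name; see the repo's README.md)
pip install -r requirements.txt

# Launch an evaluation run (hypothetical entry point and flags; the actual
# script, model identifiers, and task names are defined in the repo)
python evaluate.py --model <model-name> --task <task-name>
```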
Contributing
We welcome contributions to this repository! If you would like to contribute, please follow these steps:
- Fork the repository.
- Create a new branch with your changes.
- Submit a pull request with a clear description of your changes.
You can also open an issue if you have anything to add or discuss.
Please feel free to contact us ([email protected]).
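For reference, here is a minimal command-line sketch of the fork-and-branch workflow above; the fork URL, repository name, and branch name are placeholders.

```bash
# Clone your fork (replace <your-username>; the repository name is illustrative)
git clone https://github.com/<your-username>/Awesome-Visual-Spatial-Reasoning.git
cd Awesome-Visual-Spatial-Reasoning

# Create a branch for your changes (placeholder branch name)
git checkout -b add-new-paper

# Commit, push, and then open a pull request on GitHub
git add README.md
git commit -m "Add a new visual spatial reasoning paper"
git push origin add-new-paper
```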
Overview
The research community is increasingly focused on the visual spatial reasoning (VSR) abilities of Vision-Language Models (VLMs). Yet, the field lacks a clear overview of its evolution and a standardized benchmark for evaluation. Current assessment methods are disparate and lack a common toolkit. This project aims to fill that void. We are developing a unified, comprehensive, and diverse evaluation toolkit, along with an accompanying survey paper. We are actively seeking collaboration and discussion with fellow experts to advance this initiative.
Task Explanation
Visual spatial understanding is a key task at the intersection of computer vision and cognitive science. It aims to enable intelligent agents (such as robots and AI systems) to parse spatial relationships in their environment from visual inputs (images, videos, etc.), forming an abstract cognition of the physical world. In embodied intelligence, it is the foundation of the "perception-decision-action" loop: only by understanding attributes such as object positions, distances, sizes, and orientations can an agent navigate environments, manipulate objects, or interact with humans.
Timeline

Citation
If you find this project useful, please consider citing:
@article{sibench2025,
  title={How Far are VLMs from True Visual Spatial Intelligence? A Benchmark-Driven Perspective},
  author={Songsong Yu and Yuxin Chen and Hao Ju and Lianjie Jia and Fuxi Zhang and Shaofei Huang and Yuhan Wu and Rundi Cui and Binghao Ran and Zaibin Zhang and Zhedong Zheng and Zhipeng Zhang and Yifan Wang and Lin Song and Lijun Wang and Yanwei Li and Ying Shan and Huchuan Lu},
  journal={arXiv preprint arXiv:2509.18905},
  year={2025}
}
Table of Contents
To help the community quickly navigate visual spatial reasoning, we first categorize works by input modality into Single Image, Monocular Video, and Multi-View Images. We also survey other input modalities, such as point clouds, and specific applications such as embodied robotics; these are temporarily grouped under "Others" and will be organized in more detail in the future.
- Evaluation Results
- Papers
- Single Image
- Monocular Video
- Multi-View Images
- Others
- Datasets
Papers
Single Image
| Title | Venue | Date | Code | Stars | Benchmark | Illustration |
| --- | --- | --- | --- | --- | --- | --- |
| R2D3: Imparting Spatial Reasoning by Reconstructing 3D Scenes from 2D Images | ARXIV | -- | -- | -- | R2D3 | |
| RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics | ARXIV | 25-06 | link | -- | RefSpatial-Bench | |
| Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces | ARXIV | 25-06 | link | | VeBrain-600k | |
| SVQA-R1: Reinforcing Spatial Reasoning in MLLMs via View-Consistent Reward Optimization | ARXIV | 25-06 | -- | -- | SVQA-R1 | |
| OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models | -- | 25-06 | link | | OmniSpatial | |
| Can Multimodal Large Language Models Understand Spatial Relations | ARXIV | 25-05 | link | | SpatialMQA | |
| SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning | -- | 25-05 | -- | -- | SSR-CoT | |
| Spatial-LLaVA: Enhancing Large Language Models with Spatial Referring Expressions for Visual Understanding | ARXIV | 25-05 | link | -- | SUNSPOT | |
| Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning? | ARXIV | 25-05 | link | -- | OSR-Bench | |
| SITE: Towards Spatial Intelligence Thorough Evaluation | ARXIV | 25-05 | link | | SITE | |
| Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning | ARXIV | 25-05 | -- | -- | TallyQA, V*, InfographicVQA, MVBench | |
| Improved Visual-Spatial Reasoning via R1-Zero-Like Training | ARXIV | 25-04 | link | | VSI-100K | |
| Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation | ARXIV | 25-04 | link | | COMFORT++, 3DSRBench | |
| SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning | ARXIV | 25-04 | link | -- | -- | |
| Vision language models are unreliable at trivial spatial cognition | ARXIV | 25-04 | -- | -- | TableTest | |
| SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data | ARXIV | 25-04 | -- | -- | VSR, What's Up, 3DSRBench, RealWorldQA | |
| NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving | ARXIV | 25-04 | link | | NuScenes-SpatialQA | |
| Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models | -- | 25-03 | link | -- | -- | |
| MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse | ARXIV | 25-03 | link | | MetaSpatial | |
| Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models | ARXIV | 25-03 | link | | SRBench | |
| Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space | ARXIV | 25-03 | link | | Open3DVQA | |
| AutoSpatial: Visual-Language Reasoning for Social Robot Navigation through Efficient Spatial Reasoning Learning | ARXIV | 25-03 | link | | AutoSpatial | |
| Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas | ARXIV | 25-03 | link | | -- | |
| LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning? | ARXIV | 25-03 | link | | LEGO-Puzzles | |
| Visual Agentic AI for Spatial Reasoning with a Dynamic API | ARXIV | 25-02 | link | | -- | |
| iVISPAR – An Interactive Visual-Spatial Reasoning Benchmark for VLMs | ARXIV | 25-02 | link | | iVISPAR | |
| Visual Agentic AI for Spatial Reasoning with a Dynamic API | ARXIV | 25-02 | link | | Q-Spatial Bench, VSI-Bench | |
| Defining and Evaluating Visual Language Models' Basic Spatial Abilities: A Perspective from Psychometrics | ARXIV | 25-02 | -- | -- | BSA-Tests | |
| Mitigating Hallucinations in Multimodal Spatial Relations through Constraint-Aware Prompting | NAACL | 25-02 | link | | ARO, GQA, MMRel | |
| Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference under Ambiguities | ICLR | 25-01 | link | -- | COMFORT | |
| ROBOSPATIAL: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics | CVPR | 25-01 | -- | -- | -- | |
| Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models | CVPR | 25-01 | -- | -- | -- | |
| Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model | CVPR | 25-01 | link | -- | -- | |
| SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models | CVPR | 25-01 | link | -- | -- | |
| SpatialCLIP: Learning 3D-aware Image Representations from Spatially Discriminative Language | CVPR | 25-01 | link | | SpatialBench | |
| SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning | ARXIV | 25-01 | -- | -- | -- | |
| ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning | CVPR | 25-01 | link | -- | ReasoningGD | |
| Imagine while Reasoning in Space: Multimodal Visualization-of-Thought | ARXIV | 25-01 | -- | -- | LEC23, WMS+24, LZZ+24, RDT+24 | |
| LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations | ARXIV | 24-12 | link | | SpaceSGG | |
| 3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark | ARXIV | 24-12 | link | -- | 3DSRBench | |
| SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation | ACL | 24-12 | link | | SPHERE | |
| TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation | ARXIV | 24-11 | -- | -- | -- | |
| An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models | EMNLP | 24-11 | link | | Spatial-MM | |
| ROOT: VLM-based System for Indoor Scene Understanding and Beyond | ARXIV | 24-11 | link | | SceneVLM | |
| An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models | EMNLP 2024 | 24-11 | link | | Spatial-MM, GQA-spatial | |
| GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks | ARXIV | 24-11 | link | | GEOBench-VLM | |
| Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Spatial Reasoning | NIPS | 24-10 | -- | -- | What's Up, COCO-spatial, GQA-spatial | |
| Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models | ARXIV | 24-09 | link | | Q-Spatial Bench | |
| Can Vision Language Models Learn from Visual Demonstrations of Ambiguous Spatial Reasoning? | ARXIV | 24-09 | link | | SVAT | |
| Understanding Depth and Height Perception in Large Visual-Language Models | CVPRW | 24-08 | link | | GeoMeter | |
| Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model | -- | 24-08 | -- | -- | ScanQA, OpenEQA's episodic memory subset, EgoSchema, R2R, SQA3D | |
| VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs | ARXIV | 24-07 | link | | VSP | |
| SpatialBot: Precise Spatial Understanding with Vision Language Models | ICRA | 24-06 | link | | SpatialBench | |
| SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models | ARXIV | 24-06 | link | | SpatialRGPT-Bench | |
| TOPVIEWRS: Vision-Language Models as Top-View Spatial Reasoners | ARXIV | 24-06 | link | -- | -- | |
| Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models | NIPS | 24-06 | link | | SpatialEval | |
| EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models | ARXIV | 24-06 | link | | EmbSpatial-Bench | |
| GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs | NEURIPS 2024 WORKSHOP | 24-06 | -- | -- | GSR-BENCH | |
| RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics | CORL2024 | 24-06 | link | | RoboPoint | |
| Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning | ARXIV | 24-05 | -- | -- | RoomSpace, bAbI, StepGame, SpartQA, SpaRTUN | |
| RAG-Guided Large Language Models for Visual Spatial Description with Adaptive Hallucination Corrector | ACMMM24 | 24-05 | -- | -- | VSD | |
| Mind's Eye of LLMs: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models | NIPS | 24-04 | link | | VoT | |
| BLINK: Multimodal Large Language Models Can See but Not Perceive | ECCV | 24-04 | link | | BLINK | |
| Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning | CVPR2024 | 24-04 | link | | KITTI-360 | |
| Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning | -- | 24-03 | link | | Visual-CoT | |
| Can Transformers Capture Spatial Relations between Objects? | ICLR | 24-03 | link | | SRP | |
| SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors | NEURIPS2024 | 24-03 | link | | NOCS, RT-1, BridgeData V2, YCBInEOAT | |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | CVPR | 24-01 | link | | -- | |
| LLaVA-VSD: Large Language-and-Vision Assistant for Visual Spatial Description | ACM MM | 24-01 | -- | -- | -- | |
| Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis | ARXIV | 24-01 | link | | Proximity-110K | |
| Improving Vision-and-Language Reasoning via Spatial Relations Modeling | WACV | 23-11 | -- | -- | -- | |
| 3D-Aware Visual Question Answering about Parts, Poses and Occlusions | NIPS | 23-10 | link | | Super-CLEVR-3D | |
| Things not Written in Text: Exploring Spatial Commonsense from Visual Signals | ACL2022 | 22-03 | link | | -- | |
Monocular Video
Multi-View Images
Others
Datasets
Acknowledgements
Visitor Statistics