Benchmarks and Evaluations, RL Alignment, Applications, and Challenges of Large Vision Language Models
An up-to-date collection and survey of vision-language model papers, models, and GitHub repositories.
Below we compile papers, models, and GitHub repositories covering:
- State-of-the-Art VLMs: a collection of VLMs from newest to oldest (we will keep adding new models and benchmarks).
- Benchmarks and Evaluation: VLM evaluation benchmarks with links to the corresponding works.
- Post-Training/Alignment: the latest work on VLM alignment, including RL and SFT.
- Applications: applications of VLMs in embodied AI, robotics, and more.
- Contributions: surveys, perspectives, and datasets on the above topics.
Welcome to contribute and discuss!
🤩 Papers marked with a ⭐️ are contributed by the maintainers of this repository. If you find them useful, we would greatly appreciate it if you could give the repository a star or cite our paper.
Table of Contents
- 0. 📄 Paper Link / Citation
- 1. 📚 SoTA VLMs
- 2. 🗂️ Benchmarks and Evaluation
  - 2.1. Datasets for Training VLMs
  - 2.2. Datasets and Evaluation for VLMs
  - 2.3. Benchmark Datasets, Simulators, and Generative Models for Embodied VLMs
- 3. 🔥 Post-Training / Alignment / Prompt Engineering
  - 3.1. RL Alignment for VLMs
  - 3.2. Fine-Tuning for VLMs (SFT)
  - 3.3. VLM Alignment GitHub Repositories
  - 3.4. Prompt Optimization
- 4. ⚒️ Applications
  - 4.1. Embodied VLM Agents
  - 4.2. Generative Visual Media Applications
  - 4.3. Robotics and Embodied AI
    - 4.3.1. Manipulation
    - 4.3.2. Navigation
    - 4.3.3. Human-robot Interaction
    - 4.3.4. Autonomous Driving
  - 4.4. Human-Centered AI
    - 4.4.1. Web Agent
    - 4.4.2. Accessibility
    - 4.4.3. Healthcare
    - 4.4.4. Social Goodness
- 5. ⛑️ Challenges
  - 5.1. Hallucination
  - 5.2. Safety
  - 5.3. Fairness
  - 5.4. Alignment
    - 5.4.1. Multi-modality Alignment
    - 5.4.2. Commonsense and Physics Alignment
  - 5.5. Efficient Training and Fine-Tuning
  - 5.6. Scarcity of High-Quality Datasets
0. Citation
@InProceedings{Li_2025_CVPR,
author = {Li, Zongxia and Wu, Xiyang and Du, Hongyang and Liu, Fuxiao and Nghiem, Huy and Shi, Guangyao},
title = {A Survey of State of the Art Large Vision Language Models: Benchmark Evaluations and Challenges},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2025},
pages = {1587-1606}
}
1. 📚 SoTA VLMs
| Model | Year | Architecture | Training Data | Parameters | Vision Encoder/Tokenizer | Pretrained Backbone Model |
|---|---|---|---|---|---|---|
| Emu3.5 | 10/30/2025 | Decoder-only | Unified Modality Dataset | - | SigLIP | Qwen3 |
| DeepSeek-OCR | 10/20/2025 | Encoder-decoder | 70% OCR, 20% general vision, 10% text-only | 3B | DeepEncoder | DeepSeek-3B |
| Qwen3-VL | 10/11/2025 | Decoder-only | - | 8B/4B | ViT | Qwen3 |
| Qwen3-VL-MoE | 09/25/2025 | Decoder-only | - | 235B-A22B | ViT | Qwen3 |
| Qwen3-Omni (Visual/Audio/Text) | 09/21/2025 | - | Video/Audio/Image | 30B | ViT | Qwen3-Omni-MoE-Thinker |
| LLaVA-OneVision-1.5 | 09/15/2025 | - | Mid-Training-85M & SFT | 8B | Qwen2VLImageProcessor | Qwen3 |
| InternVL3.5 | 08/25/2025 | Decoder-only | Multimodal & text-only | 30B/38B/241B | InternViT-300M/6B | Qwen3 / GPT-OSS |
| SkyWork-UniPic-1.5B | 07/29/2025 | - | Image/Video | - | - | - |
| Grok 4 | 07/09/2025 | - | Image/Video | 1-2 Trillion | - | - |
| Kwai Keye-VL (Kuaishou) | 07/02/2025 | Decoder-only | Image/Video | 8B | ViT | Qwen3-8B |
| OmniGen2 | 06/23/2025 | Decoder-only & VAE | LLaVA-OneVision, SAM-LLaVA, etc. | - | ViT | Qwen2.5-VL |
| Gemini-2.5-Pro | 06/17/2025 | - | - | - | - | - |
| GPT-o3/o4-mini | 06/10/2025 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| MiMo-VL (Xiaomi) | 06/04/2025 | Decoder-only | 24 Trillion MLLM tokens | 7B | Qwen2.5-ViT | MiMo-7B-Base |
| BLIP3-o | 05/14/2025 | Decoder-only | BLIP3-o-60K (GPT-4o generated image-generation data) | 4B/8B | ViT | Qwen2.5-VL |
| InternVL3 | 04/14/2025 | Decoder-only | 200 Billion Tokens | 1/2/8/9/14/38/78B | ViT-300M/6B | InternLM2.5 / Qwen2.5 |
| LLaMA4-Scout/Maverick | 04/04/2025 | Decoder-only | 40/20 Trillion Tokens | 17B | MetaCLIP | LLaMA4 |
| Qwen2.5-Omni | 03/26/2025 | Decoder-only | Video/Audio/Image/Text | 7B | Qwen2-Audio / Qwen2.5-VL ViT | End-to-End Mini-Omni |
| Qwen2.5-VL | 01/28/2025 | Decoder-only | Image caption, VQA, grounding agent, long video | 3B/7B/72B | Redesigned ViT | Qwen2.5 |
| Ola | 2025 | Decoder-only | Image/Video/Audio/Text | 7B | OryxViT | Qwen-2.5-7B, SigLIP-400M, Whisper-V3-Large, BEATs-AS2M(cpt2) |
| Ocean-OCR | 2025 | Decoder-only | Pure Text, Caption, Interleaved, OCR | 3B | NaViT | Pretrained from scratch |
| SmolVLM | 2025 | Decoder-only | SmolVLM-Instruct | 250M & 500M | SigLIP | SmolLM |
| DeepSeek-Janus-Pro | 2025 | Decoder-only | Undisclosed | 7B | SigLIP | DeepSeek-Janus-Pro |
| Inst-IT | 2024 | Decoder-only | Inst-IT Dataset, LLaVA-NeXT-Data | 7B | CLIP/Vicuna, SigLIP/Qwen2 | LLaVA-NeXT |
| DeepSeek-VL2 | 2024 | Decoder-only | WiT, WikiHow | 4.5B x 74 | SigLIP/SAMB | DeepSeekMoE |
| xGen-MM (BLIP-3) | 2024 | Decoder-only | MINT-1T, OBELICS, Caption | 4B | ViT + Perceiver Resampler | Phi-3-mini |
| TransFusion | 2024 | Encoder-decoder | Undisclosed | 7B | VAE Encoder | Pretrained from scratch on transformer architecture |
| Baichuan Ocean Mini | 2024 | Decoder-only | Image/Video/Audio/Text | 7B | CLIP ViT-L/14 | Baichuan |
| LLaMA 3.2-vision | 2024 | Decoder-only | Undisclosed | 11B-90B | CLIP | LLaMA-3.1 |
| Pixtral | 2024 | Decoder-only | Undisclosed | 12B | CLIP ViT-L/14 | Mistral Large 2 |
| Qwen2-VL | 2024 | Decoder-only | Undisclosed | 7B-14B | EVA-CLIP ViT-L | Qwen-2 |
| NVLM | 2024 | Encoder-decoder | LAION-115M | 8B-24B | Custom ViT | Qwen-2-Instruct |
| Emu3 | 2024 | Decoder-only | Aquila | 7B | MoVQGAN | LLaMA-2 |
| Claude 3 | 2024 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| InternVL | 2023 | Encoder-decoder | LAION-en, LAION-multi | 7B/20B | EVA CLIP ViT-g | QLLaMA |
| InstructBLIP | 2023 | Encoder-decoder | COCO, VQAv2 | 13B | ViT | Flan-T5, Vicuna |
| CogVLM | 2023 | Encoder-decoder | LAION-2B, COYO-700M | 18B | CLIP ViT-L/14 | Vicuna |
| PaLM-E | 2023 | Decoder-only | All robots, WebLI | 562B | ViT | PaLM |
| LLaVA-1.5 | 2023 | Decoder-only | COCO | 13B | CLIP ViT-L/14 | Vicuna |
| Gemini | 2023 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| GPT-4V | 2023 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| BLIP-2 | 2023 | Encoder-decoder | COCO, Visual Genome | 7B-13B | ViT-g | Open Pretrained Transformer (OPT) |
| Flamingo | 2022 | Decoder-only | M3W, ALIGN | 80B | Custom | Chinchilla |
| BLIP | 2022 | Encoder-decoder | COCO, Visual Genome | 223M-400M | ViT-B/L/g | Pretrained from scratch |
| CLIP | 2021 | Encoder-decoder | 400M image-text pairs | 63M-355M | ViT/ResNet | Pretrained from scratch |
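Most of the open-weight models above can be run through Hugging Face `transformers` with the same load-processor-then-generate pattern. Below is a minimal, hedged sketch that assumes the `llava-hf/llava-1.5-7b-hf` checkpoint (LLaVA-1.5 from the table); other checkpoints ship their own processor classes and prompt templates, so consult each model card.

```python
# Minimal sketch (not an official recipe): querying an open-weight VLM from the table
# above with Hugging Face transformers. Assumes `pip install transformers torch pillow
# requests` and the llava-hf/llava-1.5-7b-hf checkpoint; newer models (Qwen2.5-VL,
# InternVL3, ...) follow the same pattern with their own processors and chat templates.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nDescribe this image in one sentence. ASSISTANT:"

# Pack image + text into model inputs, cast to the model's device/dtype, and decode.
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```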
2. 🗂️ Benchmarks and Evaluation
2.1. Datasets for Training VLMs
| Dataset | Task | Size |
|---|---|---|
| FineVision | Mixed Domain | 24.3M / 4.48TB |
2.2. Datasets and Evaluation for VLMs
🧮 Visual Math (+ Visual Math Reasoning)
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| MathVision | Visual Math | MC / Answer Match | Human | 3.04 | Repo |
| MathVista | Visual Math | MC / Answer Match | Human | 6 | Repo |
| MathVerse | Visual Math | MC | Human | 4.6 | Repo |
| VisNumBench | Visual Number Reasoning | MC | Python-program generated / Web collection / Real-life photos | 1.91 | Repo |
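The "Eval Protocol" column in these tables distinguishes multiple-choice (MC) scoring, open-ended answer matching, and LLM-as-judge evaluation. The sketch below illustrates how MC and answer-match scoring are commonly implemented; it is not the official scorer of any listed benchmark, each of which ships its own evaluation script that should be used for reported numbers.

```python
# Illustrative sketch of the "MC / Answer Match" protocols listed in the tables above.
# This is NOT any benchmark's official scorer; each dataset (MathVista, MMMU, ...) ships
# its own evaluation script.
import re

def normalize(ans: str) -> str:
    """Lowercase, drop punctuation/articles/extra whitespace before comparison."""
    ans = ans.lower().strip()
    ans = re.sub(r"[^\w\s.%-]", "", ans)
    ans = re.sub(r"\b(a|an|the)\b", " ", ans)
    return re.sub(r"\s+", " ", ans).strip()

def answer_match(prediction: str, reference: str) -> bool:
    """Open-ended 'Answer Match' protocol: exact match after normalization."""
    return normalize(prediction) == normalize(reference)

def mc_match(prediction: str, correct_choice: str) -> bool:
    """Multiple-choice protocol: extract the first standalone option letter (A-E)."""
    m = re.search(r"\b([A-E])\b", prediction.upper())
    return bool(m) and m.group(1) == correct_choice.upper()

# Toy usage: mix of an MC item and an open-ended numeric item.
preds = [("The answer is B", "B"), ("12.5%", "12.5%")]
acc = sum(mc_match(p, r) if len(r) == 1 else answer_match(p, r) for p, r in preds) / len(preds)
print(f"accuracy: {acc:.2f}")
```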
🎞️ Video Understanding
💬 Multimodal Conversation
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| VisionArena | Multimodal Conversation | Pairwise Pref | Human | 23 | Repo |
🧠 Multimodal General Intelligence
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| MMLU | General MM | MC | Human | 15.9 | Repo |
| MMStar | General MM | MC | Human | 1.5 | Site |
| NaturalBench | General MM | Yes/No, MC | Human | 10 | HF |
| PHYSBENCH | Physical World Understanding | MC | Grad STEM | 0.10 | Repo |
🔎 Visual Reasoning / VQA (+ Multilingual & OCR)
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| EMMA | Visual Reasoning | MC | Human + Synth | 2.8 | Repo |
| MMTBENCH | Visual Reasoning & QA | MC | AI Experts | 30.1 | Repo |
| MM‑Vet | OCR / Visual Reasoning | LLM Eval | Human | 0.2 | Repo |
| MM‑En/CN | Multilingual MM Understanding | MC | Human | 3.2 | Repo |
| GQA | Visual Reasoning & QA | Answer Match | Seed + Synth | 22 | Site |
| VCR | Visual Reasoning & QA | MC | MTurks | 290 | Site |
| VQAv2 | Visual Reasoning & QA | Yes/No, Ans Match | MTurks | 1100 | Repo |
| MMMU | Visual Reasoning & QA | Ans Match, MC | College | 11.5 | Site |
| MMMU-Pro | Visual Reasoning & QA | Ans Match, MC | College | 5.19 | Site |
| R1‑Onevision | Visual Reasoning & QA | MC | Human | 155 | Repo |
| VLM²‑Bench | Visual Reasoning & QA | Ans Match, MC | Human | 3 | Site |
| VisualWebInstruct | Visual Reasoning & QA | LLM Eval | Web | 0.9 | Site |
📝 Visual Text / Document Understanding (+ Charts)
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| TextVQA | Visual Text Understanding | Ans Match | Expert | 28.6 | Repo |
| DocVQA | Document VQA | Ans Match | Crowd | 50 | Site |
| ChartQA | Chart Graphic Understanding | Ans Match | Crowd / Synth | 32.7 | Repo |
🌄 Text‑to‑Image Generation
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| MSCOCO‑30K | Text‑to‑Image | BLEU, ROUGE, Sim | MTurks | 30 | Site |
| GenAI‑Bench | Text‑to‑Image | Human Rating | Human | 80 | HF |
🚨 Hallucination Detection / Control
2.3. Benchmark Datasets, Simulators, and Generative Models for Embodied VLMs
| Benchmark | Domain | Type | Project |
|---|---|---|---|
| Drive-Bench | Embodied AI | Autonomous Driving | Website |
| Habitat, Habitat 2.0, Habitat 3.0 | Robotics (Navigation) | Simulator + Dataset | Website |
| Gibson | Robotics (Navigation) | Simulator + Dataset | Website, GitHub Repo |
| iGibson 1.0, iGibson 2.0 | Robotics (Navigation) | Simulator + Dataset | Website, Document |
| Isaac Gym | Robotics (Navigation) | Simulator | Website, GitHub Repo |
| Isaac Lab | Robotics (Navigation) | Simulator | Website, GitHub Repo |
| AI2THOR | Robotics (Navigation) | Simulator | Website, GitHub Repo |
| ProcTHOR | Robotics (Navigation) | Simulator + Dataset | Website, GitHub Repo |
| VirtualHome | Robotics (Navigation) | Simulator | Website, GitHub Repo |
| ThreeDWorld | Robotics (Navigation) | Simulator | Website, GitHub Repo |
| VIMA-Bench | Robotics (Manipulation) | Simulator | Website, GitHub Repo |
| VLMbench | Robotics (Manipulation) | Simulator | GitHub Repo |
| CALVIN | Robotics (Manipulation) | Simulator | Website, GitHub Repo |
| GemBench | Robotics (Manipulation) | Simulator | Website, GitHub Repo |
| WebArena | Web Agent | Simulator | Website, GitHub Repo |
| UniSim | Robotics (Manipulation) | Generative Model, World Model | Website |
| GAIA-1 | Robotics (Autonomous Driving) | Generative Model, World Model | Website |
| LWM | Embodied AI | Generative Model, World Model | Website, GitHub Repo |
| Genesis | Embodied AI | Generative Model, World Model | GitHub Repo |
| EMMOE | Embodied AI | Generative Model, World Model | Paper |
| RoboGen | Embodied AI | Generative Model, World Model | Website |
| UnrealZoo | Embodied AI (Tracking, Navigation, Multi-Agent) | Simulator | Website |
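Several of the simulators above expose Python APIs that can supply observations to a VLM-based embodied agent. Below is a hedged sketch for AI2THOR (assuming `pip install ai2thor`; the scene name and actions are illustrative); Habitat, iGibson, and the other simulators provide analogous interfaces.

```python
# Hedged sketch: stepping AI2THOR from Python to collect egocentric observations that
# could be passed to a VLM planner. Assumes `pip install ai2thor`; scene and actions
# are illustrative, not a benchmark protocol.
from ai2thor.controller import Controller

controller = Controller(scene="FloorPlan1", gridSize=0.25)

# Step the agent and read back an RGB frame plus object metadata.
event = controller.step(action="MoveAhead")
event = controller.step(action="RotateRight", degrees=90)

rgb = event.frame  # (H, W, 3) uint8 numpy array, the agent's egocentric view
visible = [obj["objectType"] for obj in event.metadata["objects"] if obj["visible"]]
print(rgb.shape, visible[:5])

controller.stop()
```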
3. 🔥 Post-Training / Alignment / Prompt Engineering
3.1. RL Alignment for VLMs
| Title | Year | Paper | RL | Code |
|---|---|---|---|---|
| Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning | 10/12/2025 | Paper | GRPO | - |
| Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play | 09/29/2025 | Paper | GRPO | - |
| Vision-SR1: Self-Rewarding Vision-Language Model via Reasoning Decomposition | 08/26/2025 | Paper | GRPO | - |
| Group Sequence Policy Optimization | 06/24/2025 | Paper | GSPO | - |
| Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning | 05/20/2025 | Paper | GRPO | - |
| VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning | 04/10/2025 | Paper | GRPO | Code |
| OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement | 03/21/2025 | Paper | GRPO | Code |
| Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning | 03/10/2025 | Paper | GRPO | Code |
| OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference | 2025 | Paper | DPO | Code |
| Multimodal Open R1 / R1-Multimodal-Journey | 2025 | - | GRPO | Code |
| R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization | 2025 | Paper | GRPO | Code |
| Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning | 2025 | - | PPO / REINFORCE++ / GRPO | Code |
| MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning | 2025 | Paper | REINFORCE Leave-One-Out (RLOO) | Code |
| MM-RLHF: The Next Step Forward in Multimodal LLM Alignment | 2025 | Paper | DPO | Code |
| LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL | 2025 | Paper | PPO | Code |
| Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models | 2025 | Paper | GRPO | Code |
| Unified Reward Model for Multimodal Understanding and Generation | 2025 | Paper | DPO | Code |
| Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step | 2025 | Paper | DPO | Code |
| All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning | 2025 | Paper | Online RL | - |
| Video-R1: Reinforcing Video Reasoning in MLLMs | 2025 | Paper | GRPO | Code |
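Many of the entries above use GRPO-style training: a group of responses is sampled per prompt, scored with a verifiable reward (e.g., answer correctness and format), and rewards are normalized within the group instead of relying on a learned critic. The snippet below is a hedged sketch of that group-relative advantage computation only; reward design, KL penalties, and clipping differ across the listed papers, so see their linked code for the full objectives.

```python
# Minimal sketch of the group-relative advantage used by GRPO-style methods listed above
# (Vision-R1, Video-R1, ...). Not any paper's full training loop.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards -> same-shape advantages."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy example: 2 prompts, 4 sampled responses each, binary correctness reward.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
adv = group_relative_advantages(rewards)
print(adv)
# Each advantage then weights the log-probability of its response in a clipped,
# PPO-style policy-gradient objective, without a learned value function.
```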
3.2. Fine-Tuning for VLMs (SFT)
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | 04/21/2025 | Paper | Website | Code |
| OMNICAPTIONER: One Captioner to Rule Them All | 04/09/2025 | Paper | Website | Code |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | 2024 | Paper | Website | Code |
| LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression | 2024 | Paper | Website | Code |
| ViTamin: Designing Scalable Vision Models in the Vision-Language Era | 2024 | Paper | Website | Code |
| Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model | 2024 | Paper | - | - |
| Should VLMs be Pre-trained with Image Data? | 2025 | Paper | - | - |
| VisionArena: 230K Real World User-VLM Conversations with Preference Labels | 2024 | Paper | - | Code |
3.3. VLM Alignment GitHub Repositories
3.4. Prompt Optimization
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer | 04/30/2025 | Paper | Website | Code |
4. ⚒️ Applications
4.1 Embodied VLM Agents
| Title | Year | Paper Link |
|---|---|---|
| Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI | 2024 | Paper |
| ScreenAI: A Vision-Language Model for UI and Infographics Understanding | 2024 | Paper |
| ChartLlama: A Multimodal LLM for Chart Understanding and Generation | 2023 | Paper |
| SciDoc2Diagrammer-MAF: Towards Generation of Scientific Diagrams from Documents guided by Multi-Aspect Feedback Refinement | 2024 | 📄 Paper |
| Training a Vision Language Model as Smartphone Assistant | 2024 | Paper |
| ScreenAgent: A Vision-Language Model-Driven Computer Control Agent | 2024 | Paper |
| Embodied Vision-Language Programmer from Environmental Feedback | 2024 | Paper |
| VLMs Play StarCraft II: A Benchmark and Multimodal Decision Method | 2025 | 📄 Paper |
| MP-GUI: Modality Perception with MLLMs for GUI Understanding | 2025 | 📄 Paper |
4.2. Generative Visual Media Applications
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning | 2023 | 📄 Paper | 🌍 Website | 💾 Code |
| Spurious Correlation in Multimodal LLMs | 2025 | 📄 Paper | - | - |
| WeGen: A Unified Model for Interactive Multimodal Generation as We Chat | 2025 | 📄 Paper | - | 💾 Code |
| VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
4.3. Robotics and Embodied AI
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation | 2024 | 📄 Paper | 🌍 Website | - |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | 2024 | 📄 Paper | 🌍 Website | - |
| Vision-language model-driven scene understanding and robotic object manipulation | 2024 | 📄 Paper | - | - |
| Guiding Long-Horizon Task and Motion Planning with Vision Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers | 2023 | 📄 Paper | 🌍 Website | - |
| VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model | 2024 | 📄 Paper | - | - |
| Scalable Multi-Robot Collaboration with Large Language Models: Centralized or Decentralized Systems? | 2023 | 📄 Paper | 🌍 Website | - |
| DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| MotionGPT: Human Motion as a Foreign Language | 2023 | 📄 Paper | - | 💾 Code |
| Learning Reward for Robot Skills Using Large Language Models via Self-Alignment | 2024 | 📄 Paper | - | - |
| Language to Rewards for Robotic Skill Synthesis | 2023 | 📄 Paper | 🌍 Website | - |
| Eureka: Human-Level Reward Design via Coding Large Language Models | 2023 | 📄 Paper | 🌍 Website | - |
| Integrated Task and Motion Planning | 2020 | 📄 Paper | - | - |
| Jailbreaking LLM-Controlled Robots | 2024 | 📄 Paper | 🌍 Website | - |
| Robots Enact Malignant Stereotypes | 2022 | 📄 Paper | 🌍 Website | - |
| LLM-Driven Robots Risk Enacting Discrimination, Violence, and Unlawful Actions | 2024 | 📄 Paper | - | - |
| Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics | 2024 | 📄 Paper | 🌍 Website | - |
| EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents | 2025 | 📄 Paper | 🌍 Website | 💾 Code & Dataset |
| Gemini Robotics: Bringing AI into the Physical World | 2025 | 📄 Technical Report | 🌍 Website | - |
| GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation | 2024 | 📄 Paper | 🌍 Website | - |
| Magma: A Foundation Model for Multimodal AI Agents | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| DayDreamer: World Models for Physical Robot Learning | 2022 | 📄 Paper | 🌍 Website | 💾 Code |
| Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models | 2025 | 📄 Paper | - | - |
| RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| Unified Video Action Model | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
4.3.1. Manipulation
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| VIMA: General Robot Manipulation with Multimodal Prompts | 2022 | 📄 Paper | 🌍 Website | - |
| Instruct2Act: Mapping Multi-Modality Instructions to Robotic Actions with Large Language Model | 2023 | 📄 Paper | - | - |
| Creative Robot Tool Use with Large Language Models | 2023 | 📄 Paper | 🌍 Website | - |
| RoboVQA: Multimodal Long-Horizon Reasoning for Robotics | 2024 | 📄 Paper | - | - |
| RT-1: Robotics Transformer for Real-World Control at Scale | 2022 | 📄 Paper | 🌍 Website | - |
| RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | 2023 | 📄 Paper | 🌍 Website | - |
| Open X-Embodiment: Robotic Learning Datasets and RT-X Models | 2023 | 📄 Paper | 🌍 Website | - |
| ExploRLLM: Guiding Exploration in Reinforcement Learning with Large Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| Masked World Models for Visual Control | 2022 | 📄 Paper | 🌍 Website | 💾 Code |
| Multi-View Masked World Models for Visual Robotic Manipulation | 2023 | 📄 Paper | 🌍 Website | 💾 Code |
4.3.2. Navigation
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings | 2022 | 📄 Paper | - | - |
| LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation | 2024 | 📄 Paper | - | - |
| LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action | 2022 | 📄 Paper | 🌍 Website | - |
| NaVILA: Legged Robot Vision-Language-Action Model for Navigation | 2024 | 📄 Paper | 🌍 Website | - |
| VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation | 2024 | 📄 Paper | - | - |
| Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning | 2023 | 📄 Paper | 🌍 Website | - |
| Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments | 2025 | 📄 Paper | - | - |
| Navigation World Models | 2024 | 📄 Paper | 🌍 Website | - |
4.3.3. Human-robot Interaction
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| MUTEX: Learning Unified Policies from Multimodal Task Specifications | 2023 | 📄 Paper | 🌍 Website | - |
| LaMI: Large Language Models for Multi-Modal Human-Robot Interaction | 2024 | 📄 Paper | 🌍 Website | - |
| VLM-Social-Nav: Socially Aware Robot Navigation through Scoring using Vision-Language Models | 2024 | 📄 Paper | - | - |
4.3.4. Autonomous Driving
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives | 01/07/2025 | 📄 Paper | 🌍 Website | - |
| DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| GPT-Driver: Learning to Drive with GPT | 2023 | 📄 Paper | - | - |
| LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving | 2023 | 📄 Paper | 🌍 Website | - |
| Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving | 2023 | 📄 Paper | - | - |
| Referring Multi-Object Tracking | 2023 | 📄 Paper | - | 💾 Code |
| VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision | 2023 | 📄 Paper | - | 💾 Code |
| MotionLM: Multi-Agent Motion Forecasting as Language Modeling | 2023 | 📄 Paper | - | - |
| DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models | 2023 | 📄 Paper | 🌍 Website | - |
| VLP: Vision Language Planning for Autonomous Driving | 2024 | 📄 Paper | - | - |
| DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model | 2023 | 📄 Paper | - | - |
4.4. Human-Centered AI
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis | 2024 | 📄 Paper | - | 💾 Code |
| LIT: Large Language Model Driven Intention Tracking for Proactive Human-Robot Collaboration – A Robot Sous-Chef Application | 2024 | 📄 Paper | - | - |
| Pretrained Language Models as Visual Planners for Human Assistance | 2023 | 📄 Paper | - | - |
| Promoting AI Equity in Science: Generalized Domain Prompt Learning for Accessible VLM Research | 2024 | 📄 Paper | - | - |
| Image and Data Mining in Reticular Chemistry Using GPT-4V | 2023 | 📄 Paper | - | - |
4.4.1. Web Agent
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis | 2023 | 📄 Paper | - | - |
| CogAgent: A Visual Language Model for GUI Agents | 2023 | 📄 Paper | - | 💾 Code |
| WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models | 2024 | 📄 Paper | - | 💾 Code |
| ShowUI: One Vision-Language-Action Model for GUI Visual Agent | 2024 | 📄 Paper | - | 💾 Code |
| ScreenAgent: A Vision Language Model-driven Computer Control Agent | 2024 | 📄 Paper | - | 💾 Code |
| Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation | 2024 | 📄 Paper | - | 💾 Code |
4.4.2. Accessibility
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| X-World: Accessibility, Vision, and Autonomy Meet | 2021 | 📄 Paper | - | - |
| Context-Aware Image Descriptions for Web Accessibility | 2024 | 📄 Paper | - | - |
| Improving VR Accessibility Through Automatic 360 Scene Description Using Multimodal Large Language Models | 2024 | 📄 Paper | - | - |
4.4.3. Healthcare
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge | 2024 | 📄 Paper | - | 💾 Code |
| Multimodal Healthcare AI: Identifying and Designing Clinically Relevant Vision-Language Applications for Radiology | 2024 | 📄 Paper | - | - |
| M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization | 2023 | 📄 Paper | - | - |
| MedCLIP: Contrastive Learning from Unpaired Medical Images and Text | 2022 | 📄 Paper | - | 💾 Code |
| Med-Flamingo: A Multimodal Medical Few-Shot Learner | 2023 | 📄 Paper | - | 💾 Code |
4.4.4. Social Goodness
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| Analyzing K-12 AI Education: A Large Language Model Study of Classroom Instruction on Learning Theories, Pedagogy, Tools, and AI Literacy | 2024 | 📄 Paper | - | - |
| Students Rather Than Experts: A New AI for Education Pipeline to Model More Human-Like and Personalized Early Adolescence | 2024 | 📄 Paper | - | - |
| Harnessing Large Vision and Language Models in Agriculture: A Review | 2024 | 📄 Paper | - | - |
| A Vision-Language Model for Predicting Potential Distribution Land of Soybean Double Cropping | 2024 | 📄 Paper | - | - |
| Vision-Language Model is NOT All You Need: Augmentation Strategies for Molecule Language Models | 2024 | 📄 Paper | - | 💾 Code |
| DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students’ Hand-Drawn Math Images | 2024 | 📄 Paper | - | - |
| MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models | 2024 | 📄 Paper | - | 💾 Code |
| Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps | 2024 | 📄 Paper | - | 💾 Code |
| He is Very Intelligent, She is Very Beautiful? On Mitigating Social Biases in Language Modeling and Generation | 2021 | 📄 Paper | - | - |
| UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Region Profiling | 2024 | 📄 Paper | - | - |
5. ⛑️ Challenges
5.1 Hallucination
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| Object Hallucination in Image Captioning | 2018 | 📄 Paper | - | - |
| Evaluating Object Hallucination in Large Vision-Language Models | 2023 | 📄 Paper | - | 💾 Code |
| Detecting and Preventing Hallucinations in Large Vision Language Models | 2023 | 📄 Paper | - | - |
| HallE-Control: Controlling Object Hallucination in Large Multimodal Models | 2023 | 📄 Paper | - | 💾 Code |
| Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs | 2024 | 📄 Paper | - | 💾 Code |
| BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models | 2023 | 📄 Paper | - | 💾 Code |
| AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | 2023 | 📄 Paper | - | 💾 Code |
| Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models | 2024 | 📄 Paper | - | 💾 Code |
| AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation | 2023 | 📄 Paper | - | 💾 Code |
5.2 Safety
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| Safe-VLN: Collision Avoidance for Vision-and-Language Navigation of Autonomous Robots Operating in Continuous Environments | 2023 | 📄 Paper | - | - |
| SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models | 2024 | 📄 Paper | - | - |
| JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks | 2024 | 📄 Paper | - | - |
| SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models | 2024 | 📄 Paper | - | 💾 Code |
| Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models | 2024 | 📄 Paper | - | - |
| Jailbreaking Attack against Multimodal Large Language Model | 2024 | 📄 Paper | - | - |
| Embodied Red Teaming for Auditing Robotic Foundation Models | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| Safety Guardrails for LLM-Enabled Robots | 2025 | 📄 Paper | - | - |
5.3 Fairness
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| Hallucination of Multimodal Large Language Models: A Survey | 2024 | 📄 Paper | - | - |
| Bias and Fairness in Large Language Models: A Survey | 2023 | 📄 Paper | - | - |
| Fairness and Bias in Multimodal AI: A Survey | 2024 | 📄 Paper | - | - |
| Multi-Modal Bias: Introducing a Framework for Stereotypical Bias Assessment beyond Gender and Race in Vision–Language Models | 2023 | 📄 Paper | - | - |
| FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks | 2024 | 📄 Paper | - | - |
| FairCLIP: Harnessing Fairness in Vision-Language Learning | 2024 | 📄 Paper | - | - |
| FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models | 2024 | 📄 Paper | - | - |
| Benchmarking Vision Language Models for Cultural Understanding | 2024 | 📄 Paper | - | - |
5.4 Alignment
5.4.1 Multi-modality Alignment
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding | 2024 | 📄 Paper | - | - |
| Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement | 2024 | 📄 Paper | - | - |
| Assessing and Learning Alignment of Unimodal Vision and Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| Extending Multi-modal Contrastive Representations | 2023 | 📄 Paper | - | 💾 Code |
| OneLLM: One Framework to Align All Modalities with Language | 2023 | 📄 Paper | - | 💾 Code |
| What You See is What You Read? Improving Text-Image Alignment Evaluation | 2023 | 📄 Paper | 🌍 Website | 💾 Code |
| Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
5.4.2 Commonsense and Physics Alignment
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| VBench: Comprehensive Benchmark Suite for Video Generative Models | 2023 | 📄 Paper | 🌍 Website | 💾 Code |
| VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| PhysBench: Benchmarking and Enhancing VLMs for Physical World Understanding | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| VideoPhy: Evaluating Physical Commonsense for Video Generation | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| WorldSimBench: Towards Video Generation Models as World Simulators | 2024 | 📄 Paper | 🌍 Website | - |
| WorldModelBench: Judging Video Generation Models As World Models | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation | 2025 | 📄 Paper | - | 💾 Code |
| Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency | 2025 | 📄 Paper | - | 💾 Code |
| Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding | 2025 | 📄 Paper | - | - |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| Do generative video models understand physical principles? | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| How Far is Video Generation from World Model: A Physical Law Perspective | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| Imagine while Reasoning in Space: Multimodal Visualization-of-Thought | 2025 | 📄 Paper | - | - |
| VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
5.5 Efficient Training and Fine-Tuning
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| VILA: On Pre-training for Visual Language Models | 2023 | 📄 Paper | - | - |
| SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | 2021 | 📄 Paper | - | - |
| LoRA: Low-Rank Adaptation of Large Language Models | 2021 | 📄 Paper | - | 💾 Code |
| QLoRA: Efficient Finetuning of Quantized LLMs | 2023 | 📄 Paper | - | - |
| Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback | 2022 | 📄 Paper | - | 💾 Code |
| RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback | 2023 | 📄 Paper | - | - |
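The LoRA/QLoRA entries above cut fine-tuning cost by training low-rank adapters instead of full weights. The snippet below is a generic sketch using the Hugging Face `peft` library with an assumed small backbone (`facebook/opt-350m`); it is not the recipe of any listed paper, and for a VLM one would typically target the attention projections of its language decoder the same way.

```python
# Generic LoRA sketch (cf. the LoRA/QLoRA entries above) using the Hugging Face `peft`
# library. Model choice, target modules, and ranks are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# A small language backbone keeps the example cheap to run.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling factor for the LoRA update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # OPT attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```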
5.6 Scarcity of High-Quality Datasets
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| A Survey on Bridging VLMs and Synthetic Data | 2025 | 📄 Paper | - | 💾 Code |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| SLIP: Self-supervision meets Language-Image Pre-training | 2021 | 📄 Paper | - | 💾 Code |
| Synthetic Vision: Training Vision-Language Models to Understand Physics | 2024 | 📄 Paper | - | - |
| Synth2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings | 2024 | 📄 Paper | - | - |
| KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data | 2024 | 📄 Paper | - | - |
| Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation | 2024 | 📄 Paper | - | - |