🌟 Awesome-Document-Understanding 
A curated list of awesome Document Understanding resources, including papers, code, and datasets.
Continuously updated 🤗
📋 Table of contents
- Milestone
- Document Understanding
- MLLM
- Grounded MLLM
- Video LLM
🏆 Milestone
MiniCPM
- MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe (OpenBMB) | 25.8.25
- MiniCPM-V: A GPT-4V Level MLLM on Your Phone (OpenBMB) | 24.8.3
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies (THU,ModelBest) | 24.4.9
- LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images (THU,NUS,UCAS) | 24.3.18
- Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages (THU,ShanghaiAILab,Zhihu,ModelBest) | 23.8.23
VILA
- NVILA: Efficient Frontier Visual Language Models (NVIDIA,MIT,UCB,TW,THU) | 24.12.5
- VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation (THU,MIT,NVIDIA,UCB,UCSD) | 24.9.6
- LongVILA: Scaling Long-Context Visual Language Models for Long Videos (NVIDIA,MIT,KAUST) | 24.8.19
- VILA2: VILA Augmented VILA (NVIDIA,MIT,UT-Austin) | 24.7.24
- X-VILA: Cross-Modality Alignment for Large Language Model (NVIDIA,HKUST,MIT) | 24.5.29
- VILA: On Pre-training for Visual Language Models (NVIDIA,MIT) | 23.12.12
LLaMA
- The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation (Meta) | 25.4.5
- The Llama 3 Herd of Models (Meta) | 24.7.13
- Llama 2: Open Foundation and Fine-Tuned Chat Models (Meta) | 23.7.18
- LLaMA: Open and Efficient Foundation Language Models (Meta) | 23.2.27
Qwen
- Qwen3-VL: Sharper Vision, Deeper Thought, Broader Action (Alibaba) | 25.9.23
- Qwen3-Omni Technical Report (Alibaba) | 25.9.22
- Qwen2.5-VL Technical Report (Alibaba) | 25.2.19
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (Alibaba) | 24.9.18
- Qwen2 Technical Report (Alibaba) | 24.7.15
- Qwen Technical Report (Alibaba) | 23.9.28
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond (Alibaba) | 23.8.24
Intern
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency (Shanghai AI Lab) | 25.8.25
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models (Shanghai AI Lab) | 25.4.14
- Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance (Shanghai AI Lab) | 24.10.21
- InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output (Shanghai AI Lab) | 24.7.3
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD (Shanghai AI Lab) | 24.4.9
- InternLM2 Technical Report (Shanghai AI Lab) | 24.3.26
- InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model (Shanghai AI Lab) | 24.1.29
- InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition (Shanghai AI Lab) | 23.9.26
- InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities (Shanghai AI Lab) | 23.6.3
📑 Document Understanding
2025
- Towards Visual Text Grounding of Multimodal Large Language Model (Adobe,Maryland) | 25.4.7
- A Simple yet Effective Layout Token in Large Language Models for Document Understanding (ZJU,Alibaba) | 25.3.24
- MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding (Adobe) | 25.3.18
- PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks (Baidu) | 25.3.6
2024
- DocVLM: Make Your VLM an Efficient Reader (AWS) | 24.12.11
- TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens (Huawei) | 24.10.7 | arXiv | Code
- DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models (KAIST,AWS) | 24.10.4 | arXiv | Code
- mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding (Alibaba,RUC) | 24.9.5 | arXiv | Code
- General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model (StepFun,Megvii,UCAS,THU) | 24.9.3 | arXiv | Code
- LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models (Adobe,Buffalo) | 24.7.27 | arXiv | Code
- Harmonizing Visual Text Comprehension and Generation (ECNU,ByteDance) | 24.7.23 | NIPS24 | Code
- A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding (FDU,ByteDance) | 24.7.2 | arXiv | Code
- Multimodal Table Understanding (UCAS,Baidu) | 24.06.12 | ACL24 | Code
- TRINS: Towards Multimodal Language Models that Can Read (Adobe,GIT) | 24.06.10 | CVPR24 | Code
- TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy (USTC,ByteDance) | 24.6.3 | arXiv | Code
- StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond (Baidu) | 24.5.31 | arXiv | Code
- Focus Anywhere for Fine-grained Multi-page Document Understanding (UCAS,MEGVII) | 24.5.23 | arXiv | Code
- MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering (ByteDance,HUST) | 24.5.20 | arXiv | Code
- Exploring the Capabilities of Large Multimodal Models on Dense Text (HUST) | 24.5.9 | arXiv | Code
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites (Shanghai AI Lab,CUHK,THU,NJU,FDU,SenseTime) | 24.4.25 | arXiv | Code
- TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding (USTC) | 24.4.15 | arXiv | Code
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD (Shanghai AI Lab,CUHK,THU,SenseTime) | 24.4.9 | arXiv | Code
- LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding (Alibaba,ZJU) | 24.4.8 | arXiv | Code
- Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models (CUHK,Shanghai AI Lab,SenseTime) | 24.3.25 | arXiv | Code
- mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding (Alibaba,RUC) | 24.3.19 | arXiv | Code
- TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document (HUST) | 24.3.7 | arXiv | Code
- HRVDA: High-Resolution Visual Document Assistant (Tencent YouTu Lab,USTC) | 24.2.29 | CVPR24
- Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models (Tencent YouTu Lab) | 24.2.29 | CVPR24
- InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model (Shanghai AI Lab,CUHK,SenseTime) | 24.1.29 | arXiv | Code
- Small Language Model Meets with Reinforced Vision Vocabulary (MEGVII,UCAS,HUST) | 24.1.23 | arXiv | Code
2023
- DocLLM: A layout-aware generative language model for multimodal document understanding (JPMorgan AI Research) | 23.12.31 | arXiv
- Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models (MEGVII,UCAS,HUST) | 23.12.11 | ECCV24 | Code
- mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model (Alibaba) | 23.11.30 | arXiv | Code
- Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs (USTC) | 23.11.22 | arXiv | Code
- DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding (USTC,ByteDance) | 23.11.20 | arXiv
- Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models (HUST) | 23.11.11 | CVPR24 | Code
- mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration (Alibaba) | 23.11.07 | CVPR24 | Code
- Exploring OCR Capabilities of GPT-4V(ision): A Quantitative and In-depth Evaluation (SCUT) | 23.10.25 | arXiv | Code
- UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model (DAMO,RUC,ECNU) | 23.10.08 | arXiv | Code
- Kosmos-2.5: A Multimodal Literate Model (MSRA) | 23.9.20 | arXiv | Code
- BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions (UC San Diego) | 23.8.19 | AAAI24 | Code
- UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding (USTC,ByteDance) | 23.8.19 | arXiv
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding (DAMO) | 23.7.4 | arXiv | Code
- Document Understanding Dataset and Evaluation (DUDE) | 23.5.15 | arXiv | Website
- On the Hidden Mystery of OCR in Large Multimodal Models (HUST,SCUT,Microsoft) | 23.5.13 | arXiv | Code
- Visual Information Extraction in the Wild: Practical Dataset and End-to-end Solution (HUST) | 23.5.12 | arXiv | Code
- StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training (Baidu) | 23.03.01 | ICLR23 | Code
2022
- Wukong-Reader: Multi-modal Pre-training for Fine-grained Visual Document Understanding (Huawei) | 22.12.19 | ACL23
- Unifying Vision, Text, and Layout for Universal Document Processing (Microsoft) | 22.12.05 | CVPR23 | Code
- ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding (Baidu) | 22.10.12 | arXiv | Code
- Unified Pretraining Framework for Document Understanding (Adobe) | 22.04.22 | NIPS21
- LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking (Microsoft) | 22.04.18 | ACM MM22 | Code
- XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding (Alibaba) | 22.3.14 | CVPR22 | Code Unofficial
- DiT: Self-supervised Pre-training for Document Image Transformer (Microsoft) | 22.03.04 | ACM MM22 | Code
- Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark (Huawei) | 22.2.14 | NIPS22 | Code
2021
- LayoutReader: Pre-training of Text and Layout for Reading Order Detection (Microsoft) | 21.08.26 | EMNLP21 | Code
- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding (Microsoft) | 21.04.18 | arXiv | Code
- Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer (Applica) | 21.02.18 | ICDAR21 | Code
2020
- LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding (Microsoft) | 20.12.29 | arXiv | Code
2019
- LayoutLM: Pre-training of Text and Layout for Document Image Understanding (Microsoft) | 19.12.31 | KDD20 | Code
🔮 MLLM
2025
- VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model (ZJU,Om AI) | 25.4.10
- Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources (UCSB,ByteDance,Nvidia) | 25.4.1
- R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization (NTU,THU) | 25.3.17
- R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization (ZJU,RUC,Tencent) | 25.3.13
- Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis (TJU,ByteDance) | 25.3.11
- LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL (CUHK,FDU,Ant Group) | 25.3.10
- Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models (ECNU,Xiaohongshu) | 25.3.9
2024
- FastVLM: Efficient Vision Encoding for Vision Language Models (Apple) | 24.12.17
- OmniVLM: A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference (NexaAI) | 24.12.16
- Improving Multi-modal Large Language Model through Boosting Vision Capabilities (NJUST,Baidu,HUST) | 24.10.17
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models (Allen,UW) | 24.9.25
- LLaVA-OneVision: Easy Visual Task Transfer (ByteDance,NTU,CUHK,HKUST) | 24.8.6
- MiniCPM-V: A GPT-4V Level MLLM on Your Phone (OpenBMB) | 24.8.3
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs (NYU) | 24.6.24
- X-VILA: Cross-Modality Alignment for Large Language Model (NVIDIA,HKUST,MIT) | 24.5.29 | arXiv | Code
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites (Shanghai AI Lab,SenseTime,THU,NJU,FDU,CUHK) | 24.04.25 | arXiv | Code
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies (THU,ModelBest) | 24.4.9
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (CUHK,SmartMore) | 24.3.27 | Code
- LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images (THU,NUS,UCAS) | 24.03.18 | arXiv | Code
- Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models (XMU) | 24.03.05 | arXiv | Code
- DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models (CUHK,Shanghai AI Lab) | 24.2.22 | arXiv | Code
- InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model (Shanghai AI Lab) | 24.01.29 | arXiv | Code
2023
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks (OpenGVLab,NJU,HKU,CUHK,THU,USTC,SenseTime) | 23.12.21 | CVPR24 | Code
- ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts (UWM,Cruise LLC) | 23.12.01 | CVPR24 | Code
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions (USTC,Shanghai AI Lab) | 23.11.28 | arXiv | Code
- MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning (KAUST,Meta) | 23.10.14 | arXiv | Code
- Improved Baselines with Visual Instruction Tuning (UWM,Microsoft) | 23.10.05 | arXiv | Code
- InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition (Shanghai AI Lab) | 23.09.26 | arXiv | Code
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond (Alibaba) | 23.08.24 | arXiv | Code
- MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action (Azure) | 23.05.20 | arXiv | Code
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning (Salesforce) | 23.05.11 | arXiv | Code
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality (DAMO) | 23.04.27 | arXiv | Code
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models (KAUST) | 23.04.20 | arXiv | Code
- Visual Instruction Tuning (UWM,Microsoft) | 23.04.17 | NeurIPS23 | Code
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (Salesforce) | 23.01.30 | arXiv | Code
2022
- Flamingo: a Visual Language Model for Few-Shot Learning (DeepMind) | 22.11.15 | NIPS22 | Code
🎯 Grounded MLLM
2024
- SpatialRGPT: Grounded Spatial Reasoning in Vision Language Model (UCSD,HKU,NVIDIA) | 24.6.3 | arXiv | Code
- Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models (HKU,ByteDance) | 24.4.19 | ECCV24 | Code
- Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models (CU,UCSB,Apple) | 24.04.11 | arXiv | Code
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (Apple) | 24.04.08 | arXiv | Code
- SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors (University of Oxford) | 24.3.18
- GroundingGPT: Language Enhanced Multi-modal Grounding Model (ByteDance,FDU) | 24.03.05 | arXiv | Code
2023
- LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models (HKUST,SCUT,IDEA,CUHK) | 23.12.05 | arXiv | Code
- Ferret: Refer and Ground Anything Anywhere at Any Granularity (CU,Apple) | 23.10.11 | arXiv | Code
- BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs (ByteDance) | 23.07.17 | arXiv | Code
- Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic (SenseTime,BUAA,SJTU) | 23.06.27 | arXiv | Code
- Kosmos-2: Grounding Multimodal Large Language Models to the World (Microsoft) | 23.06.26 | arXiv | Code
🎬 Video LLM
2024
- Artemis: Towards Referential Understanding in Complex Videos (UCAS,UB) | 24.6.1 | arXiv | Code
2023
- TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding (PKU,Noah) | 23.12.04 | CVPR24 | Code
- Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models (PKU,PengCheng,Microsoft,FarReel) | 23.11.27 | arXiv | Code
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection (PKU,PengCheng) | 23.11.16 | arXiv | Code
- Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding (PKU,PengCheng) | 23.11.14 | arXiv | Code
- Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding (DAMO) | 23.06.05 | arXiv | Code