🌟 Awesome-Document-Understanding 
A curated list of awesome Document Understanding resources, including papers, code, and datasets.
Continuously updated 🤗
📋 Table of contents
- Milestone
- Document Understanding
- MLLM
- Grounded MLLM
- Video LLM
🏆 Milestone
MiniCPM
- MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe (OpenBMB) | 25.8.25
- MiniCPM-V: A GPT-4V Level MLLM on Your Phone (OpenBMB) | 24.8.3
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies (THU,ModelBest) | 24.4.9
- LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images (THU,NUS,UCAS) | 24.3.18
- Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages (THU,ShanghaiAILab,Zhihu,ModelBest) | 23.8.23
VILA
- NVILA: Efficient Frontier Visual Language Models (NVIDIA,MIT,UCB,TW,THU) | 24.12.5
- VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation (THU,MIT,NVIDIA,UCB,UCSD) | 24.9.6
- LongVILA: Scaling Long-Context Visual Language Models for Long Videos (NVIDIA,MIT,KAUST) | 24.8.19
- VILA2: VILA Augmented VILA (NVIDIA,MIT,UT-Austin) | 24.7.24
- X-VILA: Cross-Modality Alignment for Large Language Model (NVIDIA,HKUST,MIT) | 24.5.29
- VILA: On Pre-training for Visual Language Models (NVIDIA,MIT) | 23.12.12
LLaMA
- The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation (Meta) | 25.4.5
- The Llama 3 Herd of Models (Meta) | 24.7.13
- Llama 2: Open Foundation and Fine-Tuned Chat Models (Meta) | 23.7.18
- LLaMA: Open and Efficient Foundation Language Models (Meta) | 23.2.27
Qwen
- Qwen3-VL: Sharper Vision, Deeper Thought, Broader Action (Alibaba) | 25.9.23
- Qwen3-Omni Technical Report (Alibaba) | 25.9.22
- Qwen2.5-VL Technical Report (Alibaba) | 25.2.19
- Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (Alibaba) | 24.9.18
- Qwen2 Technical Report (Alibaba) | 24.7.15
- Qwen Technical Report (Alibaba) | 23.9.28
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond (Alibaba) | 23.8.24
Intern
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency (Shanghai AI Lab) | 25.8.25
- InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models (Shanghai AI Lab) | 25.4.14
- Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance (Shanghai AI Lab) | 24.10.21
- InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output (Shanghai AI Lab) | 24.7.3
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD (Shanghai AI Lab) | 24.4.9
- InternLM2 Technical Report (Shanghai AI Lab) | 24.3.26
- InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model (Shanghai AI Lab) | 24.1.29
- InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition (Shanghai AI Lab) | 23.9.26
- InternLM: A Multilingual Language Model with Progressively Enhanced Capabilities (Shanghai AI Lab) | 23.6.3
📑 Document Understanding
2025
- Towards Visual Text Grounding of Multimodal Large Language Model (Adobe,Maryland) | 25.4.7
- A Simple yet Effective Layout Token in Large Language Models for Document Understanding (ZJU,Alibaba) | 25.3.24
- MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding (Adobe) | 25.3.18
- PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks (Baidu) | 25.3.6
2024
- DocVLM: Make Your VLM an Efficient Reader (AWS) | 24.12.11
- TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens (Huawei) | 24.10.7 | arXiv | Code
- DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models (KAIST,AWS) | 24.10.4 | arXiv | Code
- mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding (Alibaba,RUC) | 24.9.5 | arXiv | Code
- General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model (StepFun,Megvii,UCAS,THU) | 24.9.3 | arXiv | Code
- LLaVA-Read: Enhancing Reading Ability of Multimodal Language Models (Adobe,Buffalo) | 24.7.27 | arXiv | Code
- Harmonizing Visual Text Comprehension and Generation (ECNU,ByteDance) | 24.7.23 | NIPS24 | Code
- A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding (FDU,ByteDance) | 24.7.2 | arXiv | Code
- Multimodal Table Understanding (UCAS,Baidu) | 24.06.12 | ACL24 | Code
- TRINS: Towards Multimodal Language Models that Can Read (Adobe,GIT) | 24.06.10 | CVPR24 | Code
- TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy (USTC,ByteDance) | 24.6.3 | arXiv | Code
- StrucTexTv3: An Efficient Vision-Language Model for Text-rich Image Perception, Comprehension, and Beyond (Baidu) | 24.5.31 | arXiv | Code
- Focus Anywhere for Fine-grained Multi-page Document Understanding (UCAS,MEGVII) | 24.5.23 | arXiv | Code
- MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering (ByteDance,HUST) | 24.5.20 | arXiv | Code
- Exploring the Capabilities of Large Multimodal Models on Dense Text (HUST) | 24.5.9 | arXiv | Code
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites (Shanghai AI Lab,CUHK,THU,NJU,FDU,SenseTime) | 24.4.25 | arXiv | Code
- TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding (USTC) | 24.4.15 | arXiv | Code
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD (Shanghai AI Lab,CUHK,THU,SenseTime) | 24.4.9 | arXiv | Code
- LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding (Alibaba,ZJU) | 24.4.8 | arXiv | Code
- Visual CoT: Unleashing Chain-of-Thought Reasoning in Multi-Modal Language Models (CUHK,Shanghai AI Lab,SenseTime) | 24.3.25 | arXiv | Code
- mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding (Alibaba,RUC) | 24.3.19 | arXiv | Code
- TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document (HUST) | 24.3.7 | arXiv | Code
- HRVDA: High-Resolution Visual Document Assistant (Tencent YouTu Lab,USTC) | 24.2.29 | CVPR24
- Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models (Tencent YouTu Lab) | 24.2.29 | CVPR24
- InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model (Shanghai AI Lab,CUHK,SenseTime) | 24.1.29 | arXiv | Code
- Small Language Model Meets with Reinforced Vision Vocabulary (MEGVII,UCAS,HUST) | 24.1.23 | arXiv | Code
2023
- DocLLM: A layout-aware generative language model for multimodal document understanding (JPMorgan AI Research) | 23.12.31 | arXiv
- Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models (MEGVII,UCAS,HUST) | 23.12.11 | ECCV24 | Code
- mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model (Alibaba) | 23.11.30 | arXiv | Code
- Towards Improving Document Understanding: An Exploration on Text-Grounding via MLLMs (USTC) | 23.11.22 | arXiv | Code
- DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding (USTC,ByteDance) | 23.11.20 | arXiv
- Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models (HUST) | 23.11.11 | CVPR24 | Code
- mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration (Alibaba) | 23.11.07 | CVPR24 | Code
- Exploring OCR Capabilities of GPT-4V(ision): A Quantitative and In-depth Evaluation (SCUT) | 23.10.25 | arXiv | Code
- UReader: Universal OCR-free Visually-situated Language Understanding with Multimodal Large Language Model (DAMO,RUC,ECNU) | 23.10.08 | arXiv | Code
- Kosmos-2.5: A Multimodal Literate Model (MSRA) | 23.9.20 | arXiv | Code
- BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions (UC San Diego) | 23.8.19 | AAAI24 | Code
- UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding (USTC,ByteDance) | 23.8.19 | arXiv
- mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding (DAMO) | 23.7.4 | arXiv | Code
- Document Understanding Dataset and Evaluation (DUDE) | 23.5.15 | arXiv | Website
- On the Hidden Mystery of OCR in Large Multimodal Models (HUST,SCUT,Microsoft) | 23.5.13 | arXiv | Code
- Visual Information Extraction in the Wild: Practical Dataset and End-to-end Solution (HUST) | 23.5.12 | arXiv | Code
- StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training (Baidu) | 23.03.01 | ICLR23 | Code
2022
- Wukong-Reader: Multi-modal Pre-training for Fine-grained Visual Document Understanding (Huawei) | 22.12.19 | ACL23
- Unifying Vision, Text, and Layout for Universal Document Processing (Microsoft) | 22.12.05 | CVPR23 | Code
- ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding (Baidu) | 22.10.12 | arXiv | Code
- Unified Pretraining Framework for Document Understanding (Adobe) | 22.04.22 | NIPS21
- LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking (Microsoft) | 22.04.18 | ACM MM22 | Code
- XYLayoutLM: Towards Layout-Aware Multimodal Networks For Visually-Rich Document Understanding (Alibaba) | 22.3.14 | CVPR22 | Code Unofficial
- DiT: Self-supervised Pre-training for Document Image Transformer (Microsoft) | 22.03.04 | ACM MM22 | Code
- Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark (Huawei) | 22.2.14 | NIPS22 | Code
2021
- LayoutReader: Pre-training of Text and Layout for Reading Order Detection (Microsoft) | 21.08.26 | EMNLP21 | Code
- LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding (Microsoft) | 21.04.18 | arXiv | Code
- Going Full-TILT Boogie on Document Understanding with Text-Image-Layout Transformer (Applica) | 21.02.18 | ICDAR21 | Code
2020
- LayoutLMv2: Multi-modal Pre-training for Visually-rich Document Understanding (Microsoft) | 20.12.29 | arXiv | Code
2019
- LayoutLM: Pre-training of Text and Layout for Document Image Understanding (Microsoft) | 19.12.31 | KDD20 | Code
🔮 MLLM
2025
- VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model (ZJU,Om AI) | 25.4.10
- Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources (UCSB,ByteDance,Nvidia) | 25.4.1
- R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization (NTU,THU) | 25.3.17
- R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization (ZJU,RUC,Tencent) | 25.3.13
- Oasis: One Image is All You Need for Multimodal Instruction Data Synthesis (TJU,ByteDance) | 25.3.11
- LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL (CUHK,FDU,Ant Group) | 25.3.10
- Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models (ECNU,Xiaohongshu) | 25.3.9
2024
- FastVLM: Efficient Vision Encoding for Vision Language Models (Apple) | 24.12.17
- OmniVLM: A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference (NexaAI) | 24.12.16
- Improving Multi-modal Large Language Model through Boosting Vision Capabilities (NJUST,Baidu,HUST) | 24.10.17
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models (Allen,UW) | 24.9.25
- LLaVA-OneVision: Easy Visual Task Transfer (ByteDance,NTU,CUHK,HKUST) | 24.8.6
- MiniCPM-V: A GPT-4V Level MLLM on Your Phone (OpenBMB) | 24.8.3
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs (NYU) | 24.6.24
- X-VILA: Cross-Modality Alignment for Large Language Model (NVIDIA,HKUST,MIT) | 24.5.29 | arXiv | Code
- How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites (Shanghai AI Lab,SenseTime,THU,NJU,FDU,CUHK) | 24.04.25 | arXiv | Code
- MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies (THU,ModelBest) | 24.4.9
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models (CUHK,SmartMore) | 24.3.27 | Code
- LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images (THU,NUS,UCAS) | 24.03.18 | arXiv | Code
- Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models (XMU) | 24.03.05 | arXiv | Code
- DualFocus: Integrating Macro and Micro Perspectives in Multi-modal Large Language Models (CUHK,Shanghai AI Lab) | 24.2.22 | arXiv | Code
- InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model (Shanghai AI Lab) | 24.01.29 | arXiv | Code
2023
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks (OpenGVLab,NJU,HKU,CUHK,THU,USTC,SenseTime) | 23.12.21 | CVPR24 | Code
- ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts (UWM,Cruise LLC) | 23.12.01 | CVPR24 | Code
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions (USTC,Shanghai AI Lab) | 23.11.28 | arXiv | Code
- MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning (KAUST,Meta) | 23.10.14 | arXiv | Code
- Improved Baselines with Visual Instruction Tuning (UWM,Microsoft) | 23.10.05 | arXiv | Code
- InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition (Shanghai AI Lab) | 23.09.26 | arXiv | Code
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond (Alibaba) | 23.08.24 | arXiv | Code
- MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action (Azure) | 23.05.20 | arXiv | Code
- InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning (Salesforce) | 23.05.11 | arXiv | Code
- mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality (DAMO) | 23.04.27 | arXiv | Code
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models (KAUST) | 23.04.20 | arXiv | Code
- Visual Instruction Tuning (UWM,Microsoft) | 23.04.17 | NeurIPS23 | Code
- BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (Salesforce) | 23.01.30 | arXiv | Code
2022
- Flamingo: a Visual Language Model for Few-Shot Learning (DeepMind) | 22.11.15 | NIPS22 | Code
🎯 Grounded MLLM
2024
- SpatialRGPT: Grounded Spatial Reasoning in Vision Language Model (UCSD,HKU,NVIDIA) | 24.6.3 | arXiv | Code
- Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models (HKU,ByteDance) | 24.4.19 | ECCV24 | Code
- Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models (CU,UCSB,Apple) | 24.04.11 | arXiv | Code
- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (Apple) | 24.04.08 | arXiv | Code
- SpatialPIN: Enhancing Spatial Reasoning Capabilities of Vision-Language Models through Prompting and Interacting 3D Priors (University of Oxford) | 24.3.18
- GroundingGPT: Language Enhanced Multi-modal Grounding Model (ByteDance,FDU) | 24.03.05 | arXiv | Code
2023
- LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models (HKUST,SCUT,IDEA,CUHK) | 23.12.05 | arXiv | Code
- Ferret: Refer and Ground Anything Anywhere at Any Granularity (CU,Apple) | 23.10.11 | arXiv | Code
- BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs (ByteDance) | 23.07.17 | arXiv | Code
- Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic (SenseTime,BUAA,SJTU) | 23.06.27 | arXiv | Code
- Kosmos-2: Grounding Multimodal Large Language Models to the World (Microsoft) | 23.06.26 | arXiv | Code
🎬 Video LLM
2024
- Artemis: Towards Referential Understanding in Complex Videos (UCAS,UB) | 24.6.1 | arXiv | Code
2023
- TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding (PKU,Noah) | 23.12.04 | CVPR24 | Code
- Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models (PKU,PengCheng,Microsoft,FarReel) | 23.11.27 | arXiv | Code
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection (PKU,PengCheng) | 23.11.16 | arXiv | Code
- Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding (PKU,PengCheng) | 23.11.14 | arXiv | Code
- Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding (DAMO) | 23.06.05 | arXiv | Code