Awesome-LVLM-paper

:sunglasses: A list of papers about large multimodal models

:sunglasses: Awesome-LVLMs

Related Collection

Our Paper Reading List

Topic Description
LVLM Model Large multimodal models / Foundation Model
Multimodal Benchmark :heart_eyes: Interesting Multimodal Benchmark
LVLM Agent Agent & Application of LVLM
LVLM Hallucination Benchmark & Methods for Hallucination

:building_construction: LVLM Models

Title Venue/Date Note Code Demo Picture
Star
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
NeurIPS 2023 InstructBLIP Github Local Demo instrucblip
Star
Visual Instruction Tuning
NeurIPS 2023 LLaVA GitHub Demo llava
Star
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model
2023-04 LLaMA Adapter v2 Github Demo llama
Star
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
2023-04 mPLUG Github Demo image-20241221163809570
Star
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models
2023-04 MiniGPT-4 Github - minigpt-4
Star
TextBind: Multi-turn Interleaved Multimodal Instruction-following
2023-09 TextBind Github Demo textbind
Star
BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing
2023-09 BLIP-Diffusion Github Demo blip-diffusion
Star
NExT-GPT: Any-to-Any Multimodal LLM
2023-09 NExT-GPT Github Demo next-gpt
Star
Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions
ICLR 2024 Multi-image Reasoning Github - VPG
Star
Ferret: Refer and Ground Anything Anywhere at Any Granularity
ICLR 2024 Grounding Github - ferret
Star
LLaVA-OneVision: Easy Visual Task Transfer
Technical Report 2024-07 LLaVA-OV: Blog with details Project image-20241221110841873
Star
Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution
Technical Report 2024-10 Qwen2-VL: Dynamic resolution & Multi-images & Video Github Demo image-20241221105930185
Star
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Technical Report 2024-12 DeepSeek-VL2: MoE (Tiny: 1B, Small: 3B, DeepSeek-VL2: 5B) Github image-20241221110551477
Star
DeepSeek-V3 Technical Report
Technical Report 2024-12 🧠 671B MoE parameters, 🚀 37B activated, 📚 14.8T tokens Blog Project image-20241228113108108

:calendar: Multimodal Benchmark

Title Venue/Date Note Code Demo Picture
Star
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
CVPR 2024 11K Multimodal Questions Reasoning Benchmark project algebraic reasoning
Star
M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought
ACL 2024 Multimodal CoT: Multi-step visual modal reasoning project image-20241221112255186
Star
Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor
ACM MM 2024 Multimodal Correction Github image-20241221112534221
Star
Right this way: Can VLMs Guide Us to See More to Answer Questions?
NeurIPS 2024 For visually impaired people Github image-20241221163141433
Star
FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models
NeurIPS 2024 Multimodal Refinement 100K data Project image-20241226104808366
Star
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model
EMNLP 2024 Abstract Image Reasoning Benchmark Project image-20241227103310143
Star
Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning
AAAI 2025 Math Reasoning & Weak2Strong Data Github image-20241226105753500
Star
Multimodal Situational Safety
ICLR 2025 Submission (Positive Score) Multimodal Safety Benchmark Project image-20241223102926454
Star
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
ICLR 2025 Submission (Positive Score) MMMU in Video QA Project image-20241223103309015

:control_knobs: LVLM Agent

Title Venue/Date Note Code Demo Picture
Star
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
2023-03 MM-REACT Github Demo mm-react
Star
Visual Programming: Compositional visual reasoning without training
CVPR 2023 Best Paper VISPROG (Similar to ViperGPT) Github Local Demo vp
Star
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace
2023-03 HuggingGPT Github Demo huggingface-gpt
Star
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
2023-04 Chameleon Github Demo chameleon
Star
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models
2023-05 IdealGPT Github Local Demo ideal-gpt
Star
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn
2023-06 AssistGPT Github - assist-gpt
Star
A Picture Is Worth a Graph: A Blueprint Debate Paradigm for Multimodal Reasoning
ACM MM 2024 Multi-Agent Debate Github image-20241221111626526
Star
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
NeurIPS 2024 Draw to facilitate reasoning Project image-20241225110818819

:face_with_head_bandage: LVLM Hallucination

Title Venue/Date Note Code Demo Picture
Star
Evaluating Object Hallucination in Large Vision-Language Models
EMNLP 2023 Simple Object Hallucination Evaluation - POPE Github - pope
Star
Evaluation and Analysis of Hallucination in Large Vision-Language Models
2023-10 Hallucination Evaluation - HaELM Github - HaELM
Star
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning
2023-06 GPT4-Assisted Visual Instruction Evaluation (GAVIE) & LRV-Instruction Github Demo gavie
Star
Woodpecker: Hallucination Correction for Multimodal Large Language Models
2023-10 First work to correct hallucinations in LVLMs Github Demo Woodpecker
Star
Can We Edit Multimodal Large Language Models?
EMNLP 2023 Knowledge Editing Benchmark Github - mm-edit
Star
Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?
EMNLP 2023 Similar to human illusion? Github - illusion
Star
VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
2024-11 Vision-language generative reward project image-20241221163651585