Awesome-LVLM-paper

:sunglasses: A list of papers about large multimodal models

:sunglasses: Awesome-LVLMs

Related Collection

Our Paper Reading List

Topic	Description
LVLM Model	Large multimodal models / Foundation Model
Multimodal Benchmark	:heart_eyes: Interesting Multimodal Benchmark
LVLM Agent	Agent & Application of LVLM
LVLM Hallucination	Benchmark & Methods for Hallucination

:building_construction: LVLM Models

Title	Venue/Date	Note	Code	Demo	Picture
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning	NeurIPS 2023	InstructBLIP	Github	Local Demo
Visual Instruction Tuning	NeurIPS 2023	LLaVA	GitHub	Demo
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model	2023-04	LLaMA Adapter v2	Github	Demo
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality	2023-04	mPLUG	Github	Demo
MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models	2023-04	MiniGPT-4	Github	-
TextBind: Multi-turn Interleaved Multimodal Instruction-following	2023-09	TextBind	Github	Demo
BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing	2023-09	BLIP-Diffusion	Github	Demo
NExT-GPT: Any-to-Any Multimodal LLM	2023-09	NExT-GPT	Github	Demo
Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions	ICLR 2024	Multi-image Reasoning	Github	-
Ferret: Refer and Ground Anything Anywhere at Any Granularity	ICLR 2024	Grounding	Github	-
LLaVA-OneVision: Easy Visual Task Transfer	Technical Report 2024-7	LLaVA-OV: Blog with details	Project
Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution	Technical Report 2024-10	Qwen2-VL: Dynamic resolution & Multi-images & Video	Github	Demo
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding	Technical Report 2024-12	DeepSeek-VL2: MoE; Tiny: 1B, Small: 3B, DeepSeek-VL2: 5B	Github
DeepSeek-V3 Technical Report	Technical Report 2024-12	🧠 671B MoE parameters 🚀 37B activated 📚 14.8T tokens Blog	Project

:calendar: Multimodal Benchmark

Title	Venue/Date	Note	Code	Demo	Picture
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI	CVPR 2024	11K Multimodal Questions Reasoning Benchmark	Project
M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought	ACL 2024	Multimodal CoT: Multi-step visual modal reasoning	Project
Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor	ACM MM 2024	Multimodal Correction	Github
Right this way: Can VLMs Guide Us to See More to Answer Questions?	NeurIPS 2024	For visually impaired people	Github
FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models	NeurIPS 2024	Multimodal Refinement 100K data	Project
Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model	EMNLP 2024	Abstract Image Reasoning Benchmark	Project
:star: Math-PUMA: Progressive Upward Multimodal Alignment to Enhance Mathematical Reasoning	AAAI 2025	Math Reasoning & Weak2Strong Data	Github
Multimodal Situational Safety	ICLR 2025 Submission (Positive Score)	Multimodal Safety Benchmark	Project
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos	ICLR 2025 Submission (Positive Score)	MMMU in Video QA	Project

:control_knobs: LVLM Agent

Title	Venue/Date	Note	Code	Demo	Picture
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action	2023-03	MM-REACT	Github	Demo
Visual Programming: Compositional visual reasoning without training	CVPR 2023 Best Paper	VISPROG (Similar to ViperGPT)	Github	Local Demo
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace	2023-03	HuggingGPT	Github	Demo
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models	2023-04	Chameleon	Github	Demo
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models	2023-05	IdealGPT	Github	Local Demo
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn	2023-06	AssistGPT	Github	-
A Picture Is Worth a Graph: A Blueprint Debate Paradigm for Multimodal Reasoning	ACM MM 2024	Multi-Agent Debate	Github
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models	NeurIPS 2024	Draw to facilitate reasoning	Project

:face_with_head_bandage: LVLM Hallucination

Title	Venue/Date	Note	Code	Demo	Picture
Evaluating Object Hallucination in Large Vision-Language Models	EMNLP 2023	Simple Object Hallucination Evaluation - POPE	Github	-
Evaluation and Analysis of Hallucination in Large Vision-Language Models	2023-10	Hallucination Evaluation - HaELM	Github	-
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning	2023-06	GPT4-Assisted Visual Instruction Evaluation (GAVIE) & LRV-Instruction	Github	Demo
Woodpecker: Hallucination Correction for Multimodal Large Language Models	2023-10	First work to correct hallucinations in LVLMs	Github	Demo
Can We Edit Multimodal Large Language Models?	EMNLP 2023	Knowledge Editing Benchmark	Github	-
Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?	EMNLP 2023	Similar to human illusion?	Github	-
VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models	2024-11	Vision-language generative reward	Project
