Multi-Modal-Large-Language-Learning
Awesome multi-modal large language model papers and projects, plus collections of popular training strategies, e.g., PEFT, LoRA.
Multi-modal Large Language Model Collection 🦕
This is a curated list of Multi-modal Large Language Models (MLLM), Multimodal Benchmarks (MMB), Multimodal Instruction Tuning (MMIT), Multimodal In-context Learning (MMIL), Foundation Models (FM, e.g., the CLIP family), and the most popular Parameter-Efficient Tuning repos (PETR).
📒Table of Contents
- Alignment
- Multi-modal Large Language Models (MLLM)
- Multimodal Benchmarks (MMB)
- Foundation Models (FM)
- Parameter-Efficient Tuning Repo (PETR)
Alignment
- MDPO: Conditional Preference Optimization for Multimodal Large Language Models [arXiv 2024/06/17] [Paper]
  University of Southern California; University of California, Davis; Microsoft Research (see the DPO sketch after this list)
- RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness [arXiv 2024/05/27] [Paper] [Code]
  Department of Computer Science and Technology, Tsinghua University; NExT++ Lab, School of Computing, National University of Singapore
- RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback [CVPR 2024] [Paper] [Code] [Homepage]
  Tsinghua University; National University of Singapore; Shenzhen International Graduate School, Tsinghua University; Pengcheng Laboratory, Shenzhen, China
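
The three entries above are preference-alignment methods that build on Direct Preference Optimization (DPO). For orientation, below is a minimal sketch of the plain DPO loss in PyTorch; the function name and tensor arguments are illustrative and not taken from any of the listed codebases. MDPO's conditional variant additionally contrasts responses conditioned on the original versus a corrupted image, a term omitted here.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps, beta=0.1):
    """Plain DPO loss from per-example sequence log-probabilities.

    Each argument is a 1-D tensor holding the summed token log-probs of the
    chosen/rejected response under the trainable policy or the frozen
    reference model; beta scales the implicit reward.
    """
    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps)
    # Maximize the reward margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```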
Multi-modal Large Language Models (MLLM)
- VILA2: VILA Augmented VILA [arXiv 2024/07/24] [Paper]
  NVIDIA; UT Austin; MIT
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions [ECCV 2024] [Paper] [Code] [Homepage]
  University of Science and Technology of China; Shanghai AI Laboratory
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection [arXiv 2024/02/12] [Paper] [Code]
  Peking University; Peng Cheng Laboratory; Sun Yat-sen University, Guangzhou; Tencent Data Platform; AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School; FarReel Ai Lab
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [arXiv 2024/02/12] [Paper] [Code] [Evaluation]
  Stanford; Toyota Research Institute
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models [arXiv 2024/03/27] [Paper] [Code] [Project Page]
  The Chinese University of Hong Kong; SmartMore
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks [arXiv 2024/01/15] [Paper] [Code]
  OpenGVLab, Shanghai AI Laboratory; Nanjing University; The University of Hong Kong; The Chinese University of Hong Kong; Tsinghua University; University of Science and Technology of China; SenseTime Research
- GiT: Towards Generalist Vision Transformer through Universal Language Interface [arXiv 2024/03/14] [Paper]
  Peking University; Max Planck Institute for Informatics; The Chinese University of Hong Kong, Shenzhen; ETH Zurich; The Chinese University of Hong Kong
- LLaMA: Open and Efficient Foundation Language Models [arXiv 2023] [Paper] [GitHub Repo]
  Meta AI
Multimodal Benchmarks (MMB)
- MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs [arXiv 2024/06/17] [Paper] [Code] [HomePage] [Space🤗]
  Wuhan University; Shanghai AI Laboratory; The Chinese University of Hong Kong; MThreads, Inc.
Foundation Models (FM)
Parameter-Efficient Tuning Repo (PETR)
- PEFT: Parameter-Efficient Fine-Tuning [HuggingFace 🤗] [Home Page] [Code]
  PEFT (Parameter-Efficient Fine-Tuning) is a library for efficiently adapting pre-trained language models (PLMs) to various downstream applications without fine-tuning all of the model's parameters (see the LoRA sketch after this list).
- LLaMA Efficient Tuning [GitHub Repo]
  Easy-to-use fine-tuning framework using PEFT (PT+SFT+RLHF with QLoRA) (LLaMA-2, BLOOM, Falcon, Baichuan, Qwen).
- LLaMA-Adapter: Efficient Fine-tuning of LLaMA 🚀 [Code]
  Fine-tuning LLaMA to follow instructions within 1 hour and with 1.2M parameters.
- LLaMA2-Accessory 🚀 [Code]
  An open-source toolkit for LLM development.
- LLaMA Factory: Training and Evaluating Large Language Models with Minimal Effort [Code]
  Easy-to-use LLM fine-tuning framework (LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, ChatGLM3).
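
As a companion to the PEFT entry above, here is a minimal sketch of LoRA fine-tuning with the Hugging Face `peft` library; the base-model checkpoint and hyperparameter values are illustrative placeholders rather than recommendations.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base checkpoint; any causal LM on the Hub works the same way.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling applied to the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the injected adapters are trainable
```

The wrapped model then trains like any `transformers` model (e.g., with `Trainer`), while the frozen base weights receive no gradient updates.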