Multi-Modal-Large-Language-Learning
Awesome multi-modal large language model papers and projects, plus collections of popular training strategies, e.g., PEFT, LoRA.
Multi-modal Large Language Model Collection 🦕
This is a curated list of Multi-modal Large Language Models (MLLM), Multimodal Benchmarks (MMB), Multimodal Instruction Tuning (MMIT), Multimodal In-context Learning (MMIL), Foundation Models (FM, e.g., the CLIP family), and the most popular Parameter-Efficient Tuning repos (PETR).
📒Table of Contents
- Alignment
- Multi-modal Large Language Models (MLLM)
- Multimodal Benchmarks (MMB)
- Foundation Models (FM)
- Parameter-Efficient Tuning Repo (PETR)
Alignment
- MDPO: Conditional Preference Optimization for Multimodal Large Language Models [arXiv 2024/06/17] [Paper]
  University of Southern California; University of California, Davis; Microsoft Research (see the DPO sketch after this list)
- RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness [arXiv 2024/05/27] [Paper] [Code]
  Department of Computer Science and Technology, Tsinghua University; NExT++ Lab, School of Computing, National University of Singapore
- RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback [CVPR 2024] [Paper] [Code] [Homepage]
  Tsinghua University; National University of Singapore; Shenzhen International Graduate School, Tsinghua University; Pengcheng Laboratory, Shenzhen, China
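
The three entries above are preference-alignment methods that build on Direct Preference Optimization (DPO). For orientation, below is a minimal sketch of the plain DPO loss in PyTorch; the function name and tensor arguments are illustrative and not taken from any of the listed codebases. MDPO's conditional variant additionally contrasts responses conditioned on the original versus a corrupted image, a term omitted here.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps, beta=0.1):
    """Plain DPO loss from per-example sequence log-probabilities.

    Each argument is a 1-D tensor holding the summed token log-probs of the
    chosen/rejected response under the trainable policy or the frozen
    reference model; beta scales the implicit reward.
    """
    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps)
    # Maximize the reward margin between chosen and rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```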
Multi-modal Large Language Models (MLLM)
- VILA2: VILA Augmented VILA [arXiv 2024/07/24] [Paper]
  NVIDIA; UT Austin; MIT
- ShareGPT4V: Improving Large Multi-Modal Models with Better Captions [ECCV 2024] [Paper] [Code] [Homepage]
  University of Science and Technology of China; Shanghai AI Laboratory
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection [arXiv 2024/02/12] [Paper] [Code]
  Peking University; Peng Cheng Laboratory; Sun Yat-sen University, Guangzhou; Tencent Data Platform; AI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School; FarReel Ai Lab
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [arXiv 2024/02/12] [Paper] [Code] [Evaluation]
  Stanford; Toyota Research Institute
- Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models [arXiv 2024/03/27] [Paper] [Code] [Project Page]
  The Chinese University of Hong Kong; SmartMore
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks [arXiv 2024/01/15] [Paper] [Code]
  OpenGVLab, Shanghai AI Laboratory; Nanjing University; The University of Hong Kong; The Chinese University of Hong Kong; Tsinghua University; University of Science and Technology of China; SenseTime Research
- GiT: Towards Generalist Vision Transformer through Universal Language Interface [arXiv 2024/03/14] [Paper]
  Peking University; Max Planck Institute for Informatics; The Chinese University of Hong Kong, Shenzhen; ETH Zurich; The Chinese University of Hong Kong
- LLaMA: Open and Efficient Foundation Language Models [arXiv 2023] [Paper] [GitHub Repo]
  Meta AI
Multimodal Benchmarks (MMB)
- MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs [arXiv 2024/06/17] [Paper] [Code] [HomePage] [Space🤗]
  Wuhan University; Shanghai AI Laboratory; The Chinese University of Hong Kong; MThreads, Inc.
Foundation Models (FM)
Parameter-Efficient Tuning Repo (PETR)
- PEFT: Parameter-Efficient Fine-Tuning [HuggingFace 🤗] [Home Page] [Code]
  PEFT (Parameter-Efficient Fine-Tuning) is a library for efficiently adapting pre-trained language models (PLMs) to various downstream applications without fine-tuning all of the model's parameters (see the LoRA sketch after this list).
- LLaMA Efficient Tuning [GitHub Repo]
  Easy-to-use fine-tuning framework using PEFT (PT+SFT+RLHF with QLoRA) (LLaMA-2, BLOOM, Falcon, Baichuan, Qwen).
- LLaMA-Adapter: Efficient Fine-tuning of LLaMA 🚀 [Code]
  Fine-tuning LLaMA to follow instructions within 1 hour and with 1.2M parameters.
- LLaMA2-Accessory 🚀 [Code]
  An open-source toolkit for LLM development.
- LLaMA Factory: Training and Evaluating Large Language Models with Minimal Effort [Code]
  Easy-to-use LLM fine-tuning framework (LLaMA-2, BLOOM, Falcon, Baichuan, Qwen, ChatGLM3).
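
As a companion to the PEFT entry above, here is a minimal sketch of LoRA fine-tuning with the Hugging Face `peft` library; the base-model checkpoint and hyperparameter values are illustrative placeholders rather than recommendations.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base checkpoint; any causal LM on the Hub works the same way.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling applied to the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the injected adapters are trainable
```

The wrapped model then trains like any `transformers` model (e.g., with `Trainer`), while the frozen base weights receive no gradient updates.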