| Title | Venue | Date | Code | Demo |
| ----- | ----- | ---- | ---- | ---- |
| Wings: Learning Multimodal LLMs without Text-only Forgetting | arXiv | 2024-06-05 | - | - |
| Enhancing Multimodal Large Language Models with Multi-instance Visual Prompt Generator for Visual Representation Enrichment (MIVPG) | arXiv | 2024-06-05 | - | - |
| PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM | arXiv | 2024-06-05 |  | - |
| OLIVE: Object Level In-Context Visual Embeddings | ACL 2024 | 2024-06-02 |  | - |
| X-VILA: Cross-Modality Alignment for Large Language Model (by NVIDIA) | arXiv | 2024-05-29 | - |  |
| ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | arXiv | 2024-05-24 |  | - |
| Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models | arXiv | 2024-05-24 | - | - |
| LOVA3: Learning to Visual Question Answering, Asking and Assessment | arXiv | 2024-05-23 |  | - |
| AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability | arXiv | 2024-05-23 |  |  |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | arXiv | 2024-05-09 |  |  |
| Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers (Lumina-T2X, Flag-DiT) (Text2Any) | arXiv | 2024-05-09 |  |  |
| ImageInWords: Unlocking Hyper-Detailed Image Descriptions (Google) | arXiv | 2024-05-05 |  |  |
| MANTIS: Interleaved Multi-Image Instruction Tuning | arXiv | 2024-05-02 |  |  |
| VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing | - | 2024-04-25 |  |  |
| Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs | CVPR 2024 Workshop | 2024-04-23 | - |  |
| SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation | arXiv | 2024-04-22 |  | - |
| Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models | arXiv | 2024-04-19 |  |  |
| MoVA: Adapting Mixture of Vision Experts to Multimodal Context | arXiv | 2024-04-19 |  | - |
| Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models | arXiv | 2024-04-18 | - |  |
| AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception (AesExpert, AesMMIT Dataset) | arXiv | 2024-04-15 |  | - |
| Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models (Ferret-v2) | arXiv | 2024-04-11 | - | - |
| MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies (MiniCPM series) | arXiv | 2024-04-09 |  |  |
| Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (Ferret-UI) | arXiv | 2024-04-08 | - | - |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | CVPR 2024 | 2024-04-08 |  |  |
| Koala: Key frame-conditioned long video-LLM | CVPR 2024 | 2024-04-05 |  |  |
| MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | arXiv | 2024-04-04 |  |  |
| LongVLM: Efficient Long Video Understanding via Large Language Models | arXiv | 2024-04-04 |  | - |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | arXiv | 2024-03-14 | - | - |
| UniCode: Learning a Unified Codebook for Multimodal Large Language Models | arXiv | 2024-03-14 | - | - |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | arXiv | 2024-03-08 | - |  |
| Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models | arXiv | 2024-03-05 |  | - |
| RegionGPT: Towards Region Understanding Vision Language Model | CVPR 2024 | 2024-03-04 | - |  |
| All in an Aggregated Image for In-Image Learning | arXiv | 2024-02-28 |  | - |
| Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners | CVPR 2024 | 2024-02-27 |  |  |
| TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages | arXiv | 2024-02-25 | - | - |
| LLMBind: A Unified Modality-Task Integration Framework | arXiv | 2024-02-22 | - | - |
| AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | arXiv | 2024-02-19 |  |  |
| ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model (ALLaVA) | arXiv | 2024-02-18 |  |  |
| MobileVLM V2: Faster and Stronger Baseline for Vision Language Model | arXiv | 2024-02-06 |  | - |
| Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | arXiv | 2024-02-05 |  |  |
| Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action | arXiv | 2023-12-28 |  |  |
| MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices | arXiv | 2023-12-28 |  | - |
| Generative Multimodal Models are In-Context Learners (Emu2) | CVPR 2024 | 2023-12-20 |  |  |
| Gemini: A Family of Highly Capable Multimodal Models | arXiv | 2023-12-19 | - |  |
| Osprey: Pixel Understanding with Visual Instruction Tuning | CVPR 2024 | 2023-12-15 |  | - |
| VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation | arXiv | 2023-12-14 |  | - |
| Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | arXiv | 2023-12-11 |  |  |
| Prompt Highlighter: Interactive Control for Multi-Modal LLMs | CVPR 2024 | 2023-12-07 |  |  |
| PixelLM: Pixel Reasoning with Large Multimodal Model | CVPR 2024 | 2023-12-04 |  |  |
| APoLLo: Unified Adapter and Prompt Learning for Vision Language Models | EMNLP 2023 | 2023-12-04 |  |  |
| CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation | arXiv | 2023-11-30 |  |  |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | arXiv | 2023-11-28 |  |  |
| LLMGA: Multimodal Large Language Model based Generation Assistant | arXiv | 2023-11-27 |  |  |
| PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | arXiv | 2023-11-22 |  | - |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | arXiv | 2023-11-21 |  |  |
| LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge | CVPR 2024 | 2023-11-20 |  |  |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | arXiv | 2023-11-16 |  | - |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | arXiv | 2023-11-07 |  | - |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | arXiv | 2023-10-14 |  |  |
| EasyGen: Easing Multimodal Generation with a Bidirectional Conditional Diffusion Model and LLMs | arXiv | 2023-10-13 |  | - |
| Ferret: Refer and Ground Anything Anywhere at Any Granularity (Ferret) | ICLR 2024 | 2023-10-11 |  | - |
| Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models | arXiv | 2023-10-11 |  |  |
| Improved Baselines with Visual Instruction Tuning (LLaVA-1.5) | arXiv | 2023-10-05 |  |  |
| Kosmos-G: Generating Images in Context with Multimodal Large Language Models | ICLR 2024 | 2023-10-04 |  |  |
| MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens | arXiv | 2023-10-03 |  |  |
| Aligning Large Multimodal Models with Factually Augmented RLHF (LLaVA-RLHF, MMHal-Bench (hallucination)) | arXiv | 2023-09-25 |  |  |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | ICLR 2024 | 2023-09-20 |  |  |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | ICLR 2024 | 2023-09-14 |  | - |
| NExT-GPT: Any-to-Any Multimodal LLM | arXiv | 2023-09-11 |  |  |
| Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization (LaVIT) | ICLR 2024 | 2023-09-09 |  | - |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | arXiv | 2023-08-24 |  |  |
| Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages (VisCPM-Chat/Paint) | ICLR 2024 | 2023-08-23 |  | - |
| Planting a SEED of Vision in Large Language Model | ICLR 2024 | 2023-07-16 |  |  |
| Generative Pretraining in Multimodality (Emu1) | ICLR 2024 | 2023-07-11 |  | - |
| SVIT: Scaling up Visual Instruction Tuning | arXiv | 2023-07-09 |  |  |
| Kosmos-2: Grounding Multimodal Large Language Models to the World (Kosmos-2, GrIT Dataset) | arXiv | 2023-06-26 |  |  |
| M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | arXiv | 2023-06-07 | - |  |
| Generating Images with Multimodal Language Models (GILL) | NeurIPS 2023 | 2023-05-26 |  |  |
| Any-to-Any Generation via Composable Diffusion (CoDi-1) | NeurIPS 2023 | 2023-05-19 |  |  |
| SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities | EMNLP 2023 (Findings) | 2023-05-18 |  |  |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | NeurIPS 2023 | 2023-05-11 |  | - |
| MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | arXiv | 2023-05-08 |  | - |
| VPGTrans: Transfer Visual Prompt Generator across LLMs | NeurIPS 2023 | 2023-05-02 |  |  |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | arXiv | 2023-04-27 |  | - |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | ICLR 2024 | 2023-04-20 |  |  |
| Visual Instruction Tuning (LLaVA) | NeurIPS 2023 | 2023-04-17 |  |  |
| Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1) | NeurIPS 2023 | 2023-02-27 |  | - |
| Multimodal Chain-of-Thought Reasoning in Language Models | arXiv | 2023-02-02 |  | - |
| Grounding Language Models to Images for Multimodal Inputs and Outputs (FROMAGe) | ICML 2023 | 2023-01-31 |  |  |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ICML 2023 | 2023-01-30 |  | - |
| Flamingo: a Visual Language Model for Few-Shot Learning | NeurIPS 2022 | 2022-04-29 |  | - |