| Wings: Learning Multimodal LLMs without Text-only Forgetting | arXiv | 2024-06-05 | - | - |
| Enhancing Multimodal Large Language Models with Multi-instance Visual Prompt Generator for Visual Representation Enrichment (MIVPG) | arXiv | 2024-06-05 | - | - |
| PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM | arXiv | 2024-06-05 | ![Star](https://img.shields.io/github/stars/posterllava/PosterLLaVA.svg?style=social&label=Star) | - |
| OLIVE: Object Level In-Context Visual Embeddings | ACL 2024 | 2024-06-02 | ![Star](https://img.shields.io/github/stars/tossowski/OLIVE.svg?style=social&label=Star) | - |
| X-VILA: Cross-Modality Alignment for Large Language Model (by NVIDIA) | arXiv | 2024-05-29 | - | ![Wechat](https://img.shields.io/badge/-WeChat@%E6%95%B0%E6%BA%90AI-000000?logo=wechat&logoColor=07C160) |
| ConvLLaVA: Hierarchical Backbones as Visual Encoder for Large Multimodal Models | arXiv | 2024-05-24 | ![Star](https://img.shields.io/github/stars/alibaba/conv-llava.svg?style=social&label=Star) | - |
| Prompt-Aware Adapter: Towards Learning Adaptive Visual Tokens for Multimodal Large Language Models | arXiv | 2024-05-24 | - | - |
| LOVA3: Learning to Visual Question Answering, Asking and Assessment | arXiv | 2024-05-23 | ![Star](https://img.shields.io/github/stars/showlab/LOVA3.svg?style=social&label=Star) | - |
| AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability | arXiv | 2024-05-23 | ![Star](https://img.shields.io/github/stars/AlignGPT-VL/AlignGPT.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts | arXiv | 2024-05-09 | ![Star](https://img.shields.io/github/stars/SHI-Labs/CuMo.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) ![Dataset](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue) ![Wechat](https://img.shields.io/badge/-WeChat@%E6%95%B0%E6%BA%90AI-000000?logo=wechat&logoColor=07C160) |
| Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers (Lumina-T2X, Flag-DiT) (Text2Any) | arXiv | 2024-05-09 | ![Star](https://img.shields.io/github/stars/Alpha-VLLM/Lumina-T2X.svg?style=social&label=Star) | ![YouTube](https://badges.aleen42.com/src/youtube.svg) ![Wechat](https://img.shields.io/badge/-WeChat@%E6%9C%BA%E5%99%A8%E4%B9%8B%E5%BF%83-000000?logo=wechat&logoColor=07C160) |
| ImageInWords: Unlocking Hyper-Detailed Image Descriptions (Google) | arXiv | 2024-05-05 | ![Star](https://img.shields.io/github/stars/google/imageinwords.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) ![Dataset](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue) |
| MANTIS: Interleaved Multi-Image Instruction Tuning | arXiv | 2024-05-02 | ![Star](https://img.shields.io/github/stars/TIGER-AI-Lab/Mantis.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing | - | 2024-04-25 | ![Star](https://img.shields.io/github/stars/SkyworkAI/Vitron.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) ![YouTube](https://badges.aleen42.com/src/youtube.svg) ![Wechat](https://img.shields.io/badge/-WeChat@%E6%96%B0%E6%99%BA%E5%85%83-000000?logo=wechat&logoColor=07C160) |
| Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs | CVPR 2024 Workshop | 2024-04-23 | - | ![Wechat](https://img.shields.io/badge/-WeChat@%E6%95%B0%E6%BA%90AI-000000?logo=wechat&logoColor=07C160) |
| SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation | arXiv | 2024-04-22 | ![Star](https://img.shields.io/github/stars/AILab-CVC/SEED-X.svg?style=social&label=Star) | - |
| Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models | arXiv | 2024-04-19 | ![Star](https://img.shields.io/github/stars/FoundationVision/Groma.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) ![Dataset](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue) |
| MoVA: Adapting Mixture of Vision Experts to Multimodal Context | arXiv | 2024-04-19 | ![Star](https://img.shields.io/github/stars/TempleX98/MoVA.svg?style=social&label=Star) | - |
| Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models | arXiv | 2024-04-18 | - | ![Project Page](https://img.shields.io/badge/chat-reka.ai-purple.svg) ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| AesExpert: Towards Multi-modality Foundation Model for Image Aesthetics Perception (AesExpert, AesMMIT Dataset) | arXiv | 2024-04-15 | ![Star](https://img.shields.io/github/stars/yipoh/AesExpert.svg?style=social&label=Star) | - |
| Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models (Ferret-v2) | arXiv | 2024-04-11 | - | - |
| MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies (MiniCPM series) | arXiv | 2024-04-09 | ![Star](https://img.shields.io/github/stars/OpenBMB/MiniCPM.svg?style=social&label=Star) ![Star](https://img.shields.io/github/stars/OpenBMB/MiniCPM-V.svg?style=social&label=Star) | ![Blog](https://img.shields.io/badge/Technical-Blog-orange.svg) |
| Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs (Ferret-UI) | arXiv | 2024-04-08 | - | - |
| MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding | CVPR 2024 | 2024-04-08 | ![Star](https://img.shields.io/github/stars/boheumd/MA-LMM.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| Koala: Key frame-conditioned long video-LLM | CVPR 2024 | 2024-04-05 | ![Star](https://img.shields.io/github/stars/rxtan2/Koala-video-llm.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens | arXiv | 2024-04-04 | ![Star](https://img.shields.io/github/stars/Vision-CAIR/MiniGPT4-video.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| LongVLM: Efficient Long Video Understanding via Large Language Models | arXiv | 2024-04-04 | ![Star](https://img.shields.io/github/stars/ziplab/LongVLM.svg?style=social&label=Star) | - |
| MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training | arXiv | 2024-03-14 | - | - |
| UniCode: Learning a Unified Codebook for Multimodal Large Language Models | arXiv | 2024-03-14 | - | - |
| Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context | arXiv | 2024-03-08 | - | ![Project Page](https://img.shields.io/badge/Google-Gemini-blue.svg) |
| Feast Your Eyes: Mixture-of-Resolution Adaptation for Multimodal Large Language Models | arXiv | 2024-03-05 | ![Star](https://img.shields.io/github/stars/luogen1996/LLaVA-HR.svg?style=social&label=Star) | - |
| RegionGPT: Towards Region Understanding Vision Language Model | CVPR 2024 | 2024-03-04 | - | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| All in an Aggregated Image for In-Image Learning | arXiv | 2024-02-28 | ![Star](https://img.shields.io/github/stars/AGI-Edgerunners/IIL.svg?style=social&label=Star) | - |
| Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners | CVPR 2024 | 2024-02-27 | ![Star](https://img.shields.io/github/stars/yzxing87/Seeing-and-Hearing.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages | arXiv | 2024-02-25 | - | - |
| LLMBind: A Unified Modality-Task Integration Framework | arXiv | 2024-02-22 | - | - |
| AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling | arXiv | 2024-02-19 | ![Star](https://img.shields.io/github/stars/OpenMOSS/AnyGPT.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model (ALLaVA) | arXiv | 2024-02-18 | ![Star](https://img.shields.io/github/stars/FreedomIntelligence/ALLaVA.svg?style=social&label=Star) | ![Demo Page](https://img.shields.io/badge/Demo-Page-purple.svg) ![Dataset](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue) |
| MobileVLM V2: Faster and Stronger Baseline for Vision Language Model | arXiv | 2024-02-06 | ![Star](https://img.shields.io/github/stars/Meituan-AutoML/MobileVLM.svg?style=social&label=Star) | - |
| Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization | arXiv | 2024-02-05 | ![Star](https://img.shields.io/github/stars/jy0205/LaVIT.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action | arXiv | 2023-12-28 | ![Star](https://img.shields.io/github/stars/allenai/unified-io-2.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices | arXiv | 2023-12-28 | ![Star](https://img.shields.io/github/stars/Meituan-AutoML/MobileVLM.svg?style=social&label=Star) | - |
| Generative Multimodal Models are In-Context Learners (Emu2) | CVPR 2024 | 2023-12-20 | ![Star](https://img.shields.io/github/stars/baaivision/Emu.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| Gemini: A Family of Highly Capable Multimodal Models | arXiv | 2023-12-19 | - | ![Project Page](https://img.shields.io/badge/Google-Gemini-blue.svg) |
| Osprey: Pixel Understanding with Visual Instruction Tuning | CVPR 2024 | 2023-12-15 | ![Star](https://img.shields.io/github/stars/CircleRadon/Osprey.svg?style=social&label=Star) | - |
| VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation | arXiv | 2023-12-14 | ![Star](https://img.shields.io/github/stars/AILab-CVC/VL-GPT.svg?style=social&label=Star) | - |
| Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models | arXiv | 2023-12-11 | ![Star](https://img.shields.io/github/stars/Ucas-HaoranWei/Vary.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| Prompt Highlighter: Interactive Control for Multi-Modal LLMs | CVPR 2024 | 2023-12-07 | ![Star](https://img.shields.io/github/stars/dvlab-research/Prompt-Highlighter.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| PixelLM: Pixel Reasoning with Large Multimodal Model | CVPR 2024 | 2023-12-04 | ![Star](https://img.shields.io/github/stars/MaverickRen/PixelLM.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| APoLLo: Unified Adapter and Prompt Learning for Vision Language Models | EMNLP 2023 | 2023-12-04 | ![Star](https://img.shields.io/github/stars/schowdhury671/APoLLo.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation | arXiv | 2023-11-30 | ![Star](https://img.shields.io/github/stars/microsoft/i-Code.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models | arXiv | 2023-11-28 | ![Star](https://img.shields.io/github/stars/dvlab-research/LLaMA-VID.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) ![Dataset](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue) |
| LLMGA: Multimodal Large Language Model based Generation Assistant | arXiv | 2023-11-27 | ![Star](https://img.shields.io/github/stars/dvlab-research/LLMGA.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| PG-Video-LLaVA: Pixel Grounding Large Video-Language Models | arXiv | 2023-11-22 | ![Star](https://img.shields.io/github/stars/mbzuai-oryx/Video-LLaVA.svg?style=social&label=Star) | - |
| ShareGPT4V: Improving Large Multi-Modal Models with Better Captions | arXiv | 2023-11-21 | ![Star](https://img.shields.io/github/stars/InternLM/InternLM-XComposer.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge | CVPR 2024 | 2023-11-20 | ![Star](https://img.shields.io/github/stars/rshaojimmy/JiuTian.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| Video-LLaVA: Learning United Visual Representation by Alignment Before Projection | arXiv | 2023-11-16 | ![Star](https://img.shields.io/github/stars/PKU-YuanGroup/Video-LLaVA.svg?style=social&label=Star) | - |
| mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration | arXiv | 2023-11-07 | ![Star](https://img.shields.io/github/stars/X-PLUG/mPLUG-Owl.svg?style=social&label=Star) | - |
| MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning | arXiv | 2023-10-14 | ![Star](https://img.shields.io/github/stars/Vision-CAIR/MiniGPT-4.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| EasyGen: Easing Multimodal Generation with a Bidirectional Conditional Diffusion Model and LLMs | arXiv | 2023-10-13 | ![Star](https://img.shields.io/github/stars/zxy556677/EasyGen.svg?style=social&label=Star) | - |
| Ferret: Refer and Ground Anything Anywhere at Any Granularity (Ferret) | ICLR 2024 | 2023-10-11 | ![Star](https://img.shields.io/github/stars/apple/ml-ferret.svg?style=social&label=Star) | - |
| Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models | arXiv | 2023-10-11 | ![Star](https://img.shields.io/github/stars/Zeqiang-Lai/Mini-DALLE3.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| Improved Baselines with Visual Instruction Tuning (LLaVA-1.5) | arXiv | 2023-10-05 | ![Star](https://img.shields.io/github/stars/haotian-liu/LLaVA.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| Kosmos-G: Generating Images in Context with Multimodal Large Language Models | ICLR 2024 | 2023-10-04 | ![Star](https://img.shields.io/github/stars/microsoft/unilm.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens | arXiv | 2023-10-03 | ![Star](https://img.shields.io/github/stars/eric-ai-lab/MiniGPT-5.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| Aligning Large Multimodal Models with Factually Augmented RLHF (LLaVA-RLHF, MMHal-Bench (hallucination)) | arXiv | 2023-09-25 | ![Star](https://img.shields.io/github/stars/llava-rlhf/LLaVA-RLHF.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) ![Dataset](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue) |
| DreamLLM: Synergistic Multimodal Comprehension and Creation | ICLR 2024 | 2023-09-20 | ![Star](https://img.shields.io/github/stars/RunpeiDong/DreamLLM.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning | ICLR 2024 | 2023-09-14 | ![Star](https://img.shields.io/github/stars/HaozheZhao/MIC.svg?style=social&label=Star) | - |
| NExT-GPT: Any-to-Any Multimodal LLM | arXiv | 2023-09-11 | ![Star](https://img.shields.io/github/stars/NExT-GPT/NExT-GPT.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization (LaVIT) | ICLR 2024 | 2023-09-09 | ![Star](https://img.shields.io/github/stars/jy0205/LaVIT.svg?style=social&label=Star) | - |
| Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond | arXiv | 2023-08-24 | ![Star](https://img.shields.io/github/stars/QwenLM/Qwen-VL.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/%E9%98%BF%E9%87%8C-%E9%80%9A%E4%B9%89%E5%8D%83%E9%97%AE-blue.svg) |
| Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages (VisCPM-Chat/Paint) | ICLR 2024 | 2023-08-23 | ![Star](https://img.shields.io/github/stars/OpenBMB/VisCPM.svg?style=social&label=Star) | - |
| Planting a SEED of Vision in Large Language Model | ICLR 2024 | 2023-07-16 | ![Star](https://img.shields.io/github/stars/AILab-CVC/SEED.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| Generative Pretraining in Multimodality (Emu1) | ICLR 2024 | 2023-07-11 | ![Star](https://img.shields.io/github/stars/baaivision/Emu.svg?style=social&label=Star) | - |
| SVIT: Scaling up Visual Instruction Tuning | arXiv | 2023-07-09 | ![Star](https://img.shields.io/github/stars/BAAI-DCAI/Visual-Instruction-Tuning.svg?style=social&label=Star) | ![Dataset](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue) |
| Kosmos-2: Grounding Multimodal Large Language Models to the World (Kosmos-2, GrIT Dataset) | arXiv | 2023-06-26 | ![Star](https://img.shields.io/github/stars/microsoft/unilm.svg?style=social&label=Star) | ![Demo](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue) ![Dataset](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue) |
| M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning | arXiv | 2023-06-07 | - | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) ![Dataset](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue) |
| Generating Images with Multimodal Language Models (GILL) | NeurIPS 2023 | 2023-05-26 | ![Star](https://img.shields.io/github/stars/kohjingyu/gill.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| Any-to-Any Generation via Composable Diffusion (CoDi-1) | NeurIPS 2023 | 2023-05-19 | ![Star](https://img.shields.io/github/stars/microsoft/i-Code.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities | EMNLP 2023 (Findings) | 2023-05-18 | ![Star](https://img.shields.io/github/stars/0nutation/SpeechGPT.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning | NeurIPS 2023 | 2023-05-11 | ![Star](https://img.shields.io/github/stars/salesforce/LAVIS.svg?style=social&label=Star) | - |
| MultiModal-GPT: A Vision and Language Model for Dialogue with Humans | arXiv | 2023-05-08 | ![Star](https://img.shields.io/github/stars/open-mmlab/Multimodal-GPT.svg?style=social&label=Star) | - |
| VPGTrans: Transfer Visual Prompt Generator across LLMs | NeurIPS 2023 | 2023-05-02 | ![Star](https://img.shields.io/github/stars/VPGTrans/VPGTrans.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality | arXiv | 2023-04-27 | ![Star](https://img.shields.io/github/stars/X-PLUG/mPLUG-Owl.svg?style=social&label=Star) | - |
| MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models | ICLR 2024 | 2023-04-20 | ![Star](https://img.shields.io/github/stars/Vision-CAIR/MiniGPT-4.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| Visual Instruction Tuning (LLaVA) | NeurIPS 2023 | 2023-04-17 | ![Star](https://img.shields.io/github/stars/haotian-liu/LLaVA.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) ![Dataset](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-blue) |
| Language Is Not All You Need: Aligning Perception with Language Models (Kosmos-1) | NeurIPS 2023 | 2023-02-27 | ![Star](https://img.shields.io/github/stars/microsoft/unilm.svg?style=social&label=Star) | - |
| Multimodal Chain-of-Thought Reasoning in Language Models | arXiv | 2023-02-02 | ![Star](https://img.shields.io/github/stars/amazon-science/mm-cot.svg?style=social&label=Star) | - |
| Grounding Language Models to Images for Multimodal Inputs and Outputs (FROMAGe) | ICML 2023 | 2023-01-31 | ![Star](https://img.shields.io/github/stars/kohjingyu/fromage.svg?style=social&label=Star) | ![Project Page](https://img.shields.io/badge/Project-Page-green.svg) |
| BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | ICML 2023 | 2023-01-30 | ![Star](https://img.shields.io/github/stars/salesforce/LAVIS.svg?style=social&label=Star) | - |
| Flamingo: a Visual Language Model for Few-Shot Learning | NeurIPS 2022 | 2022-04-29 | ![Star](https://img.shields.io/github/stars/mlfoundations/open_flamingo.svg?style=social&label=Star) | - |