visual-language-learning topic
LLaVA
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
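A minimal inference sketch for trying LLaVA, assuming the community-converted llava-hf/llava-1.5-7b-hf checkpoint on the Hugging Face Hub and its transformers integration; the image path and prompt wording are placeholders, not taken from the LLaVA repository itself.

    from PIL import Image
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    model_id = "llava-hf/llava-1.5-7b-hf"  # assumed community-converted checkpoint
    processor = AutoProcessor.from_pretrained(model_id)
    model = LlavaForConditionalGeneration.from_pretrained(model_id)

    image = Image.open("example.jpg")  # placeholder path to any local image
    # LLaVA-1.5-style chat prompt: the <image> token marks where image features are inserted
    prompt = "USER: <image>\nDescribe this image. ASSISTANT:"

    inputs = processor(text=prompt, images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=100)
    print(processor.decode(output_ids[0], skip_special_tokens=True))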
Otter
🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
BLIVA
(AAAI 2024) BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions
llava-docker
Docker image for LLaVA: Large Language and Vision Assistant
NExT-GPT
Code and models for NExT-GPT: Any-to-Any Multimodal Large Language Model
InternLM-XComposer
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
RLHF-V
[CVPR'24] RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
KarmaVLM
🧘🏻‍♂️ KarmaVLM (相生): A family of high-efficiency, powerful visual language models.
Open-LLaVA-NeXT
An open-source implementation for training LLaVA-NeXT.
llama-multimodal-vqa
Multimodal Instruction Tuning for Llama 3