Repositories in the multimodal-large-language-models topic
Ovis
A novel Multimodal Large Language Model (MLLM) architecture designed to structurally align visual and textual embeddings.
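The structural alignment in Ovis replaces the usual continuous projector with a learnable visual embedding table: each visual patch is mapped to a probability distribution over a discrete visual vocabulary, and its embedding is the probability-weighted sum of table rows, mirroring how text tokens index a textual embedding table. A minimal sketch of that idea (module names and dimensions here are illustrative, not the repository's actual API):

```python
import torch
import torch.nn as nn

class VisualEmbeddingTable(nn.Module):
    """Toy sketch of Ovis-style structural alignment: visual patches become
    probabilistic tokens over a learnable visual vocabulary, mirroring the
    discrete lookup used for text tokens."""

    def __init__(self, patch_dim=1024, vocab_size=8192, embed_dim=4096):
        super().__init__()
        self.to_logits = nn.Linear(patch_dim, vocab_size)  # ViT feature -> vocab logits
        self.table = nn.Embedding(vocab_size, embed_dim)   # visual embedding table

    def forward(self, patch_features):            # (batch, num_patches, patch_dim)
        probs = self.to_logits(patch_features).softmax(dim=-1)
        # Probability-weighted sum over table rows: a "soft" token lookup.
        return probs @ self.table.weight          # (batch, num_patches, embed_dim)

vis = VisualEmbeddingTable()
print(vis(torch.randn(2, 256, 1024)).shape)  # torch.Size([2, 256, 4096])
```

The resulting visual embeddings live in the same lookup-style embedding space as the LLM's text tokens, which is the "structural" part of the alignment.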
Parrot
🎉 The code repository for "Parrot: Multilingual Visual Instruction Tuning" in PyTorch.
lmms-finetune
A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v, and more.
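As a rough illustration of what such finetuning involves, the sketch below loads a LLaVA-1.5 checkpoint and attaches LoRA adapters; it uses the generic Hugging Face transformers/peft APIs, not lmms-finetune's own entry points:

```python
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load a LLaVA-1.5 checkpoint; the other supported model families follow
# the same pattern (processor + base model + parameter-efficient adapter).
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Attach LoRA adapters to the language model's attention projections so
# only a small fraction of the weights is trained.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```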
EVF-SAM
Official code of "EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model"
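"Early fusion" here means the text prompt and the image are encoded jointly by a vision-language encoder before SAM sees any prompt, and the fused embedding then drives SAM's mask decoder in place of point or box prompts. A toy sketch of that wiring (all modules and dimensions are placeholders, not the repository's classes):

```python
import torch
import torch.nn as nn

class EarlyFusionPromptHead(nn.Module):
    """Toy wiring for EVF-SAM-style early fusion: the fused text-image token
    from a joint encoder is projected into SAM's prompt-embedding space."""

    def __init__(self, fused_dim=768, prompt_dim=256):
        super().__init__()
        self.project = nn.Linear(fused_dim, prompt_dim)

    def forward(self, fused_cls):                 # (batch, fused_dim)
        # One fused token acts as a sparse prompt for SAM's mask decoder.
        return self.project(fused_cls).unsqueeze(1)  # (batch, 1, prompt_dim)

head = EarlyFusionPromptHead()
prompt = head(torch.randn(2, 768))
print(prompt.shape)  # torch.Size([2, 1, 256]); consumed as sparse prompt embeddings
```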
multimodal-chat
A multimodal chat interface with access to many external tools.
LLaMA-Omni
LLaMA-Omni is a low-latency, high-quality end-to-end speech interaction model built on Llama-3.1-8B-Instruct, aiming for GPT-4o-level speech capabilities.
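At a high level, this kind of end-to-end speech interaction chains a speech encoder, an adapter into the LLM's embedding space, and a streaming speech decoder on the output side. A schematic sketch of the adapter stage under those assumptions (component names and sizes are illustrative, not LLaMA-Omni's code):

```python
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """Toy adapter that downsamples speech-encoder frames and projects them
    into the LLM embedding space (schematic, not LLaMA-Omni's modules)."""

    def __init__(self, enc_dim=1280, llm_dim=4096, stride=5):
        super().__init__()
        self.stride = stride
        self.proj = nn.Linear(enc_dim * stride, llm_dim)

    def forward(self, speech_feats):              # (batch, frames, enc_dim)
        b, t, d = speech_feats.shape
        t = (t // self.stride) * self.stride      # drop remainder frames
        # Stack every `stride` consecutive frames, then project: fewer,
        # denser tokens keep the LLM prefix short for low latency.
        stacked = speech_feats[:, :t].reshape(b, t // self.stride, d * self.stride)
        return self.proj(stacked)                 # (batch, t // stride, llm_dim)

adapter = SpeechAdapter()
print(adapter(torch.randn(1, 103, 1280)).shape)  # torch.Size([1, 20, 4096])
```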
VITA
✨✨VITA: Towards Open-Source Interactive Omni Multimodal LLM
Video-MME
✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
MLVU
🔥🔥MLVU: Multi-task Long Video Understanding Benchmark
ml-slowfast-llava
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
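The "SlowFast" design builds the LLM's visual prefix from two complementary token streams extracted from the same video without any training: a slow pathway with few frames at full spatial detail, and a fast pathway with many frames pooled aggressively. A minimal sketch of that token budgeting (frame and token counts here are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_feats, slow_every=4, fast_pool=4):
    """frame_feats: (frames, h, w, dim) per-frame ViT features.
    Returns one token sequence combining both pathways."""
    t, h, w, d = frame_feats.shape
    # Slow pathway: a sparse subset of frames at full spatial resolution.
    slow = frame_feats[::slow_every].reshape(-1, d)
    # Fast pathway: every frame, spatially average-pooled to save tokens.
    fast = F.avg_pool2d(
        frame_feats.permute(0, 3, 1, 2), kernel_size=fast_pool
    ).permute(0, 2, 3, 1).reshape(-1, d)
    return torch.cat([slow, fast], dim=0)  # concatenated visual prefix

feats = torch.randn(16, 24, 24, 1024)
print(slowfast_tokens(feats).shape)  # 4*576 + 16*36 = 2880 tokens
```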