Repositories in the multimodal-large-language-models topic
Ovis
A novel Multimodal Large Language Model (MLLM) architecture designed to structurally align visual and textual embeddings.
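The structural alignment in Ovis replaces the usual continuous projector with a learnable visual embedding table: each visual patch is mapped to a probability distribution over a discrete visual vocabulary, and its embedding is the probability-weighted sum of table rows, mirroring how text tokens index a textual embedding table. A minimal sketch of that idea (module names and dimensions here are illustrative, not the repository's actual API):

```python
import torch
import torch.nn as nn

class VisualEmbeddingTable(nn.Module):
    """Toy sketch of Ovis-style structural alignment: visual patches become
    probabilistic tokens over a learnable visual vocabulary, mirroring the
    discrete lookup used for text tokens."""

    def __init__(self, patch_dim=1024, vocab_size=8192, embed_dim=4096):
        super().__init__()
        self.to_logits = nn.Linear(patch_dim, vocab_size)  # ViT feature -> vocab logits
        self.table = nn.Embedding(vocab_size, embed_dim)   # visual embedding table

    def forward(self, patch_features):            # (batch, num_patches, patch_dim)
        probs = self.to_logits(patch_features).softmax(dim=-1)
        # Probability-weighted sum over table rows: a "soft" token lookup.
        return probs @ self.table.weight          # (batch, num_patches, embed_dim)

vis = VisualEmbeddingTable()
print(vis(torch.randn(2, 256, 1024)).shape)  # torch.Size([2, 256, 4096])
```

The resulting visual embeddings live in the same lookup-style embedding space as the LLM's text tokens, which is the "structural" part of the alignment.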
Parrot
🎉 The code repository for "Parrot: Multilingual Visual Instruction Tuning" in PyTorch.
lmms-finetune
A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v, and more.
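As a rough illustration of what such finetuning involves, the sketch below loads a LLaVA-1.5 checkpoint and attaches LoRA adapters; it uses the generic Hugging Face transformers/peft APIs, not lmms-finetune's own entry points:

```python
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load a LLaVA-1.5 checkpoint; the other supported model families follow
# the same pattern (processor + base model + parameter-efficient adapter).
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Attach LoRA adapters to the language model's attention projections so
# only a small fraction of the weights is trained.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()
```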
EVF-SAM
Official code of "EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model"
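"Early fusion" here means the text prompt and the image are encoded jointly by a vision-language encoder before SAM sees any prompt, and the fused embedding then drives SAM's mask decoder in place of point or box prompts. A toy sketch of that wiring (all modules and dimensions are placeholders, not the repository's classes):

```python
import torch
import torch.nn as nn

class EarlyFusionPromptHead(nn.Module):
    """Toy wiring for EVF-SAM-style early fusion: the fused text-image token
    from a joint encoder is projected into SAM's prompt-embedding space."""

    def __init__(self, fused_dim=768, prompt_dim=256):
        super().__init__()
        self.project = nn.Linear(fused_dim, prompt_dim)

    def forward(self, fused_cls):                 # (batch, fused_dim)
        # One fused token acts as a sparse prompt for SAM's mask decoder.
        return self.project(fused_cls).unsqueeze(1)  # (batch, 1, prompt_dim)

head = EarlyFusionPromptHead()
prompt = head(torch.randn(2, 768))
print(prompt.shape)  # torch.Size([2, 1, 256]); consumed as sparse prompt embeddings
```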
multimodal-chat
A multimodal chat interface with access to many external tools.
LLaMA-Omni
LLaMA-Omni is a low-latency, high-quality end-to-end speech interaction model built on Llama-3.1-8B-Instruct, aiming for GPT-4o-level speech capabilities.
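At a high level, this kind of end-to-end speech interaction chains a speech encoder, an adapter into the LLM's embedding space, and a streaming speech decoder on the output side. A schematic sketch of the adapter stage under those assumptions (component names and sizes are illustrative, not LLaMA-Omni's code):

```python
import torch
import torch.nn as nn

class SpeechAdapter(nn.Module):
    """Toy adapter that downsamples speech-encoder frames and projects them
    into the LLM embedding space (schematic, not LLaMA-Omni's modules)."""

    def __init__(self, enc_dim=1280, llm_dim=4096, stride=5):
        super().__init__()
        self.stride = stride
        self.proj = nn.Linear(enc_dim * stride, llm_dim)

    def forward(self, speech_feats):              # (batch, frames, enc_dim)
        b, t, d = speech_feats.shape
        t = (t // self.stride) * self.stride      # drop remainder frames
        # Stack every `stride` consecutive frames, then project: fewer,
        # denser tokens keep the LLM prefix short for low latency.
        stacked = speech_feats[:, :t].reshape(b, t // self.stride, d * self.stride)
        return self.proj(stacked)                 # (batch, t // stride, llm_dim)

adapter = SpeechAdapter()
print(adapter(torch.randn(1, 103, 1280)).shape)  # torch.Size([1, 20, 4096])
```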
VITA
✨✨VITA: Towards Open-Source Interactive Omni Multimodal LLM
Video-MME
✨✨[CVPR 2025] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
MLVU
🔥🔥MLVU: Multi-task Long Video Understanding Benchmark
ml-slowfast-llava
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
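The "SlowFast" design builds the LLM's visual prefix from two complementary token streams extracted from the same video without any training: a slow pathway with few frames at full spatial detail, and a fast pathway with many frames pooled aggressively. A minimal sketch of that token budgeting (frame and token counts here are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_feats, slow_every=4, fast_pool=4):
    """frame_feats: (frames, h, w, dim) per-frame ViT features.
    Returns one token sequence combining both pathways."""
    t, h, w, d = frame_feats.shape
    # Slow pathway: a sparse subset of frames at full spatial resolution.
    slow = frame_feats[::slow_every].reshape(-1, d)
    # Fast pathway: every frame, spatially average-pooled to save tokens.
    fast = F.avg_pool2d(
        frame_feats.permute(0, 3, 1, 2), kernel_size=fast_pool
    ).permute(0, 2, 3, 1).reshape(-1, d)
    return torch.cat([slow, fast], dim=0)  # concatenated visual prefix

feats = torch.randn(16, 24, 24, 1024)
print(slowfast_tokens(feats).shape)  # 4*576 + 16*36 = 2880 tokens
```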