vision-language topic
QualiCLIP
Quality-Aware Image-Text Alignment for Opinion-Unaware Image Quality Assessment
MSQNet
Actor-agnostic Multi-label Action Recognition with Multi-modal Query [ICCVW '23]
SPEC
[CVPR 2024] The official implementation of paper "synthesize, diagnose, and optimize: towards fine-grained vision-language understanding"
LLaVA-pp
🔥🔥 LLaVA++: Extending LLaVA with Phi-3 and LLaMA-3 (LLaVA LLaMA-3, LLaVA Phi-3)
PMA-Net
With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning. ICCV 2023
VideoGPT-plus
Official Repository of paper VideoGPT+: Integrating Image and Video Encoders for Enhanced Video Understanding
rscir
Official PyTorch implementation and benchmark dataset for IGARSS 2024 ORAL paper: "Composed Image Retrieval for Remote Sensing"
Qwen2-VL-Finetune
An open-source implementaion for fine-tuning Qwen2-VL series by Alibaba Cloud.
VLM-Visualizer
Visualizing the attention of vision-language models
lmms-finetune
A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, llama-3.2-vision, qwen-vl, qwen2-vl, phi3-v etc.