vision-language topic
awesome-japanese-llm
Overview of Japanese LLMs (日本語LLMまとめ)
DriveLM
[ECCV 2024 Oral] DriveLM: Driving with Graph Visual Question Answering
VLTinT
[AAAI 2023 Oral] VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning
awesome-video-text-datasets
A curated list of video-text datasets in a variety of languages. These datasets can be used for video captioning (video description) or video retrieval.
ONE-PEACE
A general representation model across vision, audio, and language modalities. Paper: ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
AlphaCLIP
[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
Awesome-Long-Context
A curated list of resources on long context in large language models and video understanding.
Proto-CLIP
Code release for Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning
SEED
Official implementation of SEED-LLaMA (ICLR 2024).
vision-language-models-are-bows
Experiments and data for the paper "When and why vision-language models behave like bags-of-words, and what to do about it?" (Oral @ ICLR 2023)