vision-language topic
mix-generation
MixGen: A New Multi-Modal Data Augmentation
WaffleCLIP
Official repository for the ICCV 2023 paper: "Waffling around for Performance: Visual Classification with Random Words and Broad Concepts"
ARP
Guide Your Agent with Adaptive Multimodal Rewards (NeurIPS 2023)
OpenFusion
[ICRA 2024 Oral] Open-Fusion: Real-time Open-Vocabulary 3D Mapping and Queryable Scene Representation
SOONet
Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos
HQGA
Video as Conditional Graph Hierarchy for Multi-Granular Question Answering (AAAI'22, Oral)
Shot2Story
Shot2Story: a new multi-shot video understanding benchmark with comprehensive video summaries and detailed shot-level captions.
debias-vision-lang
A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning [AACL 2022]
BagFormer
PyTorch code for BagFormer: Better Cross-Modal Retrieval via bag-wise interaction
PoS-subspaces
[NeurIPS'23] Parts of Speech–Grounded Subspaces in Vision-Language Models