Vision-and-Language topic
video_captioning_datasets
A summary of Video-to-Text datasets. This repository is part of the review paper *Bridging Vision and Language from the Video-to-Text Perspective: A Comprehensive Review*
clevr-dialog
Repository to generate CLEVR-Dialog: A diagnostic dataset for Visual Dialog
FrozenBiLM
[NeurIPS 2022] Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
rosita
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration
IAIS
[ACL 2021] Learning Relation Alignment for Calibrated Cross-modal Retrieval
hulc
Hierarchical Universal Language Conditioned Policies
Explore-And-Match
Official PyTorch implementation of "Explore-And-Match: Bridging Proposal-Based and Proposal-Free With Transformer for Sentence Grounding in Videos"
robo-vln
PyTorch code for the ICRA 2021 paper "Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation"
Xmodal-Ctx
Official PyTorch implementation of our CVPR 2022 paper: Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning
X2-VLM
All-In-One VLM: Image + Video + Transfer to Other Languages / Domains (TPAMI 2023)