Vision and Language Group@ MIL
mcan-vqa
Deep Modular Co-Attention Networks for Visual Question Answering
openvqa
A lightweight, scalable, and general framework for visual question answering research
bottom-up-attention.pytorch
A PyTorch reimplementation of bottom-up-attention models
activitynet-qa
A VideoQA dataset based on the videos from ActivityNet
mt-captioning
A PyTorch implementation of the paper "Multimodal Transformer with Multiview Visual Representation for Image Captioning"
rosita
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration
prophet
Implementation of the CVPR 2023 paper "Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering"