Can I Trust Your Answer? Visually Grounded Video Question Answering

Introduction

We study visually grounded VideoQA by forcing vision-language models (VLMs) to answer questions and simultaneously ground the relevant video moments as visual evidence. We show that this task is easy for humans yet extremely challenging for existing VLMs, revealing that the strong QA performance of these models is actually derived from shortcut learning (e.g., language priors and spurious vision-text correlations) rather than faithful multimodal reasoning. By defining grounded VQA, we hope to discourage such shortcut learning and spark more interpretable and trustworthy techniques. This repository holds our data and code to facilitate the study.

Environment

Assuming you have installed Anaconda, run the following to set up the environment:

>conda create -n videoqa python==3.8
>conda activate videoqa
>conda install pytorch==1.8.1 torchvision==0.9.1 cudatoolkit=11.1 -c pytorch -c nvidia 
>git clone https://github.com/doc-doc/NExT-GQA.git
>cd NExT-GQA
>pip install -r requirements.txt
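
After installation, it can help to confirm that the pinned versions were actually picked up. The following is a minimal sketch, not part of the repository: the helper names `module_version` and `check_versions` are hypothetical, and the prefix comparison is an assumption made because some builds append a local suffix to the version string.

```python
import importlib

# Versions pinned by the conda command in this README.
EXPECTED = {"torch": "1.8.1", "torchvision": "0.9.1"}


def module_version(name):
    """Return the module's __version__ string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def check_versions(expected):
    """Return {module: (expected, found)} for every module that does not match."""
    mismatches = {}
    for name, wanted in expected.items():
        found = module_version(name)
        # Prefix match, since a build may report a suffixed version string.
        if found is None or not found.startswith(wanted):
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    problems = check_versions(EXPECTED)
    if problems:
        for name, (wanted, found) in problems.items():
            print(f"{name}: expected {wanted}, found {found}")
    else:
        print("environment matches the pinned versions")
```

Run it inside the activated `videoqa` environment. An empty mismatch report only means the two pinned packages match; it does not verify the packages listed in requirements.txt.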

Preparation

Please create a data folder outside this repo, so that your workspace contains two folders: 'workspace/data/' and 'workspace/NExT-GQA/'.
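
The layout described above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `check_workspace` is hypothetical, and the workspace root is passed in as whatever path you chose.

```python
from pathlib import Path

# The two sibling folders this README expects inside the workspace.
EXPECTED_FOLDERS = ["data", "NExT-GQA"]


def check_workspace(workspace):
    """Return the names of expected folders that are missing under `workspace`."""
    root = Path(workspace)
    return [name for name in EXPECTED_FOLDERS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = check_workspace("workspace")
    print("missing folders:", ", ".join(missing) if missing else "none")
```

An empty list means both folders are in place. The check only looks at folder names and does not inspect their contents.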

Please download the related video features or the raw videos. Extract the features into workspace/data/nextqa/CLIPL/. If you download the raw videos, you need to decode each video at 6 fps and then extract CLIP frame features with the script provided in code/TempCLIP/tools/extract_feat.sh.
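
For the raw-video route, the 6 fps decoding step could be sketched as below. This is an assumption-laden illustration, not the repository's own pipeline: it assumes ffmpeg is installed and uses its `fps` video filter, and the function names, the `%06d.jpg` frame naming, and the example paths are all hypothetical. Check extract_feat.sh for the frame layout it actually expects before relying on this.

```python
import shutil
import subprocess
from pathlib import Path


def build_decode_command(video_path, frame_dir, fps=6):
    """Build an ffmpeg command that samples `fps` frames per second as JPEGs."""
    out_pattern = Path(frame_dir) / "%06d.jpg"  # hypothetical frame naming
    return [
        "ffmpeg",
        "-i", str(video_path),   # input video
        "-vf", f"fps={fps}",     # resample to the target frame rate
        "-q:v", "2",             # high JPEG quality
        str(out_pattern),
    ]


def decode_video(video_path, frame_dir, fps=6):
    """Create the output folder and run ffmpeg; raises if ffmpeg fails."""
    Path(frame_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_decode_command(video_path, frame_dir, fps), check=True)


if __name__ == "__main__":
    # Hypothetical paths, shown only to illustrate the generated command.
    command = build_decode_command("videos/example.mp4", "frames/example")
    print(" ".join(command))
    if shutil.which("ffmpeg") is None:
        print("ffmpeg not found on PATH; install it before decoding")
```

Keeping command construction separate from execution makes it easy to print and inspect the command for one video before decoding a whole dataset.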

Please follow the instructions in the code/ directory for training and testing the respective models.

Result Visualization (NExT-GQA)

Citation

@inproceedings{xiao2023nextgqa,
  title={Can I Trust Your Answer? Visually Grounded Video Question Answering},
  author={Xiao, Junbin and Yao, Angela and Li, Yicong and Chua, Tat-Seng},
  booktitle={arXiv},
  pages={preprint},
  year={2023},
}
