# Hierarchical Conditional Relation Networks for Video Question Answering (HCRN-VideoQA)

Implementation for the paper "Hierarchical Conditional Relation Networks for Video Question Answering" (Le et al., CVPR 2020, Oral).
We introduce a general-purpose reusable neural unit called the Conditional Relation Network (CRN), which encapsulates and transforms an array of tensorial objects into a new array of the same kind, conditioned on a contextual feature. The flexibility of CRN units is then examined by solving Video Question Answering, a challenging problem requiring joint comprehension of video content and natural language.
Illustrations of the CRN unit and of the HCRN model built from it for VideoQA:
*(Figures: the CRN unit and the HCRN architecture.)*
Check out our paper for details.
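For intuition, here is a minimal sketch of a CRN-style unit in PyTorch. It is our own simplification of the idea described above (subsets of the input array are pooled and fused with the conditioning feature), not the repository's actual implementation; the layer sizes and aggregation functions are illustrative assumptions.

```python
import itertools

import torch
import torch.nn as nn


class CRNUnitSketch(nn.Module):
    """Toy CRN-style unit: maps an array of n objects to a new array of
    objects, one per k-subset of the inputs, conditioned on a context."""

    def __init__(self, dim, k=2):
        super().__init__()
        self.k = k
        # Fuses a pooled k-subset with the conditioning feature (assumption:
        # a single linear layer + ELU stands in for the paper's sub-networks).
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ELU())

    def forward(self, objects, context):
        # objects: list of (dim,) tensors; context: a (dim,) tensor
        outputs = []
        for subset in itertools.combinations(objects, self.k):
            pooled = torch.stack(subset).mean(dim=0)  # aggregate the subset
            outputs.append(self.fuse(torch.cat([pooled, context])))  # condition on context
        return outputs  # a new array of the same kind of objects


# Example: 4 clip-level features conditioned on a question embedding.
unit = CRNUnitSketch(dim=512)
clips = [torch.randn(512) for _ in range(4)]
question = torch.randn(512)
out = unit(clips, question)  # len(out) == 6, i.e. C(4, 2) subsets
```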
## Setup
- Clone the repository:

  ```bash
  git clone https://github.com/thaolmk54/hcrn-videoqa.git
  ```
- Download the TGIF-QA, MSRVTT-QA, and MSVD-QA datasets and edit the absolute paths in `preprocess/preprocess_features.py` and `preprocess/preprocess_questions.py` according to where your data is located. The default paths are `/ceph-g/lethao/datasets/{dataset_name}/`.
- Install dependencies:

  ```bash
  conda create -n hcrn_videoqa python=3.6
  conda activate hcrn_videoqa
  conda install -c conda-forge ffmpeg
  conda install -c conda-forge scikit-video
  pip install -r requirements.txt
  ```
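After the install steps above, a quick sanity check run inside the activated `hcrn_videoqa` environment can confirm that the key dependencies resolve; this assumes PyTorch is pulled in via `requirements.txt`:

```python
# Quick environment check; assumes PyTorch is installed via requirements.txt.
import torch
import skvideo.io  # scikit-video, backed by the conda-forge ffmpeg install

print('torch', torch.__version__, '| CUDA available:', torch.cuda.is_available())
```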
## Experiments with TGIF-QA
Depending on the task, choose `question_type` from one of four options: `action`, `transition`, `count`, `frameqa`.
### Preprocessing visual features
- To extract appearance features:

  ```bash
  python preprocess/preprocess_features.py --gpu_id 2 --dataset tgif-qa --model resnet101 --question_type {question_type}
  ```
- To extract motion features: download the ResNeXt-101 pretrained model (`resnext-101-kinetics.pth`) and place it in `data/preprocess/pretrained/`, then run:

  ```bash
  python preprocess/preprocess_features.py --dataset tgif-qa --model resnext101 --image_height 112 --image_width 112 --question_type {question_type}
  ```
Note: Extracting visual features takes a long time. You can download our pre-extracted features from here and save them in `data/tgif-qa/{question_type}/`. Please use the following command to join the split files:

```bash
cat tgif-qa_{question_type}_appearance_feat.h5.part* > tgif-qa_{question_type}_appearance_feat.h5
```
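To confirm the joined file is a readable HDF5 archive, you can list its contents with `h5py`. The path below uses `frameqa` as an example question type; the snippet prints whatever datasets the file contains rather than assuming their names:

```python
# List the datasets in the joined feature file; `frameqa` is an example
# question type, substitute your own.
import h5py

with h5py.File('data/tgif-qa/frameqa/tgif-qa_frameqa_appearance_feat.h5', 'r') as f:
    for name in f.keys():
        print(name, f[name])  # the dataset repr includes shape and dtype
```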
### Preprocess linguistic features
- Download GloVe pretrained 300d word vectors to `data/glove/` and process them into a pickle file (an optional load check follows this list):

  ```bash
  python txt2pickle.py
  ```
- Preprocess train/test questions:

  ```bash
  python preprocess/preprocess_questions.py --dataset tgif-qa --question_type {question_type} --glove_pt data/glove/glove.840.300d.pkl --mode train

  python preprocess/preprocess_questions.py --dataset tgif-qa --question_type {question_type} --mode test
  ```
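If question preprocessing cannot find the word vectors, a quick check that the pickle written by `txt2pickle.py` loads can help; this only inspects the object, without assuming its exact schema:

```python
# Load-check for the GloVe pickle; we report type and size only, since the
# schema is whatever txt2pickle.py produced.
import pickle

with open('data/glove/glove.840.300d.pkl', 'rb') as f:
    glove = pickle.load(f)
print(type(glove), len(glove) if hasattr(glove, '__len__') else 'no __len__')
```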
### Training
Choose a suitable config file in `configs/{task}.yml` for one of the four tasks (`action`, `transition`, `count`, `frameqa`) to train the model. For example, to train on the action task, run the following command:

```bash
python train.py --cfg configs/tgif_qa_action.yml
```
### Evaluation
To evaluate the trained model, run the following:

```bash
python validate.py --cfg configs/tgif_qa_action.yml
```

Note: A pretrained model for the action task is available here. Save the file in `results/expTGIF-QAAction/ckpt/` for evaluation.
## Experiments with MSRVTT-QA and MSVD-QA
The following shows how to run experiments with the MSRVTT-QA dataset; replace `msrvtt-qa` with `msvd-qa` to run with the MSVD-QA dataset.
### Preprocessing visual features
- To extract appearance features:

  ```bash
  python preprocess/preprocess_features.py --gpu_id 2 --dataset msrvtt-qa --model resnet101
  ```

- To extract motion features:

  ```bash
  python preprocess/preprocess_features.py --dataset msrvtt-qa --model resnext101 --image_height 112 --image_width 112
  ```
### Preprocess linguistic features
Preprocess train/val/test questions (`--question_type` is not needed here, as MSRVTT-QA and MSVD-QA are open-ended datasets with a single question type):

```bash
python preprocess/preprocess_questions.py --dataset msrvtt-qa --glove_pt data/glove/glove.840.300d.pkl --mode train

python preprocess/preprocess_questions.py --dataset msrvtt-qa --mode val

python preprocess/preprocess_questions.py --dataset msrvtt-qa --mode test
```
### Training

```bash
python train.py --cfg configs/msrvtt_qa.yml
```
### Evaluation

To evaluate the trained model, run the following:

```bash
python validate.py --cfg configs/msrvtt_qa.yml
```
## Citations

If you make use of this repository for your research, please cite the following paper:

```
@article{le2020hierarchical,
  title={Hierarchical Conditional Relation Networks for Video Question Answering},
  author={Le, Thao Minh and Le, Vuong and Venkatesh, Svetha and Tran, Truyen},
  journal={arXiv preprint arXiv:2002.10698},
  year={2020}
}
```