vRGV
Visual Relation Grounding in Videos (ECCV'20, Spotlight)
This is the PyTorch implementation of our work at ECCV 2020 (Spotlight).
The repository includes three parts: (1) RoI feature extraction; (2) training and inference; and (3) relation-aware trajectory generation.
Notes
Fixed an issue with unstable results [2021/10/07].
Environment
Anaconda 3, Python 3.6.5, PyTorch 0.4.1 (a higher version is fine once the features are ready), and CUDA >= 9.0. For other libraries, please refer to requirements.txt.
Install
Please create an environment for this project using Anaconda 3 (install Anaconda first):

```shell
conda create -n envname python=3.6.5   # create the environment
conda activate envname                 # enter the environment
pip install -r requirements.txt        # install the required libraries
sh vRGV/lib/make.sh                    # build the detection ops; make sure nvcc is available
```
Data Preparation
Please download the data here. The folder ground_data should be in the same directory as vRGV. Please merge the downloaded vRGV folder with this repo.
Please download the videos here and extract the frames into ground_data. The directory should look like: ground_data/vidvrd/JPEGImages/ILSVRC2015_train_xxx/000000.JPEG.
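Before running feature extraction, it may help to confirm the frames landed in the expected layout. A minimal sketch (the `check_frames` helper is not part of the repo; adjust `ROOT` to your checkout):

```python
import glob
import os

ROOT = "ground_data/vidvrd/JPEGImages"

def check_frames(root=ROOT):
    """Return the video folders that contain at least one extracted JPEG frame."""
    videos = sorted(glob.glob(os.path.join(root, "ILSVRC2015_train_*")))
    ok = [v for v in videos if glob.glob(os.path.join(v, "*.JPEG"))]
    print(f"{len(ok)}/{len(videos)} video folders contain frames")
    return ok
```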
Usage
Feature Extraction. (This needs about 100 GB of storage, because all detected bounding boxes are dumped along with their features. The footprint can be greatly reduced by changing detect_frame.py to return only the top-40 bounding boxes and saving them as .npz files.)
```shell
./detection.sh 0 val # (or train)
```
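The storage-reduction idea mentioned above (keep only the top-40 boxes, store them compressed) could look like the following sketch. The function name and array layout are assumptions for illustration, not the repo's actual code:

```python
import numpy as np

def save_topk_detections(bboxes, scores, features, out_path, k=40):
    """Keep only the top-k scoring boxes and store them compressed.

    bboxes: (N, 4) float array; scores: (N,) float array;
    features: (N, D) float array of RoI features.
    """
    # sort by score, descending, and keep the first k indices
    order = np.argsort(scores)[::-1][:k]
    np.savez_compressed(out_path,
                        bboxes=bboxes[order],
                        scores=scores[order],
                        features=features[order])
```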
Sample video features:

```shell
cd tools
python sample_video_feature.py
```
Test. You can use our provided model to verify the features and environment:

```shell
./ground.sh 0 val              # output the relation-aware spatio-temporal attention
python generate_track_link.py  # generate relation-aware trajectories with the Viterbi algorithm
python eval_ground.py          # evaluate the performance
```
You should obtain the accuracy Acc_R: 24.58%.
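Trajectory generation links per-frame boxes into tracks with a Viterbi-style dynamic program. A minimal sketch of the idea (not the repo's actual implementation), assuming a link score that combines each box's unary score with the IoU between consecutive boxes:

```python
import numpy as np

def iou(a, b):
    """IoU between two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def viterbi_link(frames):
    """frames: list of (boxes (N, 4), scores (N,)) tuples, one per frame.
    Returns one box index per frame forming the max-scoring path."""
    prev_boxes, scores0 = frames[0]
    dp = scores0.copy()          # best path score ending at each box
    back = []                    # backpointers per frame
    for boxes, scores in frames[1:]:
        # trans[i, j]: score of reaching box i via previous box j
        trans = np.array([[dp[j] + iou(prev_boxes[j], boxes[i])
                           for j in range(len(prev_boxes))]
                          for i in range(len(boxes))])
        back.append(trans.argmax(axis=1))
        dp = scores + trans.max(axis=1)
        prev_boxes = boxes
    # backtrack from the best final box
    path = [int(dp.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]
```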
Train. If you want to train the model from scratch, please apply a two-stage training scheme: (1) train a basic model without relation attendance, and (2) load the reconstruction part of the pre-trained model and learn the whole model (with the same learning rate). For implementation, turn [pretrain] off/on in line 52 of ground.py, and switch between lines 6 & 7 in ground_relation.py for 1st- and 2nd-stage training respectively. Also, for 2nd-stage training, change the model files in lines 69 & 70 of ground_relation.py to the best model obtained at the first stage.
```shell
./ground.sh 0 train # train the model with GPU id 0
```
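The second stage described above amounts to initializing part of the model from the stage-1 checkpoint. A generic PyTorch sketch of loading only a sub-module's weights (the `reconstructor.` prefix and helper name are illustrative assumptions, not the repo's actual identifiers):

```python
import torch

def load_submodule_weights(model, ckpt_path, prefix="reconstructor."):
    """Copy only the parameters under `prefix` from a checkpoint into `model`."""
    state = torch.load(ckpt_path, map_location="cpu")
    # keep only the tensors belonging to the chosen sub-module
    partial = {k: v for k, v in state.items() if k.startswith(prefix)}
    # strict=False leaves every other parameter at its fresh initialization
    missing, unexpected = model.load_state_dict(partial, strict=False)
    print(f"loaded {len(partial)} tensors; {len(missing)} params left at init")
    return model
```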
The results may differ slightly (±0.5%). For comparison, please follow the results reported in our paper.
Result Visualization
| Query | bicycle-jump_beneath-person | person-feed-elephant | person-stand_above-bicycle | dog-watch-turtle |
|---|---|---|---|---|
| Result | ![]() | ![]() | ![]() | ![]() |

| Query | person-ride-horse | person-ride-bicycle | person-drive-car | bicycle-move_toward-car |
|---|---|---|---|---|
| Result | ![]() | ![]() | ![]() | ![]() |
Citation
```
@inproceedings{xiao2020visual,
  title={Visual Relation Grounding in Videos},
  author={Xiao, Junbin and Shang, Xindi and Yang, Xun and Tang, Sheng and Chua, Tat-Seng},
  booktitle={European Conference on Computer Vision},
  pages={447--464},
  year={2020},
  organization={Springer}
}
```
License
NUS © NExT++