R3-Transformer
Official Python implementation of R3-Transformer
This is the official code release for R3-Transformer proposed in Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language.
Installation
Option (I)
All dependencies are included in the original model's container. First, install the latest version of Docker. Then pull our Docker image:
docker pull hassanhub/vid_cap:latest
Then run the container, mounting your data directory (the host path and mount point below are placeholders; substitute your own):
docker run --gpus all --name r3_container -it -v /home/<user>/data:/data hassanhub/vid_cap:latest
Note: This image already includes CUDA-related drivers and dependencies.
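To confirm that the container can see your GPUs, a quick sanity check (not part of the official workflow) is to list the devices TensorFlow detects:
docker exec -it r3_container python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"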
Option (II)
Alternatively, you can create your own environment and make sure the following dependencies are installed (a sample install command follows the list):
- Python 3.7/3.8
- TensorFlow 2.3
- CUDA 10.1
- NVIDIA driver v440.100
- cuDNN 7.6.5
- opencv-python
- h5py
- transformers
- matplotlib
- scikit-image
- nvidia-ml-py3
- decord
- pandas
- tensorcore.dataflow
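The pip-installable packages above can be set up with something like the following (a sketch: the TensorFlow pin matches the version listed; CUDA, cuDNN, and the NVIDIA driver must be installed at the system level; and tensorcore.dataflow may need to be installed from its own source rather than PyPI):
pip install tensorflow==2.3.0 opencv-python h5py transformers matplotlib scikit-image nvidia-ml-py3 decord pandas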
Data Preparation
To speed up the data infeed, we use a multi-chunk HDF5 format. There are two options for preparing the data for training and evaluation.
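As a rough picture of what such a layout enables at training time, the minimal sketch below iterates over per-chunk HDF5 files with h5py (the file pattern, dataset keys, and shapes are assumptions for illustration, not the exact format produced by our scripts):

import glob
import h5py

# Assumed layout: one HDF5 file per chunk, each holding a 'features'
# dataset of shape (num_clips, T, D) and a parallel 'captions' dataset
# of encoded strings.
for chunk_path in sorted(glob.glob('data/chunk_*.h5')):
    with h5py.File(chunk_path, 'r') as f:
        feats = f['features'][:]  # float32 feature array
        caps = [c.decode('utf-8') for c in f['captions'][:]]
        # ...feed (feats, caps) pairs into the training input pipeline...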
Option (I)
Download pre-extracted features, computed with a SlowFast-50-8x8 backbone pre-trained on Kinetics-400, from this link:
- Parts 0-10 (coming soon...)
Option (II)
Alternatively, you can follow these steps to extract a customized version of the features using your own visual backbone:
- Download YouCook II
- Download ActivityNet Captions
- Pre-process raw video files using this script
- Extract visual features with your own backbone or our pre-trained SlowFast-50-8x8 using this script
- Store the features and captions in the multi-chunk HDF5 format using this script (a minimal sketch of this step follows the list)
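For illustration, the sketch below splits extracted features and their captions across several HDF5 chunk files with h5py (the dataset keys, file naming, and even-split chunking are placeholder choices; the script above defines the canonical format):

import h5py
import numpy as np

def write_chunks(features, captions, num_chunks=10, prefix='data/chunk'):
    # Split N clips evenly across `num_chunks` HDF5 files; `features` is an
    # (N, T, D) array and `captions` a list of N strings (hypothetical shapes).
    bounds = np.linspace(0, len(features), num_chunks + 1, dtype=int)
    str_dt = h5py.string_dtype(encoding='utf-8')
    for i in range(num_chunks):
        lo, hi = bounds[i], bounds[i + 1]
        with h5py.File(f'{prefix}_{i:02d}.h5', 'w') as f:
            f.create_dataset('features', data=features[lo:hi], compression='gzip')
            f.create_dataset('captions', data=captions[lo:hi], dtype=str_dt)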