EgoVLP: Egocentric Video-Language Pretraining (NeurIPS 2022)
TL;DR: We pioneer Egocentric Video-Language Pretraining across the pretraining dataset, the model, and the development benchmark; the resulting pretrained model exhibits strong performance on five downstream tasks across three egocentric datasets.

📢 News
- [2022.6.3] We release the arXiv paper.
- [2022.6.10] We release the EgoClip pretraining dataset.
- [2022.6.20] Our EgoVLP won 1st place in OSCC & 2nd place in NLQ & 3rd place in PNR @ Ego4D Challenge 2022, and 1st place in Multi-Instance Retrieval @ EPIC-Kitchens Challenge 2022, hosted by CVPR 2022.
- [2022.6.30] We release the first version of the EgoVLP codebase.
📝 Preparation
Install dependencies
conda create -n egovlp python=3.6
source activate egovlp
cd [Path_To_This_Code]
pip install -r requirements.txt
Ego4D videos and metadata
You may skip the source video download if pretraining is not required.
- Follow the guideline here to download the following to `{PATH_TO_EGO4D}`:
  - Ego4D source videos (nearly 7 TB).
  - Ego4D video metadata `manifest.csv` and benchmark metadata, e.g., `nlq_train.json` for NLQ.
- Create the dir `./dataset` and add a soft link via `ln -s {PATH_TO_EGO4D} ./dataset/ego4d`.
- For efficient pretraining, we compress the videos as follows (see the sketch below):
  - Resize the source videos so that the short side equals 256, using the script `./utils/video_resize.py`.
  - Chunk the resized videos into multiple segments (up to 600 s each), using the script `./utils/video_chunk.py`.
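If you want to see what these two steps do conceptually, here is a rough ffmpeg-based sketch of the same operations. The directory names and encoder settings are placeholders, not the repo's defaults; use `./utils/video_resize.py` and `./utils/video_chunk.py` for the real pipeline.

```python
# Illustrative sketch of the resize + chunk preprocessing (not the repo's exact scripts).
# Assumes ffmpeg is on PATH; directory names below are placeholders.
import subprocess
from pathlib import Path

SRC_DIR = Path("./dataset/ego4d/full_scale")            # assumption: raw Ego4D videos
RESIZED_DIR = Path("./dataset/ego4d/videos_256")         # assumption: resized videos
CHUNK_DIR = Path("./dataset/ego4d/videos_256_chunked")   # assumption: chunked videos

def resize_short_side(src: Path, dst: Path, short_side: int = 256) -> None:
    """Resize so the shorter side equals `short_side`, keeping the aspect ratio."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    vf = f"scale='if(lt(iw,ih),{short_side},-2)':'if(lt(iw,ih),-2,{short_side})'"
    subprocess.run(["ffmpeg", "-y", "-i", str(src), "-vf", vf,
                    "-c:a", "copy", str(dst)], check=True)

def chunk_video(src: Path, dst_dir: Path, seg_sec: int = 600) -> None:
    """Split a video into segments of at most `seg_sec` seconds without re-encoding."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(["ffmpeg", "-y", "-i", str(src), "-c", "copy", "-f", "segment",
                    "-segment_time", str(seg_sec), "-reset_timestamps", "1",
                    str(dst_dir / f"{src.stem}_%03d.mp4")], check=True)

if __name__ == "__main__":
    for video in sorted(SRC_DIR.glob("*.mp4")):
        resized = RESIZED_DIR / video.name
        resize_short_side(video, resized)
        chunk_video(resized, CHUNK_DIR / video.stem)
```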
EgoClip
- Download the EgoClip metadata from here and put it at `./dataset/egoclip.csv`.
- For the usage of EgoClip, please refer to `./data_loader/EgoClip_EgoMCQ_dataset.py`. The data format of EgoClip is:

```python
import pandas as pd

metadata = pd.read_csv('./dataset/egoclip.csv', sep='\t', error_bad_lines=False)
print(metadata.shape[0])
print(metadata.iloc[0])

# Out:
# 3847723                                                        # Num of clips for EgoClip
#
# clip_idx                                                    0  # the idx of clip
# video_uid                001e3e4e-2743-47fc-8564-d5efd11f9e90  # the uid of source video
# video_dur                                          128.033333  # the duration of source video
# narration_source                             narration_pass_1  # the source of annotator
# narration_ind                                               0  # the idx of narration
# narration_time                                         3.3445  # the narration timestamp
# clip_start                                           2.967651  # the start timestamp of clip
# clip_end                                             3.721266  # the end timestamp of clip
# clip_text          #C C picks a bag of clothes from the floor  # the narration of clip
# tag_verb                                                 [93]  # the verb idx of the narration
# tag_noun                                       [192, 115, 12]  # the noun idx of the narration
```
For the usage of `tag_verb` and `tag_noun`, please refer to here (a minimal parsing sketch follows below).
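When the CSV is read with pandas, `tag_verb` and `tag_noun` come back as plain strings such as `"[93]"`; the snippet below is a minimal sketch for turning them into integer lists. The real verb/noun index mapping is the taxonomy linked above.

```python
# Minimal sketch: parse the stringified verb/noun index lists in egoclip.csv.
import ast
import pandas as pd

metadata = pd.read_csv('./dataset/egoclip.csv', sep='\t', error_bad_lines=False)

metadata['tag_verb'] = metadata['tag_verb'].apply(ast.literal_eval)  # "[93]" -> [93]
metadata['tag_noun'] = metadata['tag_noun'].apply(ast.literal_eval)  # "[192, 115, 12]" -> [192, 115, 12]

row = metadata.iloc[0]
print(row['clip_text'], row['tag_verb'], row['tag_noun'])
```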
EgoMCQ
- Download the EgoMCQ metadata from here and put it at `./dataset/egomcq.json`.
- For the usage of EgoMCQ, please refer to `./data_loader/EgoClip_EgoMCQ_dataset.py`.
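As a quick sanity check after downloading, the metadata is a single JSON file that can be inspected directly; the dataloader above defines the exact schema, so the snippet below only peeks at the top-level structure.

```python
# Quick sanity check on the downloaded EgoMCQ metadata; the exact schema
# is defined in ./data_loader/EgoClip_EgoMCQ_dataset.py.
import json

with open('./dataset/egomcq.json') as f:
    egomcq = json.load(f)

print(type(egomcq))   # container type (dict or list)
print(len(egomcq))    # number of top-level entries
if isinstance(egomcq, dict):
    first_key = next(iter(egomcq))
    print(first_key, egomcq[first_key])  # peek at one entry
```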
🏋️️ Pretraining
This code is built on PyTorch with DistributedDataParallel (DDP). We pretrain EgoVLP on 4 nodes, each with 8 A100 GPUs (10 epochs in about two days).
- Train on EgoClip:
  python3 -m torch.distributed.launch --nnodes=$HOST_NUM --node_rank=$INDEX --master_addr $CHIEF_IP --nproc_per_node $HOST_GPU_NUM --master_port 8081 ./run/train_egoclip.py --config ./configs/pt/egoclip.json
- Test on EgoMCQ:
  python3 -m torch.distributed.launch --nnodes=$HOST_NUM --node_rank=$INDEX --master_addr $CHIEF_IP --nproc_per_node $HOST_GPU_NUM --master_port 8081 ./run/train_egoclip.py --config ./configs/eval/egomcq.json
- Monitor the EgoMCQ curve during pretraining:
  tensorboard --logdir ./results --bind_all
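For reference, `torch.distributed.launch` starts one process per GPU, passes a `--local_rank` argument, and exports MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE. The stub below is generic PyTorch DDP boilerplate compatible with the commands above, not the repo's `./run/train_egoclip.py`; it may help when adapting the launch to a different cluster setup.

```python
# Minimal stub compatible with the torch.distributed.launch commands above.
# Generic DDP boilerplate, not the repo's training entry point.
import argparse
import torch
import torch.distributed as dist

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--local_rank', type=int, default=0)  # injected by the launcher
    args, _ = parser.parse_known_args()

    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend='nccl')  # reads MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE

    model = torch.nn.Linear(256, 256).cuda()
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.local_rank])

    if dist.get_rank() == 0:
        print('world size:', dist.get_world_size())
    dist.destroy_process_group()

if __name__ == '__main__':
    main()
```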
🗄 Pretrained Weights
- We have released the pretrained EgoVLP model (EgoClip w/ EgoNCE) with the best EgoMCQ performance (90.7% inter-video & 57.2% intra-video) as EgoVLP_PT_BEST.
This checkpoint is used for the EPIC-Kitchens, NLQ, MQ, OSCC, and PNR tasks, but not for Charades-Ego: since we found that VLP (CC3M+WebVid2M, EgoClip) always degrades significantly on Charades-Ego after the first epoch, we evaluate Charades-Ego with the weights from the first pretraining epoch, released as EgoVLP_PT_EPO1.
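To probe a released checkpoint outside the provided configs, a common pattern is to load the `state_dict` and strip the DDP `module.` prefixes before handing it to the model. This is a generic PyTorch sketch; the key names (`state_dict`, `module.`) are assumptions based on common conventions, not a documented interface of this repo.

```python
# Generic sketch for inspecting/loading a released checkpoint.
# Key names are assumptions based on common PyTorch/DDP conventions.
import torch

ckpt = torch.load('egovlp_pt_best.pth', map_location='cpu')  # placeholder filename
state_dict = ckpt.get('state_dict', ckpt)

# Strip the 'module.' prefix that DistributedDataParallel adds to parameter names.
state_dict = {k.replace('module.', '', 1): v for k, v in state_dict.items()}

print(len(state_dict), 'tensors')
for name in list(state_dict)[:5]:
    print(name, tuple(state_dict[name].shape))

# model.load_state_dict(state_dict, strict=False)  # with the model built from the config
```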
🔧 Downstream Tasks
EPIC-Kitchens MIR
- Preparation:
  - Follow the instructions here to download the EPIC-Kitchens dataset (RGB frames) and annotations.
  - Follow the instructions here -> "How do I create the relevance matrix?" to construct the relevance matrix used for evaluation (a sketch of the relevance definition is given at the end of this subsection).
- Results:
Model | Mode | # Frames | Video-Text PT | Weights | mAP (V2T) | mAP (T2V) | mAP (Avg) | nDCG (V2T) | nDCG (T2V) | nDCG (Avg) |
---|---|---|---|---|---|---|---|---|---|---|
EgoVLP | Zero-shot | 4 | EgoClip w/ EgoNCE | EgoVLP_PT_BEST | 19.4 | 13.9 | 16.6 | 24.1 | 22.0 | 23.1 |
EgoVLP | Fine-tuning w/ MI-MM | 16 | EgoClip w/ EgoNCE | EgoVLP_FT_EPIC | 49.9 | 40.5 | 45.0 | 60.9 | 57.9 | 59.4 |
EgoVLP* | Fine-tuning w/ Adaptive-MI-MM + Dual-softmax | 16 | EgoClip w/ EgoNCE | EgoVLP_FT_EPIC* | 53.8 | 40.9 | 47.4 | 63.3 | 59.6 | 61.4 |
EgoVLP* denotes our submission to the Multi-Instance Retrieval @ EPIC-Kitchens Challenge 2022.
- Train:
  python3 -m torch.distributed.launch --nnodes=$HOST_NUM --node_rank=$INDEX --nproc_per_node $HOST_GPU_NUM --master_port 8081 ./run/train_epic.py --config ./configs/ft/epic.json
- Test:
  python3 ./run/test_epic.py
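For intuition on the relevance matrix mentioned in the preparation step: the EPIC-Kitchens-100 retrieval benchmark scores the relevance between two captions as the average of their verb-class IoU and noun-class IoU. The sketch below implements that idea on toy data; use the official instructions linked above to produce the matrix actually used for evaluation, since details such as class parsing may differ.

```python
# Sketch of caption-to-caption relevance for mAP/nDCG evaluation:
# relevance(i, j) = 0.5 * IoU(verb classes) + 0.5 * IoU(noun classes).
# Follow the official EPIC-Kitchens instructions for the real matrix.
import numpy as np

def iou(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

def relevance_matrix(verbs, nouns):
    """verbs/nouns: lists of class-id sets, one per caption."""
    n = len(verbs)
    rel = np.zeros((n, n), dtype=np.float32)
    for i in range(n):
        for j in range(n):
            rel[i, j] = 0.5 * iou(verbs[i], verbs[j]) + 0.5 * iou(nouns[i], nouns[j])
    return rel

# Toy example: captions 0 and 1 share the verb but not the noun.
verbs = [{0}, {0}, {3}]
nouns = [{10}, {11}, {10, 12}]
print(relevance_matrix(verbs, nouns))
```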
Charades-Ego
- Preparation:
  - Follow the instructions here to download the Charades-Ego dataset (480p) and annotations.
  - Create the training metadata via ./utils/charades_meta.py.
- Results:
Model | Mode | # Frames | Video-Text PT | Weights | mAP |
---|---|---|---|---|---|
EgoVLP | Zero-shot | 16 | EgoClip w/ EgoNCE | EgoVLP_PT_EPO1 | 25.0 |
EgoVLP | Fine-tuning w/ InfoNCE | 16 | EgoClip w/ EgoNCE | EgoVLP_FT_CHARADES | 32.1 |
- Train:
  python3 -m torch.distributed.launch --nnodes=$HOST_NUM --node_rank=$INDEX --nproc_per_node $HOST_GPU_NUM --master_port 8081 ./run/train_epic.py --config ./configs/ft/charades.json
- Test:
  python3 ./run/test_charades.py
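The Charades-Ego fine-tuning objective listed in the table above is InfoNCE over paired video/text embeddings. Below is a generic symmetric InfoNCE sketch (video-to-text plus text-to-video cross-entropy over in-batch negatives); the temperature value and the L2 normalisation are illustrative assumptions, and the repo's actual loss implementation may differ in details.

```python
# Generic symmetric InfoNCE over a batch of paired video/text embeddings.
# The temperature value is illustrative, not the repo's setting.
import torch
import torch.nn.functional as F

def info_nce(video_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.05):
    """video_emb, text_emb: (B, D) embeddings of matching pairs (row i <-> row i)."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                 # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)    # match each video to its text
    loss_t2v = F.cross_entropy(logits.T, targets)  # match each text to its video
    return 0.5 * (loss_v2t + loss_t2v)

# Usage with random tensors:
loss = info_nce(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```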
NLQ
- Preparation: Make sure you have prepared the NLQ videos and metadata.
- Extract video features: python3 ./run/test_nlq.py --subsample 'text'
- Extract text features: python3 ./run/test_nlq.py --subsample 'video'
- Fine-tune VSLNet by replacing its input video-text features.
MQ
- Preparation: Make sure you have prepared the MQ videos and metadata.
- Extract video features: python3 ./run/test_mq.py
- Fine-tune VSGN by replacing its input video features.
OSCC
- Preparation:
- Make sure you have prepared the OSCC videos and metadata.
- Extract the clip frames following the instructions here -> Data Preparation.
- Train:
python3 -m torch.distributed.launch --nnodes=$HOST_NUM --node_rank=$INDEX --nproc_per_node $HOST_GPU_NUM --master_port 8081 ./run/train_oscc.py --config ./configs/ft/oscc.json
PNR
- Preparation: Same as OSCC.
- Train:
python3 -m torch.distributed.launch --nnodes=$HOST_NUM --node_rank=$INDEX --nproc_per_node $HOST_GPU_NUM --master_port 8081 ./run/train_pnr.py --config ./configs/ft/pnr.json
🎓 Citation
If you find our work helpful, please cite our paper.
@article{kevin2022egovlp,
title={Egocentric Video-Language Pretraining},
author={Kevin Qinghong Lin and Alex Jinpeng Wang and Mattia Soldan and Michael Wray and Rui Yan and Eric Zhongcong Xu and Difei Gao and Rongcheng Tu and Wenzhe Zhao and Weijie Kong and Chengfei Cai and Hongfa Wang and Dima Damen and Bernard Ghanem and Wei Liu and Mike Zheng Shou},
journal={arXiv preprint arXiv:2206.01670},
year={2022}
}
✉️ Contact
This repo is maintained by Kevin. Questions and discussions are welcome via [email protected].
We are happy to merge results and code if you transfer our EgoVLP to other egocentric tasks or datasets.
🙏 Acknowledgements
This codebase is based on Frozen.
Thanks to Alex for the help with DDP and Mattia for the help with NLQ and MQ benchmarks.
LICENSE
MIT