
Commands for MQ Training with VSGN

JunweiLiang opened this issue 1 year ago • 9 comments

Hi, thanks for releasing the code!

Could you provide some instructions on how to run VSGN training with EgoVLP features (hyper-parameters, learning rate, etc.)? Thanks!

Junwei

JunweiLiang avatar Jul 08 '22 03:07 JunweiLiang

Hello Junwei,

Thanks for your interest in our work. I will update the instructions and related details for MQ next.

Thank you for your patience!

QinghongLin avatar Jul 19 '22 09:07 QinghongLin

Hi Junwei,

I have uploaded the video features for the MQ task to Google Drive: train&val / test, so you can download them directly. What you need to do is replace the input features with our features, and I have attached the config of our best VSGN model here: config.txt.
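
For reference, a minimal download-and-extract sketch, assuming the gdown package and a placeholder share link (swap in the actual train&val / test links above); the paths below are illustrative only:

    # Sketch only: fetch and unpack the shared MQ features.
    import tarfile

    import gdown

    DRIVE_URL = "https://drive.google.com/file/d/<file-id>/view"  # hypothetical placeholder link
    ARCHIVE = "egovlp_mq_feats_trainval.tar.gz"

    gdown.download(DRIVE_URL, ARCHIVE, quiet=False, fuzzy=True)  # fuzzy=True parses share links

    with tarfile.open(ARCHIVE, "r:gz") as tar:
        tar.extractall("data/egovlp_feats_official")  # one *.pt file per MQ clip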

Please try it out and let us know if you have new results.

QinghongLin avatar Aug 18 '22 06:08 QinghongLin

I have downloaded the features, but they seem to be a single file. Are they a single pickle binary with dictionary keys? How do I read them and map them to the videos (for example, slowfast8x8_r101_k400/ has 9645 *.pt files, each corresponding to a video)?

Thanks.

JunweiLiang avatar Aug 18 '22 08:08 JunweiLiang

It is a gz file; after unzipping it (I unzipped it on my Mac), you will see a directory that contains multiple *.pt files, e.g., 0a8f6747-7f79-4176-85ca-f5ec01a15435.pt. This .pt file corresponds to the video features of the clip 0a8f6747-7f79-4176-85ca-f5ec01a15435.

The clip information is provided by the MQ metadata, i.e., clip xxx comes from video yyy with start time t1 and end time t2.
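
For reference, a minimal inspection sketch (not from the EgoVLP/MQ codebase): it loads one clip *.pt file and looks up the clip in the MQ metadata. The annotation path follows the config shared later in this thread, and the exact JSON schema is an assumption to adapt as needed:

    import json

    import torch

    clip_id = "0a8f6747-7f79-4176-85ca-f5ec01a15435"

    # Each *.pt file holds the EgoVLP features of one MQ clip.
    feats = torch.load(f"data/egovlp_feats_official/{clip_id}.pt", map_location="cpu")
    print(feats.shape)  # roughly (num_snippets, feature_dim)

    # Clip-to-video mapping (video yyy, start t1, end t2) comes from the MQ metadata.
    with open("Evaluation/ego4d/annot/clip_annotations.json") as f:
        clip_annos = json.load(f)
    print(clip_annos.get(clip_id))  # key layout may differ; check the actual file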

QinghongLin avatar Aug 18 '22 08:08 QinghongLin

I see. The file you provided on Google Drive is a .tar.gz file; I extracted it with tar -zxf and got 2034 *.pt files for the train/val part. I will try them.

JunweiLiang avatar Aug 18 '22 08:08 JunweiLiang

So 0a8f6747-7f79-4176-85ca-f5ec01a15435 is the clip ID instead of the video ID? Could you provide feature files for the whole videos, as in the VSGN baseline? It reads the features of the whole video and then cuts out the corresponding clip (see here). To follow your instructions I would need these video-level features.

Thanks.

JunweiLiang avatar Aug 18 '22 08:08 JunweiLiang

Yes, it is the clip ID. And sorry, I am currently unable to provide video-level features; a solution is to rewrite the data loader so that it supports clip features as input.
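
A rough sketch of that data-loader change, assuming the baseline loader reads one whole-video feature file and slices out the clip; names here are illustrative, not the actual VSGN codebase API:

    import os

    import torch

    def load_clip_features(feature_path, clip_id):
        """Load pre-extracted EgoVLP features for a single MQ clip."""
        feats = torch.load(os.path.join(feature_path, f"{clip_id}.pt"), map_location="cpu")
        # The video-level baseline would instead do roughly:
        #   video_feats = torch.load(video_file)
        #   feats = video_feats[clip_start_idx:clip_end_idx]
        # Each clip file here already covers exactly [t1, t2], so no cutting is needed.
        return feats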

QinghongLin avatar Aug 18 '22 11:08 QinghongLin

@QinghongLin - Thanks for providing the clip features. I tried training the VSGN model using the Ego4D episodic-memory codebase instructions. But I'm not able to reproduce the val results from the paper. The numbers are quite a bit lower than the paper results (2nd row vs. 3rd row in the figure below).

[image: val results comparison]

Here is the training command I used. Note: I modified the data loader to use clip features instead of video features.

    python Train.py \
        --use_xGPN \
        --is_train true \
        --dataset ego4d \
        --feature_path data/egovlp_feats_official \
        --checkpoint_path checkpoints/ \
        --tb_dir tb/ \
        --batch_size 24 \
        --train_lr 0.00005 \
        --use_clip_features true \
        --input_feat_dim 256 \
        --num_epoch 100

srama2512 avatar Sep 15 '22 04:09 srama2512

Hi @srama2512, I released the codebase here: MQ.zip. You can check the data loader details regarding clip-level feature loading. Besides, I was able to check the config parameters; can you try the following parameters?

    {'dataset': 'ego4d', 'is_train': 'true', 'out_prop_map': 'true',
     'feature_path': '/mnt/sdb1/Datasets/Ego4d/action_feature_canonical',
     'clip_anno': 'Evaluation/ego4d/annot/clip_annotations.json',
     'moment_classes': 'Evaluation/ego4d/annot/moment_classes_idx.json',
     'checkpoint_path': 'checkpoint',
     'output_path': './outputs/hps_search_egovlp_egonce_features/23/',
     'prop_path': 'proposals', 'prop_result_file': 'proposals_postNMS.json',
     'detect_result_file': 'detections_postNMS.json',
     'retrieval_result_file': 'retreival_postNMS.json',
     'detad_sensitivity_file': 'detad_sensitivity',
     'batch_size': 32, 'train_lr': 5e-05, 'weight_decay': 0.0001,
     'num_epoch': 50, 'step_size': 15, 'step_gamma': 0.1,
     'focal_alpha': 0.01, 'nms_alpha_detect': 0.46, 'nms_alpha_prop': 0.75,
     'nms_thr': 0.4, 'temporal_scale': 928, 'input_feat_dim': 2304,
     'bb_hidden_dim': 256, 'decoder_num_classes': 111, 'num_levels': 5,
     'num_head_layers': 4, 'nfeat_mode': 'feat_ctr', 'num_neigh': 12,
     'edge_weight': 'false', 'agg_type': 'max', 'gcn_insert': 'par',
     'iou_thr': [0.5, 0.5, 0.7], 'anchor_scale': [1, 10], 'base_stride': 1,
     'stitch_gap': 30, 'short_ratio': 0.4, 'clip_win_size': 0.38,
     'use_xGPN': False, 'use_VSS': False, 'num_props': 200,
     'tIoU_thr': [0.1, 0.2, 0.3, 0.4, 0.5], 'eval_stage': 'all',
     'infer_datasplit': 'val'}
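
As a rough sketch of reusing these parameters, assuming (as with the command earlier in this thread) that Train.py exposes each config key as a --<key> flag; list-valued entries are joined with spaces here, which may need adjusting to the argument parser in the released MQ.zip, and only a subset of keys is shown:

    config = {
        "dataset": "ego4d",
        "is_train": "true",
        "feature_path": "data/egovlp_feats_official",  # point this at your feature dir
        "batch_size": 32,
        "train_lr": 5e-05,
        "num_epoch": 50,
        "step_size": 15,
        "step_gamma": 0.1,
        "focal_alpha": 0.01,
        "temporal_scale": 928,
        "input_feat_dim": 2304,
        "iou_thr": [0.5, 0.5, 0.7],
        # ... remaining keys as listed above
    }

    def to_flag(key, value):
        # Flatten lists into space-separated values; everything else is stringified.
        if isinstance(value, (list, tuple)):
            value = " ".join(str(v) for v in value)
        return f"--{key} {value}"

    print("python Train.py " + " ".join(to_flag(k, v) for k, v in config.items()))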

QinghongLin avatar Sep 19 '22 04:09 QinghongLin