EgoVLP
About NLQ results.
Hello.
Thanks for such nice work!
We have some questions and would appreciate your help.
We used your EgoVLP_PT_BEST checkpoint to extract the video features, and we trained VSLNet with those features and the bert checkpoint from EgoVLP_PT_BEST.
We can't seem to reach the precision you report, and we only get about 7~8 [email protected].
Thanks,
We got similar results (~8 [email protected]) with the default settings, and we further boosted the performance with some hyperparameter tuning (e.g., learning rate, batch size).

I attached our config.json and the log of our best results here in model.zip; I hope it helps you reproduce the results.
Please reach out if you have new updates.
Thanks for your response.
For feature extraction, does the model contain the video projection (video_dim -> 256) and the text projection (text_dim -> 256)? Are the channels of the video and text features both 256?
Yes, during the feature extraction, the model contains video_proj and text_proj, and the channels of the video and text features are 256.
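For reference, here is a minimal sketch of what those projection heads look like. This is not the actual EgoVLP code; the input dimensions (video_dim, text_dim = 768) are assumptions, only the 256-d output matches the discussion above.

```python
# Minimal sketch of the projection heads described above (not the official
# EgoVLP implementation; the 768-d input dimensions are assumed).
import torch.nn as nn

class ProjectionHeads(nn.Module):
    def __init__(self, video_dim=768, text_dim=768, embed_dim=256):
        super().__init__()
        # video_proj / text_proj map each modality into a shared 256-d space
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, video_feat, text_feat):
        v = self.video_proj(video_feat)  # (B, 256) video feature
        t = self.text_proj(text_feat)    # (B, 256) text feature
        return v, t
```

Both outputs land in the same 256-d space, which is why the extracted video and text features share the same channel size.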
Is args.token set to True when extracting the text feature? We find that the extracted text feature is 1x256 by default.
In our experiments, using Lx256 and using 1x256 give similar performance, but both are weaker than using Lx768. Using Lx768 obtains performance similar to your results, but there is still a gap of about 0.4.
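For concreteness, here is a minimal sketch of where the two shapes come from. This is not the actual extraction script: the DistilBERT backbone name and the standalone text_proj layer are assumptions standing in for the checkpoint's modules.

```python
# Sketch of the two text-feature variants discussed above:
#   per-token features (L x 768) taken before the projection, vs.
#   a pooled, projected feature (1 x 256).
# Not the official EgoVLP extraction code; backbone and projection are assumed.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
text_model = AutoModel.from_pretrained("distilbert-base-uncased").eval()
text_proj = torch.nn.Linear(768, 256)  # stands in for the checkpoint's text_proj

def extract_text_features(query: str):
    inputs = tokenizer(query, return_tensors="pt")
    with torch.no_grad():
        hidden = text_model(**inputs).last_hidden_state.squeeze(0)  # (L, 768)
    pooled = hidden[0]                          # [CLS]-style pooling, (768,)
    projected = text_proj(pooled).unsqueeze(0)  # (1, 256)
    return hidden, projected                    # L x 768 and 1 x 256
```

Feeding the per-token Lx768 features to VSLNet keeps word-level information, which presumably explains why it outperforms the single pooled 256-d vector in these experiments.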
@takfate
Hi, the NLQ experiments were implemented by my collaborator Mattia, so I may have misaligned some details. I attach our VSLNet code implementation here so that you can refer to the relevant details.