EgoVLP
About NLQ results.
Hello.
Thanks for such nice work!
We have some questions and would appreciate your help.
We used your EgoVLP_PT_BEST checkpoint to extract the video features, and we trained VSLNet with those features and the bert checkpoint from EgoVLP_PT_BEST.
We can't seem to reach the precision you report, and we only get about 7~8 [email protected].
Thanks,
We got similar results (~8 [email protected]) with the default settings, and we further boosted the performance with some hyperparameter tuning (e.g., learning rate, batch size).

I attached our config.json and the log of our best results here in model.zip; I hope it helps you reproduce the results.
Please reach out if you have new updates.
Thanks for your response.
For feature extraction, does the model contain the video projection (video_dim -> 256) and the text projection (text_dim -> 256)? Are the channels of the video and text features both 256?
Yes, during the feature extraction, the model contains video_proj and text_proj, and the channels of the video and text features are 256.
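For reference, here is a minimal sketch of what those projection heads look like. This is not the actual EgoVLP code; the input dimensions (video_dim, text_dim = 768) are assumptions, only the 256-d output matches the discussion above.

```python
# Minimal sketch of the projection heads described above (not the official
# EgoVLP implementation; the 768-d input dimensions are assumed).
import torch.nn as nn

class ProjectionHeads(nn.Module):
    def __init__(self, video_dim=768, text_dim=768, embed_dim=256):
        super().__init__()
        # video_proj / text_proj map each modality into a shared 256-d space
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, video_feat, text_feat):
        v = self.video_proj(video_feat)  # (B, 256) video feature
        t = self.text_proj(text_feat)    # (B, 256) text feature
        return v, t
```

Both outputs land in the same 256-d space, which is why the extracted video and text features share the same channel size.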
Is args.token set to True when extracting the text feature? We find that the extracted text feature is 1x256 by default.
In our experiments, using Lx256 and using 1x256 give similar performance, but both are weaker than using Lx768. Using Lx768 obtains performance similar to your results, but there is still a gap of about 0.4.
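For concreteness, here is a minimal sketch of where the two shapes come from. This is not the actual extraction script: the DistilBERT backbone name and the standalone text_proj layer are assumptions standing in for the checkpoint's modules.

```python
# Sketch of the two text-feature variants discussed above:
#   per-token features (L x 768) taken before the projection, vs.
#   a pooled, projected feature (1 x 256).
# Not the official EgoVLP extraction code; backbone and projection are assumed.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
text_model = AutoModel.from_pretrained("distilbert-base-uncased").eval()
text_proj = torch.nn.Linear(768, 256)  # stands in for the checkpoint's text_proj

def extract_text_features(query: str):
    inputs = tokenizer(query, return_tensors="pt")
    with torch.no_grad():
        hidden = text_model(**inputs).last_hidden_state.squeeze(0)  # (L, 768)
    pooled = hidden[0]                          # [CLS]-style pooling, (768,)
    projected = text_proj(pooled).unsqueeze(0)  # (1, 256)
    return hidden, projected                    # L x 768 and 1 x 256
```

Feeding the per-token Lx768 features to VSLNet keeps word-level information, which presumably explains why it outperforms the single pooled 256-d vector in these experiments.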
@takfate
Hi, the NLQ experiments were implemented by my collaborator Mattia, so I may have misaligned some details. I attach our VSLNet code implementation here so that you can refer to the relevant details.