Request for finetuned InternVideo2-1B results on video retrieval benchmarks
Hi, great work and thanks for releasing the code. In Table 10 of your InternVideo2 paper, you reported the results of finetuning video retrieval in both T2V and V2T on MSR-VTT, LSMDC, DiDeMo, MSVD, ActivityNet, and VATEX for the 6B model.
Could you please provide the results for the finetuned InternVideo2-1B model as well?
This would be very helpful for literature comparisons with models of similar size.
Thanks a lot
Hi! Considering the cost of finetuning on diverse downstream datasets, we only provide the zero-shot results~
@roberto-amoroso were you able to obtain the authors' results for MSRVTT (zero-shot)?
@Andy1621 Thanks for your reply. Yes, I am aware that finetuning the model could be expensive, so I was hoping you have some internal results of your 1B model finetuned on MSRVTT that you could share... thanks anyway
@nsreeprem do you mean if I was able to reproduce the 0-shot performance presented in Table 9 of the paper?
@roberto-amoroso yes, I meant to ask if you were able to reproduce the zero-shot R@1 results. I am getting an R@1 of close to 47%, about 5% lower than what is reported in Table 9.
@nsreeprem The 0-shot performance I measured on MSRVTT by using the s2-1B model is 51.8 (0.1% lower) for T2V R@1 and 49.3 (1.6% lower) for V2T R@1. These results are obtained by considering the ITM re-ranking stage (i.e., what is called msrvtt_1k_test_match in the metrics log)
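For anyone comparing numbers, here is a minimal sketch of how T2V and V2T R@1 are typically computed from a text-to-video similarity matrix. It is not the repository's evaluation code: the function names `recall_at_k` and `retrieval_r1` are made up for illustration, and it assumes exactly one caption per video, so that the correct match for row i is column i. If ITM re-ranking is used, the matrix passed in would be the re-ranked scores.

```python
import numpy as np


def recall_at_k(sim: np.ndarray, k: int = 1) -> float:
    """Recall@k in percent for a square query-by-candidate score matrix.

    Assumption (not from the repo): query i matches candidate i.
    """
    n = sim.shape[0]
    # Score of the ground-truth pair for each query.
    gt = sim[np.arange(n), np.arange(n)]
    # Rank = number of candidates scoring strictly higher than the ground truth.
    ranks = (sim > gt[:, None]).sum(axis=1)
    return 100.0 * float((ranks < k).mean())


def retrieval_r1(sim_t2v: np.ndarray) -> dict:
    """R@1 in both directions; V2T uses the transposed matrix."""
    return {
        "t2v_r1": recall_at_k(sim_t2v, 1),
        "v2t_r1": recall_at_k(sim_t2v.T, 1),
    }
```

Differences in the number of frames, the test split, or whether re-ranking is applied all change the matrix that goes in, so it is worth checking those before comparing R@1 values.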
@Andy1621 did you use the 9k or 7k MSR-VTT train split for finetuning the 6B model (Table 10)?
We follow Unmasked Teacher to finetune it on the downstream tasks.
> s2-1B model

Where can I get this model? Thanks.
The finetuned (ft) performance gap between 1B and 6B is close to the zero-shot (zs) performance gap, so you can estimate it from that.
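A rough way to apply that estimate, as a sketch only: subtract the zero-shot gap between the two models from the finetuned 6B number. The function name `estimate_finetuned_small` is hypothetical, and the inputs have to come from the paper's tables (for example Table 9 for zero-shot and Table 10 for finetuned 6B); the result is an approximation, not a measured number.

```python
def estimate_finetuned_small(ft_large: float, zs_large: float, zs_small: float) -> float:
    """Estimate the finetuned score of the smaller model.

    Assumes the finetuned gap between the two models roughly equals
    their zero-shot gap, as suggested in this thread.
    """
    zs_gap = zs_large - zs_small  # zero-shot gap, e.g. 6B minus 1B
    return ft_large - zs_gap      # apply the same gap to the finetuned 6B score
```

This should be reported as an estimate in any comparison table, since no finetuned 1B results have been released.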
@Andy1621 @leexinhao would you be releasing the hyperparameters for finetuning the 1B or 6B model?
You can refer to https://github.com/OpenGVLab/unmasked_teacher; we use similar hyperparameters, except that we use DeepSpeed.
> @nsreeprem The 0-shot performance I measured on MSRVTT by using the s2-1B model is 51.8 (0.1% lower) for T2V R@1 and 49.3 (1.6% lower) for V2T R@1. These results are obtained by considering the ITM re-ranking stage (i.e., what is called msrvtt_1k_test_match in the metrics log)
Did you use the hyperparameters as reported in config.py? I get a big performance gap between my reproduced version and InternVideo2-1B-stage2-f4.
I also tried to finetune the internvideo2-stage2 model, but performance decreased after finetuning, which should not happen. After finetuning, I get the following results on the MSRVTT dataset:
V2t_r1 → 41.6 T2v_r1 → 42.7
whereas the zero-shot results were:
V2t_r1 → 49.9 T2v_r1 → 52.1
It would be great if we could get a script for finetuning as well.
That's strange. We use a config similar to Unmasked Teacher's; you can try reducing the learning rate by a factor of 10.
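As a sketch of that suggestion, assuming the finetuning config exposes the learning rate under an `optimizer` entry with an `lr` key (this layout and the helper name `scale_lr` are assumptions, so check the actual config file):

```python
import copy


def scale_lr(config: dict, factor: float = 0.1) -> dict:
    """Return a copy of the config with the learning rate scaled by `factor`.

    Assumes a hypothetical layout config["optimizer"]["lr"]; adjust the keys
    to match the real finetuning config.
    """
    new_config = copy.deepcopy(config)  # leave the original config untouched
    new_config["optimizer"]["lr"] = config["optimizer"]["lr"] * factor
    return new_config
```

A drop below the zero-shot numbers after finetuning is often a sign that the updates are too aggressive for the pretrained weights, which is why a 10x smaller learning rate is a reasonable first thing to try.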
> Did you use the hyperparameters as reported in config.py? I get a big performance gap between my reproduced version and InternVideo2-1B-stage2-f4.
Me too. Have you resolved this issue? If so, I would like to know how.
> @nsreeprem The 0-shot performance I measured on MSRVTT by using the s2-1B model is 51.8 (0.1% lower) for T2V R@1 and 49.3 (1.6% lower) for V2T R@1. These results are obtained by considering the ITM re-ranking stage (i.e., what is called msrvtt_1k_test_match in the metrics log)
Hello, could you please share the code for reproducing zero-shot results on MSR-VTT? I’m a beginner and unsure how to reproduce the results based on the author’s code.
@trahman8 Since it's been six months since your last question, I was wondering if there have been any performance improvements in your fine-tuned model on the MSRVTT dataset, and what level it has achieved now?