
Request for finetuned InternVideo2-1B results on video retrieval benchmarks

Open roberto-amoroso opened this issue 1 year ago • 18 comments

Hi, great work and thanks for releasing the code. In Table 10 of your InternVideo2 paper, you reported the results of finetuning video retrieval in both T2V and V2T on MSR-VTT, LSMDC, DiDeMo, MSVD, ActivityNet, and VATEX for the 6B model.

Could you please provide the results for the finetuned InternVideo2-1B model as well?

This would be very helpful for literature comparisons with models of similar size.

Thanks a lot

roberto-amoroso avatar Jun 11 '24 19:06 roberto-amoroso

Hi! Considering the cost of finetuning on diverse downstream datasets, we only provide the zero-shot results~

Andy1621 avatar Jun 12 '24 03:06 Andy1621

@roberto-amoroso were you able to obtain the authors' results for MSRVTT (zero-shot)?

nsreeprem avatar Jun 12 '24 06:06 nsreeprem

Hi! Considering the cost of finetuning on diverse downstream datasets, we only provide the zero-shot results~

@Andy1621 Thanks for your reply. Yes, I am aware that finetuning the model could be expensive, so I was hoping you have some internal results of your 1B model finetuned on MSRVTT that you could share... thanks anyway

roberto-amoroso avatar Jun 12 '24 09:06 roberto-amoroso

@roberto-amoroso were you able to obtain the authors' results for MSRVTT (zero-shot)?

@nsreeprem do you mean whether I was able to reproduce the zero-shot performance presented in Table 9 of the paper?

roberto-amoroso avatar Jun 12 '24 09:06 roberto-amoroso

@roberto-amoroso yes, I meant to ask if you were able to reproduce the results for zero-shot R@1. I am finding close to 47% R@1 (~5% lower) than what is reported in Table 9.

nsreeprem avatar Jun 12 '24 10:06 nsreeprem

@nsreeprem The 0-shot performance I measured on MSRVTT by using the s2-1B model is 51.8 (0.1% lower) for T2V R@1 and 49.3 (1.6% lower) for V2T R@1. These results are obtained by considering the ITM re-ranking stage (i.e., what is called msrvtt_1k_test_match in the metrics log)

roberto-amoroso avatar Jun 12 '24 10:06 roberto-amoroso
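
For reference, a minimal sketch of how R@1 with an ITM re-ranking stage is typically computed: the dual encoders produce a fast similarity matrix, the top-k candidates per query are re-scored with the cross-attention ITM head, and the re-ranked top-1 is checked against the ground truth. The function name `itm_score` is a placeholder, not the repository's actual API.

```python
import numpy as np

def t2v_recall_at_1_with_itm(sim, itm_score, top_k=16):
    """Compute T2V R@1 from a [num_texts, num_videos] similarity matrix,
    re-ranking the top-k candidates of each text with an ITM head.

    itm_score: callable(text_idx, candidate_video_indices) -> matching scores
               (placeholder for the model's cross-attention matching head)
    """
    num_texts = sim.shape[0]
    hits = 0
    for t in range(num_texts):
        # Fast candidates from the dual-encoder similarities
        candidates = np.argsort(-sim[t])[:top_k]
        # Slower re-ranking with the ITM matching scores
        best = candidates[np.argmax(itm_score(t, candidates))]
        # On the MSR-VTT 1k test set, text t is paired with video t
        hits += int(best == t)
    return 100.0 * hits / num_texts
```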

@Andy1621 did you use the 9k or 7k MSR-VTT train split for finetuning the 6B model (Table 10)?

roberto-amoroso avatar Jun 12 '24 11:06 roberto-amoroso

We follow Unmasked Teacher to finetune it on the downstream tasks.

Andy1621 avatar Jun 12 '24 12:06 Andy1621

s2-1B model

Where can I get this model? Thanks.

pribadihcr avatar Jun 20 '24 08:06 pribadihcr

Hi! Considering the cost of finetuning on diverse downstream datasets, we only provide the zero-shot results~

@Andy1621 Thanks for your reply. Yes, I am aware that finetuning the model could be expensive, so I was hoping you have some internal results of your 1B model finetuned on MSRVTT that you could share... thanks anyway

The fine-tuned (ft) performance gap of 1B compared to 6B is close to the zero-shot (zs) performance gap, so you can estimate it.

leexinhao avatar Jun 26 '24 04:06 leexinhao
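
In other words, a rough back-of-the-envelope estimate; the numbers below are dummy placeholders to be replaced with the actual entries from Tables 9 and 10, not values asserted here.

```python
# Rough estimate of the fine-tuned 1B R@1 from the zero-shot gap between model sizes.
# The numbers below are dummy placeholders, not values from the paper; substitute the
# actual entries from Tables 9 and 10.
zs_6b = 55.0  # zero-shot R@1 of the 6B model (Table 9), placeholder
zs_1b = 52.0  # zero-shot R@1 of the 1B model (Table 9), placeholder
ft_6b = 62.0  # fine-tuned R@1 of the 6B model (Table 10), placeholder

# Assumption from the reply above: the fine-tuned gap roughly tracks the zero-shot gap.
ft_1b_estimate = ft_6b - (zs_6b - zs_1b)
print(ft_1b_estimate)  # 59.0 with these placeholder numbers
```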

@Andy1621 @leexinhao would you be releasing the hyperparameters for finetuning the 1B or 6B model?

nsreeprem avatar Jun 26 '24 08:06 nsreeprem

@Andy1621 @leexinhao would you be releasing the hyperparameters for finetuning the 1B or 6B model?

You could refer to https://github.com/OpenGVLab/unmasked_teacher, we use similar hyperparameters except using deepspeed.

leexinhao avatar Aug 16 '24 07:08 leexinhao
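
For anyone looking for a starting point, a hypothetical sketch of what a UMT-style retrieval fine-tuning config with DeepSpeed might contain; every value here is an assumption to be checked against the unmasked_teacher configs, not the authors' actual settings.

```python
# Hypothetical fine-tuning settings, loosely following UMT-style retrieval configs.
# None of these values are confirmed by the authors; check the configs in
# https://github.com/OpenGVLab/unmasked_teacher and scale for the 1B/6B backbones.
optimizer = dict(
    opt="adamW",
    lr=2e-5,                 # retrieval fine-tuning typically uses a small LR
    weight_decay=0.02,
    betas=(0.9, 0.999),
)
scheduler = dict(sched="cosine", epochs=5, warmup_epochs=0.5)
batch_size = 32              # per-GPU batch size, placeholder
num_frames = 4               # the stage-2 "f4" checkpoints use 4 frames
deepspeed = dict(enable=True, stage=1, bf16=True)  # the only stated difference from UMT
```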

@nsreeprem The 0-shot performance I measured on MSRVTT by using the s2-1B model is 51.8 (0.1% lower) for T2V R@1 and 49.3 (1.6% lower) for V2T R@1. These results are obtained by considering the ITM re-ranking stage (i.e., what is called msrvtt_1k_test_match in the metrics log)

Did you use the hyperparameters as reported in config.py? I get a big performance gap between my reproduced version and InternVideo2-1B-stage2-f4.

haoyi199815 avatar Aug 20 '24 07:08 haoyi199815

I also tried to finetune the internvideo2-stage2 model, but after fine-tuning the performance decreases, which it should not. After fine-tuning I get the following results on the MSRVTT dataset:

V2t_r1 → 41.6 T2v_r1 → 42.7

whereas the zero-shot results were:

V2t_r1 → 49.9 T2v_r1 → 52.1

It would be great if we could get a fine-tuning script as well.

trahman8 avatar Oct 18 '24 22:10 trahman8

I also tried to finetune the internvideo2-stage2 model, but after fine-tuning the performance decreases, which it should not. After fine-tuning I get the following results on the MSRVTT dataset:

V2t_r1 → 41.6 T2v_r1 → 42.7

whereas the zero-shot results were:

V2t_r1 → 49.9 T2v_r1 → 52.1

It would be great if we could get a fine-tuning script as well.

That's strange; we use a similar config to Unmasked Teacher. You can try reducing the learning rate by a factor of 10.

leexinhao avatar Oct 19 '24 11:10 leexinhao
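
Concretely, that suggestion amounts to something like the following before launching fine-tuning; the config structure is illustrative, not the repository's actual layout.

```python
# Reduce the fine-tuning learning rate by a factor of 10, as suggested above.
# The config object below is a stand-in; the real attribute path may differ.
from types import SimpleNamespace

config = SimpleNamespace(optimizer=SimpleNamespace(lr=2e-5))  # placeholder base LR
config.optimizer.lr /= 10
print(config.optimizer.lr)  # 2e-06
```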

@nsreeprem The 0-shot performance I measured on MSRVTT by using the s2-1B model is 51.8 (0.1% lower) for T2V R@1 and 49.3 (1.6% lower) for V2T R@1. These results are obtained by considering the ITM re-ranking stage (i.e., what is called msrvtt_1k_test_match in the metrics log)

Did you use the hyperparameters as reported in config.py? I get a big performance gap between my reproduced version and InternVideo2-1B-stage2-f4.

Me too. Have you resolved this issue? I would like to know how it was resolved.

Volibear1234 avatar Nov 19 '24 09:11 Volibear1234

@nsreeprem The 0-shot performance I measured on MSRVTT by using the s2-1B model is 51.8 (0.1% lower) for T2V R@1 and 49.3 (1.6% lower) for V2T R@1. These results are obtained by considering the ITM re-ranking stage (i.e., what is called msrvtt_1k_test_match in the metrics log)

Hello, could you please share the code for reproducing zero-shot results on MSR-VTT? I’m a beginner and unsure how to reproduce the results based on the author’s code.

Volibear1234 avatar Nov 20 '24 03:11 Volibear1234
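
Until an official evaluation script is pointed to, here is a minimal, generic sketch of how the zero-shot retrieval recalls are usually computed once paired text and video embeddings have been extracted; this is not the repository's implementation, and it omits the ITM re-ranking stage discussed above.

```python
import numpy as np

def retrieval_recalls(text_emb, video_emb, ks=(1, 5, 10)):
    """Compute T2V and V2T R@k, assuming text_emb[i] is paired with video_emb[i]
    and both are L2-normalized [N, D] arrays."""
    sim = text_emb @ video_emb.T            # [N, N] cosine similarities
    t2v_rank = np.argsort(-sim, axis=1)     # videos ranked for each text
    v2t_rank = np.argsort(-sim.T, axis=1)   # texts ranked for each video
    gt = np.arange(sim.shape[0])[:, None]   # ground-truth index for each query

    results = {}
    for k in ks:
        results[f"t2v_r{k}"] = 100.0 * (t2v_rank[:, :k] == gt).any(axis=1).mean()
        results[f"v2t_r{k}"] = 100.0 * (v2t_rank[:, :k] == gt).any(axis=1).mean()
    return results
```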

I also tried to finetune the internvideo2-stage2 model, but after fine-tuning the performance decreases, which it should not. After fine-tuning I get the following results on the MSRVTT dataset:

V2t_r1 → 41.6 T2v_r1 → 42.7

whereas the zero-shot results were:

V2t_r1 → 49.9 T2v_r1 → 52.1

It would be great if we could get a fine-tuning script as well.

@trahman8 Since it's been six months since your last question, I was wondering if there have been any performance improvements in your fine-tuned model on the MSRVTT dataset, and what level it has achieved now?

Eliza-and-black avatar Mar 19 '25 08:03 Eliza-and-black