[InternVideo] Confusion about the zero-shot setting for video-text retrieval

Thank you for your interesting work and for sharing the code! I'm confused about whether the zero-shot performance on MSRVTT reported here requires setting “--mergeclip=True”. Below are the results I reproduced: “--mergeclip=True”: “--mergeclip=False”:

As the provided file defaults to "--mergeclip=True", I wonder whether something is wrong here.

Mar 22 '24 14:03 Cola-any

It seems that with “merge=True”, the results are better than those presented in the paper?

Mar 26 '24 09:03 1240446371

Yes. It seems that the results reported in the paper were obtained by setting “merge=True” without DSL.

Mar 26 '24 09:03 Cola-any

I tested the performance on ActivityNet and obtained better results with “merge=True” and DSL, but worse results with “merge=True” and no DSL (worse than those presented in the paper). The author replied to someone else that they use the DSL results. I am also confused about which setting they used.

Mar 27 '24 05:03 1240446371

Hi, were you able to resolve the confusion?

May 03 '24 03:05 Hari-Durai-Baskar
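
For anyone comparing numbers "with DSL" and "without DSL" as discussed above, here is a minimal sketch of what that post-processing step usually looks like. It assumes DSL means the dual softmax re-weighting commonly applied to the text-video similarity matrix at inference time. It is not taken from the InternVideo code: the function names, the toy matrix, and the temperature value are all hypothetical, and the repository's actual implementation may differ.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dual_softmax(sim, temperature=100.0):
    """Re-weight a text-to-video similarity matrix (hypothetical helper).

    sim[i, j] is the similarity between text i and video j.
    The prior is a softmax over all texts for each video (axis=0),
    so it needs the whole test query set at once.
    """
    prior = softmax(sim * temperature, axis=0)
    return sim * prior


def recall_at_1(sim):
    """Text-to-video R@1, assuming text i matches video i."""
    pred = sim.argmax(axis=1)
    return float((pred == np.arange(sim.shape[0])).mean())


# Toy matrix (made up): video 1 acts as a "hub" that attracts texts 0 and 2.
sim = np.array([
    [0.30, 0.35, 0.10],
    [0.20, 0.60, 0.10],
    [0.10, 0.35, 0.30],
])

print("R@1 without DSL:", recall_at_1(sim))
print("R@1 with DSL:   ", recall_at_1(dual_softmax(sim)))
```

On this toy matrix the re-weighting changes the top-1 video for texts 0 and 2. Because the prior is computed over all test queries together, numbers with and without this step are not directly comparable, which may explain part of the gap discussed in this thread.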