Evaluation results on MVBench different from the paper
Hi, I have tested the VideoChat2 model on my server and found that the results differ from those in the paper. My results are as follows: {"Action Sequence": 66.0, "Action Prediction": 47.5, "Action Antonym": 83.5, "Fine-grained Action": 49.5, "Unexpected Action": 60.0, "Object Existence": 58.0, "Object Interaction": 71.5, "Object Shuffle": 41.5, "Moving Direction": 23.0, "Action Localization": 22.5, "Scene Transition": 88.5, "Action Count": 39.5, "Moving Count": 42.0, "Moving Attribute": 58.5, "State Change": 44.0, "Fine-grained Pose": 49.0, "Character Order": 36.5, "Egocentric Navigation": 35.0, "Episodic Reasoning": 38.5, "Counterfactual Inference": 65.0, "Avg": 50.975} The results for OS, AL, AC, ER, and CI differ from the paper. Could you help me find the reasons?
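As a sanity check on the numbers above: the reported "Avg" is simply the mean of the 20 per-task accuracies (the 57.999… entry in the raw output is just floating-point noise from percentage arithmetic). A quick verification, using the scores as posted:

```python
# Per-task MVBench accuracies reported above (percent).
scores = {
    "Action Sequence": 66.0, "Action Prediction": 47.5, "Action Antonym": 83.5,
    "Fine-grained Action": 49.5, "Unexpected Action": 60.0,
    "Object Existence": 58.0, "Object Interaction": 71.5, "Object Shuffle": 41.5,
    "Moving Direction": 23.0, "Action Localization": 22.5,
    "Scene Transition": 88.5, "Action Count": 39.5, "Moving Count": 42.0,
    "Moving Attribute": 58.5, "State Change": 44.0, "Fine-grained Pose": 49.0,
    "Character Order": 36.5, "Egocentric Navigation": 35.0,
    "Episodic Reasoning": 38.5, "Counterfactual Inference": 65.0,
}

# Unweighted mean over the 20 tasks matches the reported Avg.
avg = sum(scores.values()) / len(scores)
print(round(avg, 3))  # 50.975
```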
Hi! Could you provide your environment list, like torch and CUDA version?
Hi, python=3.10.13, torch=1.13.1+cu117, torchvision=0.14.1+cu117, cuda=11.7.
For me, the code runs on an A100 with:
Python=3.7.12
cuda=11.7
torch=1.13.1+cu117
torchvision=0.14.1+cu117
Hi, I have tested the VideoChat2 model on an A100 with python=3.8, torch=1.13.1+cu117, torchvision=0.14.1+cu117, cuda=11.7. My result for "Episodic Reasoning" is 38.5%, which differs from the paper; all the other results match. Could you help me find the reason?
Hi! I think the reason is that you are using an old version of the inference code. In the new version, I set the flag to True to use the temporal boundaries, which slightly improves the results.
@emmating12 Hi, we have the same reproduction results. Did you find a way to reproduce the performance on Episodic Reasoning?
@Andy1621 Thanks for the info. I used mvbench.ipynb with the flag set to True for Episodic Reasoning, but the performance is still 38.5% instead of 40.5%. Do you have any other suggestions?
Hi! I'm not sure whether you ran the inference correctly.
Originally, when I tested MVBench, I forgot to use the start and end timestamps for TVQA, so I also got 38.5%, the same as you.
After I fixed the bug and used start and end (by setting the flag to True), the result increased as expected to 40.5%.
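For anyone hitting the same 38.5% vs. 40.5% gap: the fix amounts to restricting uniform frame sampling to the annotated [start, end] segment of the TVQA clip instead of sampling over the whole video. A minimal sketch of the idea (function and argument names here are illustrative, not the repo's actual API):

```python
def sample_frame_indices(num_frames, total_frames, fps,
                         start=None, end=None, use_bound=True):
    """Uniformly sample frame indices; optionally clip to [start, end] seconds.

    When use_bound is True and start/end annotations exist (as for TVQA),
    sampling is restricted to the annotated segment, which is what fixing
    the Episodic Reasoning inference boils down to.
    """
    if use_bound and start is not None and end is not None:
        lo = max(0, int(start * fps))
        hi = min(total_frames - 1, int(end * fps))
    else:
        lo, hi = 0, total_frames - 1
    # Take the center of num_frames equal bins over [lo, hi].
    seg = (hi - lo + 1) / num_frames
    return [int(lo + seg / 2 + seg * i) for i in range(num_frames)]
```

With the bound disabled (or missing annotations), the indices span the full video, reproducing the buggy 38.5% behavior; with it enabled, only frames inside the annotated window are fed to the model.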