Evaluation results on MVBench different from the paper
Hi, I have tested the VideoChat2 model on my server and found that the results differ from those in the paper. My results are as follows: {"Action Sequence": 66.0, "Action Prediction": 47.5, "Action Antonym": 83.5, "Fine-grained Action": 49.5, "Unexpected Action": 60.0, "Object Existence": 58.0, "Object Interaction": 71.5, "Object Shuffle": 41.5, "Moving Direction": 23.0, "Action Localization": 22.5, "Scene Transition": 88.5, "Action Count": 39.5, "Moving Count": 42.0, "Moving Attribute": 58.5, "State Change": 44.0, "Fine-grained Pose": 49.0, "Character Order": 36.5, "Egocentric Navigation": 35.0, "Episodic Reasoning": 38.5, "Counterfactual Inference": 65.0, "Avg": 50.975} The results for OS, AL, AC, ER, and CI differ from the paper. Could you help me find the reasons?
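As a sanity check on the numbers above: the reported "Avg" is simply the mean of the 20 per-task accuracies (the 57.999… entry in the raw output is just floating-point noise from percentage arithmetic). A quick verification, using the scores as posted:

```python
# Per-task MVBench accuracies reported above (percent).
scores = {
    "Action Sequence": 66.0, "Action Prediction": 47.5, "Action Antonym": 83.5,
    "Fine-grained Action": 49.5, "Unexpected Action": 60.0,
    "Object Existence": 58.0, "Object Interaction": 71.5, "Object Shuffle": 41.5,
    "Moving Direction": 23.0, "Action Localization": 22.5,
    "Scene Transition": 88.5, "Action Count": 39.5, "Moving Count": 42.0,
    "Moving Attribute": 58.5, "State Change": 44.0, "Fine-grained Pose": 49.0,
    "Character Order": 36.5, "Egocentric Navigation": 35.0,
    "Episodic Reasoning": 38.5, "Counterfactual Inference": 65.0,
}

# Unweighted mean over the 20 tasks matches the reported Avg.
avg = sum(scores.values()) / len(scores)
print(round(avg, 3))  # 50.975
```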
Hi! Could you provide your environment list, like torch and CUDA version?
Hi, python=3.10.13, torch=1.13.1+cu117, torchvision=0.14.1+cu117, cuda=11.7.
For me, the code runs on an A100 with:
Python=3.7.12
cuda=11.7
torch=1.13.1+cu117
torchvision=0.14.1+cu117
Hi, I have tested the VideoChat2 model on an A100 with python=3.8, torch=1.13.1+cu117, torchvision=0.14.1+cu117, cuda=11.7. My result for "Episodic Reasoning" is 38.5%, which differs from the paper; all the other results match. Could you help me find the reason?
Hi! I think the reason is that you are using an old version of the inference code. In the new version, I set the flag to True to use the temporal boundaries, which slightly improves the results.
@emmating12 Hi, we have the same reproduction results. Did you find a way to reproduce the performance on Episodic Reasoning?
@Andy1621 Thanks for the info. I used mvbench.ipynb with the flag set to True for Episodic Reasoning, but the performance is still 38.5% instead of 40.5%. Do you have any other suggestions?
Hi! I'm not sure whether you ran the inference correctly.
Originally, when I tested MVBench, I forgot to use the start and end timestamps for TVQA, so I also got 38.5%, the same as you.
After I fixed the bug and used start and end (by setting the flag to True), the result increased as expected to 40.5%.
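For anyone hitting the same 38.5% vs. 40.5% gap: the fix amounts to restricting uniform frame sampling to the annotated [start, end] segment of the TVQA clip instead of sampling over the whole video. A minimal sketch of the idea (function and argument names here are illustrative, not the repo's actual API):

```python
def sample_frame_indices(num_frames, total_frames, fps,
                         start=None, end=None, use_bound=True):
    """Uniformly sample frame indices; optionally clip to [start, end] seconds.

    When use_bound is True and start/end annotations exist (as for TVQA),
    sampling is restricted to the annotated segment, which is what fixing
    the Episodic Reasoning inference boils down to.
    """
    if use_bound and start is not None and end is not None:
        lo = max(0, int(start * fps))
        hi = min(total_frames - 1, int(end * fps))
    else:
        lo, hi = 0, total_frames - 1
    # Take the center of num_frames equal bins over [lo, hi].
    seg = (hi - lo + 1) / num_frames
    return [int(lo + seg / 2 + seg * i) for i in range(num_frames)]
```

With the bound disabled (or missing annotations), the indices span the full video, reproducing the buggy 38.5% behavior; with it enabled, only frames inside the annotated window are fed to the model.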