
About eval result of InternVL3

Open Kyrie666 opened this issue 8 months ago • 7 comments

Hello, I have some questions about the leaderboard evaluation results. The leaderboard lists a MathVista score of 45.8 for the 1B model, but I measured 46.3 with VLMEvalKit.

For the 38B model, the leaderboard lists a MathVista score of 75.1, but I measured 71.1. That is a 4-point gap, which is quite significant. Could you please tell me what might cause it and what settings were used during the evaluation?

Kyrie666 avatar Apr 21 '25 03:04 Kyrie666

Image

Kyrie666 avatar Apr 21 '25 03:04 Kyrie666

Please export USE_COT="1" before running the VLMEvalKit scripts. The Chain-of-Thought (CoT) performance of models larger than 2B is significantly better than that without CoT.
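A minimal sketch of the step above, assuming the evaluation is launched through VLMEvalKit's `run.py` with its `--data` and `--model` flags. The helper `build_eval_invocation` is hypothetical, and the identifiers `InternVL3-38B` and `MathVista_MINI` are assumed names that should be checked against the installed VLMEvalKit version. The sketch only builds the environment and command; it does not start the evaluation.

```python
import os
import sys


def build_eval_invocation(model, dataset, use_cot=True):
    """Return (env, cmd) for a VLMEvalKit run, with USE_COT set or cleared.

    Hypothetical helper: the model and dataset strings are passed through
    unchanged, so they must match names known to your VLMEvalKit checkout.
    """
    env = dict(os.environ)          # copy, so the current process is untouched
    if use_cot:
        env["USE_COT"] = "1"        # same effect as: export USE_COT="1"
    else:
        env.pop("USE_COT", None)    # make sure a stale value is not inherited
    cmd = [sys.executable, "run.py", "--data", dataset, "--model", model]
    return env, cmd


# Assumed identifiers, for illustration only.
env, cmd = build_eval_invocation("InternVL3-38B", "MathVista_MINI")
print("USE_COT =", env.get("USE_COT"))
print("command :", " ".join(cmd[1:]))

# To actually launch it from the VLMEvalKit checkout, one could use:
#   import subprocess
#   subprocess.run(cmd, env=env, check=True)
```

Passing the variable through `env` rather than mutating `os.environ` keeps CoT and non-CoT runs easy to compare side by side in the same script.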

Weiyun1025 avatar May 06 '25 06:05 Weiyun1025

@Weiyun1025 What about small models such as the 1B? Does USE_COT also need to be set for them? Without USE_COT, my evaluation results are also lower than the scores reported in the blog.

Kyrie666 avatar May 06 '25 09:05 Kyrie666

The CoT performance of the 1B model is comparable to its performance without CoT. The MathVista accuracy of InternVL3-1B is 45.8 according to our evaluation, which appears to be comparable to your result (46.3).

Weiyun1025 avatar May 06 '25 11:05 Weiyun1025

The MMMU result is lower.

(Sent by email in reply to Weiyun1025's comment above on OpenGVLab/InternVL issue #1003.)

Kyrie666 avatar May 06 '25 12:05 Kyrie666

@Weiyun1025 Compared with the evaluation results in the blog, all results for the 1B model without USE_COT are lower, except for MathVista.

Kyrie666 avatar May 08 '25 06:05 Kyrie666

Excuse me, have you managed to reproduce the reported metrics for InternVL3?

Graysonicc avatar Sep 13 '25 06:09 Graysonicc