
Unable to reproduce llava_v1.5_7b scores on several benchmarks including MME, TextVQA, POPE etc.

Open · sanghyunna opened this issue 7 months ago • 4 comments

Hello! First of all, thank you very much for this outstanding work! I am currently trying to reproduce various benchmark results for llava_v1.5_7b, but I am running into some difficulties, so I would like to ask for your assistance. Here are the details of my setup (the exact versions that LLaVA requires in its official repo):

transformers              4.37.2
torch                     2.1.2
torchvision               0.16.2

GPU : 4x A5000

The OpenCompass leaderboard (which uses VLMEvalKit) gets results that are consistent with the paper, or at least much higher than what I get: an MME score of 1506.2, a TextVQA score of 45.5, a POPE score of 86.1, and so on. (The paper reports 1510.7, 58.2, and 87.3, respectively.)

However, when I try to evaluate the model myself, I get much lower scores. (This is with a fresh clone of VLMEvalKit and LLaVA as of May 29, 2025.)

  • MME (./scripts/run.sh --model llava_v1.5_7b --data MME)
---------------------  --------
perception             1373.46
reasoning               304.643
OCR                     130
artwork                 113.5
celebrity               124.118
code_reasoning           65
color                   156.667
commonsense_reasoning   107.143
count                   125
existence               185
landmark                134.75
numerical_calculation    40
position                110
posters                 142.177
scene                   152.25
text_translation         92.5
---------------------  --------
  • TextVQA (./scripts/run.sh --model llava_v1.5_7b --data TextVQA_VAL)
-  -----
0  21.86
-  -----
  • POPE (./scripts/run.sh --model llava_v1.5_7b --data POPE)
---------  -----------------  -----------------  -----------------  -----------------
split      Overall            random             adversarial        popular
Overall    80.19323671497585  82.11567732115678  78.01879971077368  80.55244494214259
acc        81.39999999999999  83.83333333333334  78.7               81.66666666666667
precision  90.59613769941225  95.65602836879432  85.22906793048973  91.51823579304495
recall     71.93333333333334  71.93333333333334  71.93333333333334  71.93333333333334
---------  -----------------  -----------------  -----------------  -----------------

So, the results would be:

Dataset    Official    OpenVLM Leaderboard    Reproduction
MME        1510.7      1506.2                 1373.46
TextVQA    58.2        45.5                   21.86
POPE       87.3        86.1                   80.19
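
For what it's worth, the reproduced numbers above are at least internally consistent, which suggests the gap comes from the raw predictions rather than from the score aggregation: the MME value of 1373.46 is exactly the sum of the ten perception sub-task scores, and the POPE "Overall" row is the F1 of the precision and recall rows. A quick sanity check in Python (values copied from the tables above):

```python
# Sanity check on the reproduced scores (values copied from the tables above).
perception = [130, 113.5, 124.118, 156.667, 125, 185,
              134.75, 110, 142.177, 152.25]           # OCR ... scene
reasoning = [65, 107.143, 40, 92.5]                   # code_reasoning ... text_translation

print(sum(perception))   # ~1373.46 -> MME "perception"
print(sum(reasoning))    # ~304.64  -> MME "reasoning"

# POPE: the "Overall" row is the F1 score of precision and recall.
precision, recall = 90.596, 71.933
print(2 * precision * recall / (precision + recall))  # ~80.19 -> POPE "Overall"
```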

Although I acknowledge that VLMEvalKit is not meant to precisely reproduce the reported scores, I believe these discrepancies are not trivial.

I have also confirmed that the LLM judge chatgpt-0125 was used correctly, by checking the API usage on the OpenAI website.

If anyone can kindly help me track down these differences, please let me know!


  • I might have found a probable cause for the low MME and POPE scores: the custom prompts do not seem to be applied properly, since these are YORN (yes-or-no) datasets rather than MCQ ones. Can anyone else confirm this? A rough way to check is sketched below.
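
This is only a sketch of the check I have in mind, assuming the standard VLMEvalKit helpers (supported_VLM, DATASET_TYPE, and the wrapper's use_custom_prompt() method) still look like this in the current version; import paths may differ between releases:

```python
# Sketch: check whether the llava_v1.5_7b wrapper applies a custom prompt for
# Y/N (YORN) datasets such as MME and POPE. Based on the usual VLMEvalKit
# interface; the DATASET_TYPE import path may differ between versions.
from vlmeval.config import supported_VLM
from vlmeval.dataset import DATASET_TYPE   # older releases expose this via vlmeval.utils

model = supported_VLM['llava_v1.5_7b']()   # note: this loads the model weights

for dataset in ['MME', 'POPE', 'MMBench_DEV_EN']:
    print(dataset,
          DATASET_TYPE(dataset),              # expected 'Y/N' for MME / POPE
          model.use_custom_prompt(dataset))   # False here would mean the default
                                              # prompt path is used for these sets
```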

sanghyunna avatar May 29 '25 14:05 sanghyunna

The problem is with generate_inner().

wangyunnan avatar Jun 03 '25 12:06 wangyunnan

The problem is with generate_inner().

Sorry for the late response. Can you provide more details on what the problem with generate_inner() is? Mine is from a freshly cloned version.
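
In the meantime, one way to see exactly what generate_inner() receives would be to wrap it and log the text prompt for each sample. A rough sketch (assuming the standard VLMEvalKit message format, i.e. a list of {'type': ..., 'value': ...} dicts, and a generate_inner(message, dataset=None) signature; the helper name is mine):

```python
# Hypothetical debugging helper (not part of VLMEvalKit): wrap a model's
# generate_inner() so that the text portion of every prompt is printed,
# which makes it easy to compare against the official LLaVA eval prompts.
import functools

def log_prompts(model):
    original = model.generate_inner

    @functools.wraps(original)
    def wrapped(message, dataset=None):
        texts = [seg['value'] for seg in message if seg.get('type') == 'text']
        print(f'[{dataset}] prompt: {" ".join(texts)}')
        return original(message, dataset=dataset)

    model.generate_inner = wrapped
    return model
```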

sanghyunna avatar Jun 06 '25 06:06 sanghyunna

I also encountered this issue, and my results are similar to yours. Did you resolve it?

AZYoung233 avatar Jul 21 '25 08:07 AZYoung233

@AZYoung233 Sorry for the late reply. I have not been able to solve this issue, so I had to use lmms-eval to evaluate instead.

sanghyunna avatar Jul 25 '25 04:07 sanghyunna