Unable to reproduce llava_v1.5_7b scores on several benchmarks, including MME, TextVQA, and POPE
Hello!
First of all, thank you very much for carrying out such outstanding work!
I am currently trying to reproduce various benchmark results of llava_v1.5_7b, but I'm encountering some difficulties, so I would like to ask for your assistance.
Here are the details of my setup (the exact versions that the official LLaVA repo requires):
transformers 4.37.2
torch 2.1.2
torchvision 0.16.2
GPU : 4x A5000
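For reference, a quick way to double-check that the evaluation environment actually picks up these pinned versions (nothing here beyond the standard package imports; purely a sanity check):

```python
# Sanity check: print the library versions and visible GPUs in the
# environment used for evaluation.
import torch
import torchvision
import transformers

print("transformers:", transformers.__version__)   # expected 4.37.2
print("torch:", torch.__version__)                  # expected 2.1.2
print("torchvision:", torchvision.__version__)      # expected 0.16.2
print("CUDA available:", torch.cuda.is_available(),
      "| GPU count:", torch.cuda.device_count())
```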
The OpenVLM Leaderboard on OpenCompass (which uses VLMEvalKit) reports results that are consistent with, or at least reasonably close to, those in the paper:
an MME score of 1506.2, a TextVQA score of 45.5, and a POPE score of 86.1.
(The paper reports 1510.7, 58.2, and 87.3, respectively.)
However, when I evaluate the model myself, I get much lower scores. (This is a fresh clone of both VLMEvalKit and LLaVA as of May 29, 2025.)
- MME (`./scripts/run.sh --model llava_v1.5_7b --data MME`)

  | Category | Score |
  |---|---|
  | perception | 1373.46 |
  | reasoning | 304.643 |
  | OCR | 130 |
  | artwork | 113.5 |
  | celebrity | 124.118 |
  | code_reasoning | 65 |
  | color | 156.667 |
  | commonsense_reasoning | 107.143 |
  | count | 125 |
  | existence | 185 |
  | landmark | 134.75 |
  | numerical_calculation | 40 |
  | position | 110 |
  | posters | 142.177 |
  | scene | 152.25 |
  | text_translation | 92.5 |
- TextVQA (`./scripts/run.sh --model llava_v1.5_7b --data TextVQA_VAL`)

  | Split | Accuracy |
  |---|---|
  | Overall | 21.86 |
- POPE (`./scripts/run.sh --model llava_v1.5_7b --data POPE`)

  | Metric | Overall | random | adversarial | popular |
  |---|---|---|---|---|
  | Overall | 80.19 | 82.12 | 78.02 | 80.55 |
  | acc | 81.40 | 83.83 | 78.70 | 81.67 |
  | precision | 90.60 | 95.66 | 85.23 | 91.52 |
  | recall | 71.93 | 71.93 | 71.93 | 71.93 |
So, to summarize:
| Dataset | Official | OpenVLM Leaderboard | Reproduction |
|---|---|---|---|
| MME | 1510.7 | 1506.2 | 1373.46 |
| TextVQA | 58.2 | 45.5 | 21.86 |
| POPE | 87.3 | 86.1 | 80.19 |
Although I acknowledge that VLMEvalKit is not meant to precisely reproduce the reported scores, I believe these discrepancies are non-trivial.
I have also confirmed that the LLM judge chatgpt-0125 was used correctly, by checking the API usage on the OpenAI webpage.
If anyone can kindly help me resolve these differences, please let me know!
- I might have found a probable cause for the low MME and POPE scores: the custom prompts do not seem to be applied properly, since these are YORN (yes-or-no) datasets rather than MCQ ones (a quick way to check this is sketched below). Can anyone else confirm this issue?
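A rough sketch of how this could be checked, assuming VLMEvalKit exposes the `supported_VLM` registry, the `DATASET_TYPE` helper, and a `use_custom_prompt()` method on its VLM wrappers (exact names and import paths may vary between versions):

```python
# Check whether the LLaVA wrapper applies its custom prompt for each dataset
# type, or falls back to the default (raw question) path for YORN datasets.
# Note: instantiating the wrapper loads the llava_v1.5_7b checkpoint.
from vlmeval.config import supported_VLM
from vlmeval.dataset import DATASET_TYPE  # may live elsewhere in older versions

model = supported_VLM['llava_v1.5_7b']()
for ds in ['MME', 'POPE', 'MMBench_DEV_EN']:
    print(ds, DATASET_TYPE(ds), model.use_custom_prompt(ds))
```

If `use_custom_prompt` returns False for MME and POPE, the yes/no questions would be fed to the model without the dataset-specific prompt handling, which could explain the gap.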
A problem with generate_inner()
Sorry for the late response.
Can you provide more details on what problems there are with generate_inner()?
My generate_inner() is a freshly cloned version.
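If it helps, a minimal single-sample call through the wrapper's generation path could look like the sketch below, assuming VLMEvalKit's structured-message interface of `{type, value}` dicts (older versions may expect a plain `[image_path, question]` list; the image path and question here are placeholders):

```python
# Run one image/question pair through the wrapper and print the raw output,
# so it can be compared against what the official LLaVA repo produces for
# the same inputs.
from vlmeval.config import supported_VLM

model = supported_VLM['llava_v1.5_7b']()
message = [
    dict(type='image', value='path/to/test_image.jpg'),   # placeholder path
    dict(type='text', value='Is there a dog in the image? Please answer yes or no.'),
]
print(model.generate(message, dataset='MME'))
```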
I also encountered this issue, and my results are similar to yours. Did you resolve it?
@AZYoung233
Sorry for the late reply.
I have not been able to solve this issue, so I had to use lmms-eval to evaluate instead.