
[Question] Cannot reproduce MME results on LLaVA-1.5-7B

Open Carol-lyh opened this issue 1 year ago • 8 comments

Question

I cannot reproduce the MME results after fine-tuning with finetune.sh on the 665k instruction-tuning dataset and running the provided MME evaluation scripts. We followed all of the settings but only get 1457.7, which is a large gap from the 1510.7 reported in the paper. The evaluation results on the other datasets seem reasonable (except that the ScienceQA result is noticeably higher).

Here are the results:

| exp   | GQA  | ScienceQA | TextVQA | POPE | MME    |
|-------|------|-----------|---------|------|--------|
| paper | 62.0 | 66.8      | 58.2    | 85.9 | 1510.7 |
| ours  | 62.6 | 70.8      | 58.3    | 85.8 | 1457.7 |

Carol-lyh avatar Oct 27 '23 06:10 Carol-lyh
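
For readers unfamiliar with how the MME number is computed: the 1510.7 / 1457.7 figures are MME perception scores, where each perception subtask contributes its accuracy plus "accuracy+" (the percentage of images for which both paired questions are answered correctly), summed across the perception subtasks. Below is a minimal sketch of that aggregation, with purely hypothetical subtask names and values:

```python
# Minimal sketch of how an MME-style perception score aggregates, assuming
# per-subtask accuracy and accuracy+ (both-questions-correct) percentages
# are already computed. Subtask names and numbers below are hypothetical.
def mme_perception_score(subtask_scores: dict[str, tuple[float, float]]) -> float:
    # Each subtask contributes acc + acc+, i.e. up to 200 points,
    # so ten perception subtasks give a maximum of 2000.
    return sum(acc + acc_plus for acc, acc_plus in subtask_scores.values())

example = {
    "existence": (95.0, 90.0),   # hypothetical (acc, acc+) in percent
    "count":     (80.0, 60.0),
    "position":  (70.0, 50.0),
    # ... remaining perception subtasks
}

print(f"perception score: {mme_perception_score(example):.1f}")
```

On this scale, the ~53-point gap spread over the perception subtasks amounts to only a few percentage points of accuracy per subtask metric, which is why relatively small run-to-run differences can move the total noticeably.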

Hi @Carol-lyh,

I am facing the same issue. Have you figured it out?

yix-chen avatar Oct 31 '23 05:10 yix-chen

I am also facing the same issue here. Has anyone managed to match the reported score?

becxer avatar Nov 25 '23 06:11 becxer

This may be due to some unexpected randomness in distributed training (https://github.com/haotian-liu/LLaVA/issues/864), although we haven't figured out where the randomness comes from -- the data mixture order is verified to be the same across runs, and there should not be any randomly initialized weights when starting from a pretrained projector.

This observed randomness leads to fluctuations on some benchmarks -- MME is the most prominent (I can get +/- 20 around the reported 1510 for the 7B model, and similarly for the 13B model), while the other datasets are mostly stable.

Any observations or advice regarding the randomness are welcome.

haotian-liu avatar Nov 27 '23 16:11 haotian-liu
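
As a general aid for anyone hunting the source of this variance, here is a generic PyTorch determinism checklist (not code from the LLaVA training scripts) that pins the usual sources of randomness:

```python
# Generic PyTorch determinism checklist (not taken from the LLaVA codebase);
# a sketch of the knobs one might pin when hunting run-to-run variance.
import os
import random

import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # cuBLAS workspace config is required for deterministic matmuls on recent CUDA.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Warns (or errors) when a non-deterministic op is hit, which helps locate the culprit.
    torch.use_deterministic_algorithms(True, warn_only=True)

def seed_worker(worker_id: int) -> None:
    # Pass as DataLoader(worker_init_fn=seed_worker) so each data-loading worker is seeded too.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)
```

Even with all of these pinned, fused attention kernels and the cross-GPU reduction order can still be non-deterministic, and that alone may be enough to move MME by a few points between runs.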

@haotian-liu I also cannot reproduce the MMBench dev results using v1_5/finetune13B.sh.

| dev set | dev_overall | dev_attribute_reasoning | dev_coarse_perception | dev_finegrained_perception (cross-instance) | dev_finegrained_perception (instance-level) | dev_logic_reasoning | dev_relation_reasoning |
|---|---|---|---|---|---|---|---|
| llava1.5-13b (paper) | 68.2 | 67.3 | 82.1 | 59.4 | 72 | 44.1 | 60 |
| llava1.5-13b (ours) | 67.26 | 69.65 | 79.53 | 58.62 | 71.38 | 39.16 | 60.869 |

shipengai avatar Nov 28 '23 06:11 shipengai

Hi @Carol-lyh, I also ran finetune.sh with the 665k instruction dataset on 7B, but I have problems reproducing the results on GQA, TextVQA, and MME. My results are 58.2, 57.5, and 1476.2, respectively. Just to check: how did you run the experiment? Was it just by executing finetune.sh?

cathyxl avatar Dec 06 '23 06:12 cathyxl

Hi @Carol-lyh, have you tested MM-Vet? I used VLMEvalKit, and my MM-Vet results are much lower than the reported numbers. [results screenshot attached]

yuangpeng avatar May 07 '24 10:05 yuangpeng

Hi @yuangpeng, may I ask how you obtained the result for MMBench? The instructions suggest submitting the generated results to the evaluation server https://rank.opencompass.org.cn/leaderboard-multimodal, but I couldn't find any submission guidance on the leaderboard page.

I see you submitted the results to https://mmbench.opencompass.org.cn/mmbench-submission in your dreamllm project. However, this server seems to use a different version of the dev set, since I see log messages such as "Index 1222 in your result do not exist in the released data file, thus ignored. Please use our latest released data file."

BaohaoLiao avatar May 08 '24 14:05 BaohaoLiao

Sorry for the long delay in replying. I am currently using https://github.com/open-compass/VLMEvalKit for evaluation.

yuangpeng avatar May 11 '24 08:05 yuangpeng
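
For reference, a VLMEvalKit run is typically driven through its run.py entry point; the sketch below simply wraps that invocation in Python, and the dataset/model identifiers ("MME", "MMBench_DEV_EN", "llava_v1.5_7b") are assumptions that should be checked against the names VLMEvalKit actually registers:

```python
# Rough sketch of driving VLMEvalKit's run.py from Python; the dataset and
# model identifiers are assumptions and should be verified against the names
# registered in VLMEvalKit before use.
import subprocess

datasets = ["MME", "MMBench_DEV_EN"]  # assumed benchmark identifiers
model = "llava_v1.5_7b"               # assumed model identifier

subprocess.run(
    ["python", "run.py", "--data", *datasets, "--model", model],
    check=True,
)
```

Because VLMEvalKit uses its own prompting and answer parsing, its numbers are not guaranteed to match the official LLaVA evaluation scripts exactly, which may explain part of the discrepancies discussed in this thread.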