
[Question] Cannot reproduce MME results on LLaVA-1.5-7B

Open Carol-lyh opened this issue 1 year ago • 8 comments

Question

I cannot reproduce the MME results after fine-tuning with finetune.sh on the 665k instruction-tuning dataset and running the provided MME evaluation scripts. We followed all of the settings but only get 1457.7, which is a large gap from the 1510.7 reported in the paper. The evaluation results on the other datasets seem reasonable (except that the ScienceQA result is noticeably higher).

Here are the results:

| exp   | GQA  | ScienceQA | TextVQA | POPE | MME    |
|-------|------|-----------|---------|------|--------|
| paper | 62.0 | 66.8      | 58.2    | 85.9 | 1510.7 |
| ours  | 62.6 | 70.8      | 58.3    | 85.8 | 1457.7 |

Carol-lyh avatar Oct 27 '23 06:10 Carol-lyh
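
For readers unfamiliar with how the MME number is computed: the 1510.7 / 1457.7 figures are MME perception scores, where each perception subtask contributes its accuracy plus "accuracy+" (the percentage of images for which both paired questions are answered correctly), summed across the perception subtasks. Below is a minimal sketch of that aggregation, with purely hypothetical subtask names and values:

```python
# Minimal sketch of how an MME-style perception score aggregates, assuming
# per-subtask accuracy and accuracy+ (both-questions-correct) percentages
# are already computed. Subtask names and numbers below are hypothetical.
def mme_perception_score(subtask_scores: dict[str, tuple[float, float]]) -> float:
    # Each subtask contributes acc + acc+, i.e. up to 200 points,
    # so ten perception subtasks give a maximum of 2000.
    return sum(acc + acc_plus for acc, acc_plus in subtask_scores.values())

example = {
    "existence": (95.0, 90.0),   # hypothetical (acc, acc+) in percent
    "count":     (80.0, 60.0),
    "position":  (70.0, 50.0),
    # ... remaining perception subtasks
}

print(f"perception score: {mme_perception_score(example):.1f}")
```

On this scale, the ~53-point gap spread over the perception subtasks amounts to only a few percentage points of accuracy per subtask metric, which is why relatively small run-to-run differences can move the total noticeably.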

Hi @Carol-lyh,

I am facing the same issue. Have you figured it out?

yix-chen avatar Oct 31 '23 05:10 yix-chen

I am also facing the same issue here. Has anyone managed to match the reported score?

becxer avatar Nov 25 '23 06:11 becxer

This may be due to some unexpected randomness in distributed training (https://github.com/haotian-liu/LLaVA/issues/864), although we haven't figured out where the randomness comes from -- the data mixture order is verified to be the same across runs, and there should not be any randomly initialized weights when starting from a pretrained projector.

This observed randomness leads to fluctuations on some benchmarks -- MME is the most prominent (I can get +/- 20 around the reported 1510 for the 7B model, and similarly for the 13B model), while the other datasets are mostly stable.

Any observations or advice regarding the randomness are welcome.

haotian-liu avatar Nov 27 '23 16:11 haotian-liu
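
As a general aid for anyone hunting the source of this variance, here is a generic PyTorch determinism checklist (not code from the LLaVA training scripts) that pins the usual sources of randomness:

```python
# Generic PyTorch determinism checklist (not taken from the LLaVA codebase);
# a sketch of the knobs one might pin when hunting run-to-run variance.
import os
import random

import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # cuBLAS workspace config is required for deterministic matmuls on recent CUDA.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Warns (or errors) when a non-deterministic op is hit, which helps locate the culprit.
    torch.use_deterministic_algorithms(True, warn_only=True)

def seed_worker(worker_id: int) -> None:
    # Pass as DataLoader(worker_init_fn=seed_worker) so each data-loading worker is seeded too.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)
```

Even with all of these pinned, fused attention kernels and the cross-GPU reduction order can still be non-deterministic, and that alone may be enough to move MME by a few points between runs.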

@haotian-liu I also cannot reproduce the MMBench dev results using v1_5/finetune13B.sh.

| dev set | dev_overall | dev_attribute_reasoning | dev_coarse_perception | dev_finegrained_perception (cross-instance) | dev_finegrained_perception (instance-level) | dev_logic_reasoning | dev_relation_reasoning |
|---|---|---|---|---|---|---|---|
| llava1.5-13b (paper) | 68.2 | 67.3 | 82.1 | 59.4 | 72 | 44.1 | 60 |
| llava1.5-13b (ours) | 67.26 | 69.65 | 79.53 | 58.62 | 71.38 | 39.16 | 60.869 |

shipengai avatar Nov 28 '23 06:11 shipengai

Hi @Carol-lyh, I also ran finetune.sh with the 665k instruction dataset on 7B, but I have problems reproducing the results on GQA, TextVQA, and MME. My results are 58.2, 57.5, and 1476.2, respectively. Just to check: how did you run the experiment? Was it just by executing finetune.sh?

cathyxl avatar Dec 06 '23 06:12 cathyxl

Hi @Carol-lyh, have you tested MM-Vet? I used VLMEvalKit, and my MM-Vet results are much lower than the reported numbers. [results screenshot attached]

yuangpeng avatar May 07 '24 10:05 yuangpeng

Hi @yuangpeng, may I ask how you obtained the result for MMBench? The instructions suggest submitting the generated results to the evaluation server https://rank.opencompass.org.cn/leaderboard-multimodal, but I couldn't find any submission guidance on the leaderboard page.

I see you submitted the results to https://mmbench.opencompass.org.cn/mmbench-submission in your dreamllm project. However, this server seems to use a different version of the dev set, since I see log messages such as "Index 1222 in your result do not exist in the released data file, thus ignored. Please use our latest released data file."

BaohaoLiao avatar May 08 '24 14:05 BaohaoLiao

Sorry for the long delay in replying. I am currently using https://github.com/open-compass/VLMEvalKit for evaluation.

yuangpeng avatar May 11 '24 08:05 yuangpeng
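
For reference, a VLMEvalKit run is typically driven through its run.py entry point; the sketch below simply wraps that invocation in Python, and the dataset/model identifiers ("MME", "MMBench_DEV_EN", "llava_v1.5_7b") are assumptions that should be checked against the names VLMEvalKit actually registers:

```python
# Rough sketch of driving VLMEvalKit's run.py from Python; the dataset and
# model identifiers are assumptions and should be verified against the names
# registered in VLMEvalKit before use.
import subprocess

datasets = ["MME", "MMBench_DEV_EN"]  # assumed benchmark identifiers
model = "llava_v1.5_7b"               # assumed model identifier

subprocess.run(
    ["python", "run.py", "--data", *datasets, "--model", model],
    check=True,
)
```

Because VLMEvalKit uses its own prompting and answer parsing, its numbers are not guaranteed to match the official LLaVA evaluation scripts exactly, which may explain part of the discrepancies discussed in this thread.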