Unable to reproduce InternVL3.5 GPT-OSS on MMMU
Hi,
Thanks for including InternVL3.5 in the repo. However, I tried to evaluate InternVL3_5-GPT-OSS-20B-A4B-Preview-Thinking on MMMU_DEV_VAL, and the model keeps repeating itself at the end of the response. I've attached the result from this command: python run.py --data MMMU_DEV_VAL --model InternVL3_5-GPT-OSS-20B-A4B-Preview-Thinking --verbose --reuse. I believe some template adaptation may still be needed.
Best, Yu-Cheng
InternVL3_5-GPT-OSS-20B-A4B-Preview-Thinking_MMMU_DEV_VAL_openai_result.xlsx
It seems to be an issue with the xlsx format, which can only store up to 32,767 characters per cell. Therefore, when Thinking mode is activated, the output is easily truncated and the final answer cannot be parsed. We have contacted the OpenCompass Team, and they will fix this issue as soon as possible.
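If you want to check whether truncation is the culprit on your side, here is a rough sketch (it assumes the result xlsx opens with pandas/openpyxl and that the model output lives in a "prediction" column; adjust the column name if your file differs):

import pandas as pd

# xlsx cells are capped at 32,767 characters, so any prediction at (or near)
# that length has almost certainly been cut off mid-generation.
df = pd.read_excel("InternVL3_5-GPT-OSS-20B-A4B-Preview-Thinking_MMMU_DEV_VAL_openai_result.xlsx")
lengths = df["prediction"].astype(str).str.len()
print("predictions hitting the 32,767-character cell limit:", int((lengths >= 32767).sum()))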
Hi Weiyun,
Thanks for pointing out the format issue. However, in my case I ran into the repetition issue where the model kept repeating itself, as mentioned in this issue. I set do_sample=True and temperature=0.6 according to the model card with the previous commit (with chat_template_config), but still encountered this repetition problem.
After pulling the new commits (without chat_template_config), the model refuses to respond with a reasoning trace. Please find the prediction in the attached file.
To conclude, with do_sample=True and temperature=0.6, the model repeats itself with chat_template_config and refuses to produce reasoning traces without chat_template_config. As a result, I cannot reproduce the performance shown in the report, i.e., 59.2 (no reasoning) / 26.2 (truncated) vs. 72.6.
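For completeness, these are the generation settings mentioned above, shown as a plain transformers GenerationConfig only to make the values explicit (illustrative; the actual values are passed through the VLMEvalKit model config):

from transformers import GenerationConfig

# Sampling settings taken from the model card; everything else left at defaults.
gen_cfg = GenerationConfig(do_sample=True, temperature=0.6)
print(gen_cfg)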
It would be great if you could share your inference setting (transformers/vllm/lmdeploy) and the corresponding code. Thank you so much for your help.
Best, Yu-Cheng
InternVL3_5-GPT-OSS-20B-A4B-Preview-Thinking_MMMU_DEV_VAL_openai_result.xlsx
Thank you for taking the time to share your suggestion. We’ve addressed the issue in PR 1294.
Now, setting SPLIT_THINK=True and PRED_FORMAT=tsv will lead to correct results.
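Conceptually, the splitting separates the reasoning trace from the final answer at the </think> tag, roughly like the sketch below (illustrative only, not the exact VLMEvalKit implementation):

def split_think(prediction: str) -> tuple[str, str]:
    # Split a thinking-mode output into (reasoning, answer) at the last </think> tag.
    # If the tag is missing (e.g. the output was truncated), treat the whole
    # string as answer text.
    if "</think>" in prediction:
        reasoning, _, answer = prediction.rpartition("</think>")
        return reasoning.strip(), answer.strip()
    return "", prediction.strip()

reasoning, answer = split_think("<think>some reasoning</think>The answer is B.")
print(answer)  # -> The answer is B.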
Hi @PhoenixZ810,
I’m still experiencing the repetition issue with the InternVL3.5 models in thinking mode, even after updating to the latest commit (cfa63b1df9d861313cc444af676488b6f1445222).
Here’s the setup I’m using:
export USE_COT=1
export SPLIT_THINK=True
export PRED_FORMAT=tsv
python3 run.py --data MMMU_DEV_VAL --model InternVL3_5-1B --verbose
python3 run.py --data MMMU_DEV_VAL --model InternVL3_5-1B-Thinking --verbose
Could you take another look when you get a chance? Thanks!
Hi, please use the config InternVL3_5-1B-Thinking, which applies cot_prompt_version="r1" for the MMMU benchmark.
Hi @PhoenixZ810, with InternVL3_5-1B-Thinking, I'm still seeing the repetition issue. Could you check it again? Thanks!
https://github.com/open-compass/VLMEvalKit/blob/main/vlmeval/config.py#L996-L999
"InternVL3_5-1B-Thinking": partial(
InternVLChat, model_path="OpenGVLab/InternVL3_5-1B", use_lmdeploy=True,
max_new_tokens=2**16, cot_prompt_version="r1", do_sample=True, version="V2.0"
),
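A quick way to confirm which kwargs this registry entry actually resolves to (assuming supported_VLM is the model registry defined in vlmeval/config.py, as in the lines linked above):

from vlmeval.config import supported_VLM

# Inspect the partial behind the "InternVL3_5-1B-Thinking" entry to confirm the
# kwargs it will pass to InternVLChat (cot_prompt_version, do_sample, max_new_tokens, ...).
entry = supported_VLM["InternVL3_5-1B-Thinking"]
print(entry.func.__name__)   # expected: InternVLChat
print(entry.keywords)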
We verified InternVL3.5 using the latest main branch of VLMEvalKit (commit: cfa63b1df9d861313cc444af676488b6f1445222). We confirmed that with Thinking mode enabled, the reasoning benchmark results match those reported in our paper.
If you encounter a small amount of repetition during reasoning, this is normal. However, if you see large amounts of repetition, it may indicate an issue with the runtime environment or other configurations. By the way, for the 1B model, because it is very small, repetition is unavoidable in long-sequence output settings. This is a known issue, and we recommend using larger models such as 4B or 8B for more stable generation.
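If it helps to quantify "large amounts of repetition", a rough heuristic (purely illustrative, not part of VLMEvalKit) is to measure how many word-level n-grams in a prediction duplicate an earlier one:

def repetition_score(text: str, n: int = 8) -> float:
    # ~0 for normal text, close to 1 when the output is the same phrase
    # repeated over and over.
    words = text.split()
    if len(words) <= n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)

print(repetition_score("The answer is B. " * 50))                      # ~0.98, pathological
print(repetition_score("The diagram shows a simple series circuit."))  # 0.0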
Feel free to contact me if you still have more questions.
Hi Weiyun,
Thanks for your updates. I have tried evaluating InternVL3_5-30B-A3B-Thinking and InternVL3_5-GPT-OSS-20B-A4B-Preview-Thinking using the latest commit you mentioned. However, I can only reproduce InternVL3_5-30B-A3B-Thinking, with 75.1 accuracy on MMMU Val; InternVL3_5-GPT-OSS-20B-A4B-Preview-Thinking only achieves 30.1 accuracy on MMMU Val. After inspecting the responses, I found three major issues:
- Repetition of the reasoning process, causing no answer prediction.
- Incorrect thinking format with no </think> tag to parse the answer, causing Match log: Z.
- Answering without matching the options, causing Match log: Z.
Even though I removed these cases (repetition and Match log: Z) and only evaluated the filtered responses (396 remaining), InternVL3_5-GPT-OSS-20B-A4B-Preview-Thinking only achieves 54.0 accuracy.
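A minimal sketch of this kind of filtering (the file name and the "prediction" / "match_log" column names are illustrative, not necessarily the exact TSV headers):

import pandas as pd

# Drop rows with no parsed answer ("Match log: Z") and obviously runaway
# outputs, then keep only the remaining rows for re-scoring.
df = pd.read_csv("InternVL3_5-GPT-OSS-20B-A4B-Preview-Thinking_MMMU_DEV_VAL.tsv", sep="\t")

has_answer = df["match_log"].astype(str).str.strip() != "Z"
not_repetitive = df["prediction"].astype(str).str.len() < 20_000  # crude proxy for repetition
filtered = df[has_answer & not_repetitive]
print(len(filtered), "responses kept for re-scoring")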
I used the following commands to run the evaluation:
SPLIT_THINK=True PRED_FORMAT=tsv python run.py --data MMMU_DEV_VAL --model InternVL3_5-GPT-OSS-20B-A4B-Preview-Thinking --verbose --reuse
SPLIT_THINK=True PRED_FORMAT=tsv python run.py --data MMMU_DEV_VAL --model InternVL3_5-30B-A3B-Thinking --verbose --reuse
Best, Yu-Cheng
Thanks for your great work on VLMEvalKit and the powerful InternVL models.
I've been trying to evaluate the InternVL3_5-38B model on the MMMU_DEV_VAL benchmark. I've been following the discussions in issue #1223 and have tried to use the recommended settings.
Here is the exact command I used for the evaluation:
export USE_COT=1
export SPLIT_THINK=True
export PRED_FORMAT=tsv
python3 run.py --data MMMU_DEV_VAL --model InternVL3_5-38B --verbose
Since my server cannot access the internet, I did not use GPT for answer verification; the final accuracy I'm getting is only 25%.
InternVL3_5-38B-Thinking_MMMU_DEV_VAL.tsv
I am attaching the resulting TSV file, which contains the detailed model predictions, for your review. Could you please take a look at my setup and the results? I would be very grateful for any advice on whether there's an issue with my command, environment, or other specific configurations required for the InternVL3_5-38B model to achieve the reported performance on MMMU.
Thank you for your time and help!