Unable to reproduce InternVL3.5 GPT-OSS on MMMU
Hi,
Thanks for including InternVL3.5 in the repo. However, I tried to evaluate InternVL3_5-GPT-OSS-20B-A4B-Preview-Thinking on MMMU_DEV_VAL, and the model keeps repeating itself at the end of the response. I've attached the result from this command: python run.py --data MMMU_DEV_VAL --model InternVL3_5-GPT-OSS-20B-A4B-Preview-Thinking --verbose --reuse. I believe some template adaptation may still be needed.
Best, Yu-Cheng
InternVL3_5-GPT-OSS-20B-A4B-Preview-Thinking_MMMU_DEV_VAL_openai_result.xlsx
It seems to be an issue with the xlsx format, which can only store up to 32,767 characters per cell. Therefore, when Thinking mode is activated, the output is easily truncated and the final answer cannot be parsed. We have contacted the OpenCompass Team, and they will fix this issue as soon as possible.
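If you want to check whether truncation is the culprit on your side, here is a rough sketch (it assumes the result xlsx opens with pandas/openpyxl and that the model output lives in a "prediction" column; adjust the column name if your file differs):

import pandas as pd

# xlsx cells are capped at 32,767 characters, so any prediction at (or near)
# that length has almost certainly been cut off mid-generation.
df = pd.read_excel("InternVL3_5-GPT-OSS-20B-A4B-Preview-Thinking_MMMU_DEV_VAL_openai_result.xlsx")
lengths = df["prediction"].astype(str).str.len()
print("predictions hitting the 32,767-character cell limit:", int((lengths >= 32767).sum()))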
Hi Weiyun,
Thanks for pointing out the format issue. However, in my case I ran into the repetition issue where the model kept repeating itself, as mentioned in this issue. I set do_sample=True and temperature=0.6 according to the model card with the previous commit (with chat_template_config), but still encountered this repetition problem.
After pulling the new commits (without chat_template_config), the model refuses to respond with a reasoning trace. Please find the prediction in the attached file.
To conclude, with do_sample=True and temperature=0.6, the model repeats itself with chat_template_config and refuses to produce reasoning traces without chat_template_config. As a result, I cannot reproduce the performance shown in the report, i.e., 59.2 (no reasoning) / 26.2 (truncated) vs. 72.6.
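For completeness, these are the generation settings mentioned above, shown as a plain transformers GenerationConfig only to make the values explicit (illustrative; the actual values are passed through the VLMEvalKit model config):

from transformers import GenerationConfig

# Sampling settings taken from the model card; everything else left at defaults.
gen_cfg = GenerationConfig(do_sample=True, temperature=0.6)
print(gen_cfg)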
It would be great if you could share your inference setting (transformers/vllm/lmdeploy) and the corresponding code. Thank you so much for your help.
Best, Yu-Cheng
InternVL3_5-GPT-OSS-20B-A4B-Preview-Thinking_MMMU_DEV_VAL_openai_result.xlsx
Thank you for taking the time to share your suggestion. We’ve addressed the issue in PR 1294.
Now, setting SPLIT_THINK=True and PRED_FORMAT=tsv will lead to correct results.
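Conceptually, the splitting separates the reasoning trace from the final answer at the </think> tag, roughly like the sketch below (illustrative only, not the exact VLMEvalKit implementation):

def split_think(prediction: str) -> tuple[str, str]:
    # Split a thinking-mode output into (reasoning, answer) at the last </think> tag.
    # If the tag is missing (e.g. the output was truncated), treat the whole
    # string as answer text.
    if "</think>" in prediction:
        reasoning, _, answer = prediction.rpartition("</think>")
        return reasoning.strip(), answer.strip()
    return "", prediction.strip()

reasoning, answer = split_think("<think>some reasoning</think>The answer is B.")
print(answer)  # -> The answer is B.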
Hi @PhoenixZ810,
I’m still experiencing the repetition issue with the InternVL3.5 models in thinking mode, even after updating to the latest commit (cfa63b1df9d861313cc444af676488b6f1445222).
Here’s the setup I’m using:
export USE_COT=1
export SPLIT_THINK=True
export PRED_FORMAT=tsv
python3 run.py --data MMMU_DEV_VAL --model InternVL3_5-1B --verbose
python3 run.py --data MMMU_DEV_VAL --model InternVL3_5-1B-Thinking --verbose
Could you take another look when you get a chance? Thanks!
Hi, please use the config InternVL3_5-1B-Thinking, which applies cot_prompt_version="r1" for the MMMU benchmark.
Hi @PhoenixZ810, with InternVL3_5-1B-Thinking, I'm still seeing the repetition issue. Could you check it again? Thanks!
https://github.com/open-compass/VLMEvalKit/blob/main/vlmeval/config.py#L996-L999
"InternVL3_5-1B-Thinking": partial(
InternVLChat, model_path="OpenGVLab/InternVL3_5-1B", use_lmdeploy=True,
max_new_tokens=2**16, cot_prompt_version="r1", do_sample=True, version="V2.0"
),
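A quick way to confirm which kwargs this registry entry actually resolves to (assuming supported_VLM is the model registry defined in vlmeval/config.py, as in the lines linked above):

from vlmeval.config import supported_VLM

# Inspect the partial behind the "InternVL3_5-1B-Thinking" entry to confirm the
# kwargs it will pass to InternVLChat (cot_prompt_version, do_sample, max_new_tokens, ...).
entry = supported_VLM["InternVL3_5-1B-Thinking"]
print(entry.func.__name__)   # expected: InternVLChat
print(entry.keywords)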
We verified InternVL3.5 using the latest main branch of VLMEvalKit (commit: cfa63b1df9d861313cc444af676488b6f1445222). We confirmed that with Thinking mode enabled, the reasoning benchmark results match those reported in our paper.
If you encounter a small amount of repetition during reasoning, this is normal. However, if you see large amounts of repetition, it may indicate an issue with the runtime environment or other configurations. By the way, for the 1B model, because it is very small, repetition is unavoidable in long-sequence output settings. This is a known issue, and we recommend using larger models such as 4B or 8B for more stable generation.
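If it helps to quantify "large amounts of repetition", a rough heuristic (purely illustrative, not part of VLMEvalKit) is to measure how many word-level n-grams in a prediction duplicate an earlier one:

def repetition_score(text: str, n: int = 8) -> float:
    # ~0 for normal text, close to 1 when the output is the same phrase
    # repeated over and over.
    words = text.split()
    if len(words) <= n:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)

print(repetition_score("The answer is B. " * 50))                      # ~0.98, pathological
print(repetition_score("The diagram shows a simple series circuit."))  # 0.0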
Feel free to contact me if you still have more questions.
Hi Weiyun,
Thanks for your updates. I have tried evaluating InternVL3_5-30B-A3B-Thinking and InternVL3_5-GPT-OSS-20B-A4B-Preview-Thinking using the latest commit you mentioned. However, I can only reproduce InternVL3_5-30B-A3B-Thinking, with 75.1 accuracy on MMMU Val; InternVL3_5-GPT-OSS-20B-A4B-Preview-Thinking only achieves 30.1 accuracy on MMMU Val. After inspecting the responses, I found three major issues:
- Repetition of the reasoning process, causing no answer prediction.
- Incorrect thinking format with no </think> tag to parse the answer, causing Match log: Z.
- Answering without matching the options, causing Match log: Z.
Even though I removed these cases (repetition and Match log: Z) and only evaluated the filtered responses (396 remaining), InternVL3_5-GPT-OSS-20B-A4B-Preview-Thinking only achieves 54.0 accuracy.
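A minimal sketch of this kind of filtering (the file name and the "prediction" / "match_log" column names are illustrative, not necessarily the exact TSV headers):

import pandas as pd

# Drop rows with no parsed answer ("Match log: Z") and obviously runaway
# outputs, then keep only the remaining rows for re-scoring.
df = pd.read_csv("InternVL3_5-GPT-OSS-20B-A4B-Preview-Thinking_MMMU_DEV_VAL.tsv", sep="\t")

has_answer = df["match_log"].astype(str).str.strip() != "Z"
not_repetitive = df["prediction"].astype(str).str.len() < 20_000  # crude proxy for repetition
filtered = df[has_answer & not_repetitive]
print(len(filtered), "responses kept for re-scoring")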
I used the following commands to run the evaluation:
SPLIT_THINK=True PRED_FORMAT=tsv python run.py --data MMMU_DEV_VAL --model InternVL3_5-GPT-OSS-20B-A4B-Preview-Thinking --verbose --reuse
SPLIT_THINK=True PRED_FORMAT=tsv python run.py --data MMMU_DEV_VAL --model InternVL3_5-30B-A3B-Thinking --verbose --reuse
Best, Yu-Cheng
Thanks for your great work on VLMEvalKit and the powerful InternVL models.
I've been trying to evaluate the InternVL3_5-38B model on the MMMU_DEV_VAL benchmark. I've been following the discussions in issue #1223 and have tried to use the recommended settings.
Here is the exact command I used for the evaluation:
export USE_COT=1
export SPLIT_THINK=True
export PRED_FORMAT=tsv
python3 run.py --data MMMU_DEV_VAL --model InternVL3_5-38B --verbose
Since my server cannot access the internet, I did not use GPT for answer verification; the final accuracy I'm getting is only 25%.
InternVL3_5-38B-Thinking_MMMU_DEV_VAL.tsv
I am attaching the resulting TSV file, which contains the detailed model predictions, for your review. Could you please take a look at my setup and the results? I would be very grateful for any advice on whether there's an issue with my command, environment, or other specific configurations required for the InternVL3_5-38B model to achieve the reported performance on MMMU.
Thank you for your time and help!