VLMEvalKit: A bug in qwen3vl

Thanks to the authors for their great work on this project. I found what seems to be a bug in the new version of the qwen3vl code. At line 81 of qwen3_vl/model.py, the generation config does not include the “do_sample” parameter when it is set. This seems likely to cause a conflict.
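One possible reading of the conflict, sketched below: in Hugging Face transformers, `do_sample` defaults to `False`, and sampling parameters such as `temperature` or `top_p` only take effect when `do_sample=True`. The helper `find_sampling_conflicts` and the values in the dictionaries are hypothetical and only illustrate the check; they are not taken from line 81 of qwen3_vl/model.py, and this assumes the transformers backend is the one in use.

```python
# Hypothetical helper for illustration; not part of VLMEvalKit or transformers.
SAMPLING_KEYS = ("temperature", "top_p", "top_k")


def find_sampling_conflicts(gen_kwargs):
    """Return sampling-only keys that are set while do_sample is missing or False."""
    if gen_kwargs.get("do_sample", False):
        return []  # sampling is explicitly enabled, so these keys are consistent
    return [key for key in SAMPLING_KEYS if gen_kwargs.get(key) is not None]


# Illustrative values only (assumption), not the ones used in the repository.
without_do_sample = {"max_new_tokens": 128, "temperature": 0.7, "top_p": 0.8}
print(find_sampling_conflicts(without_do_sample))

# Adding do_sample=True makes the intent explicit and removes the mismatch.
with_do_sample = dict(without_do_sample, do_sample=True)
print(find_sampling_conflicts(with_do_sample))
```

If the config really does set sampling parameters without `do_sample`, decoding would fall back to greedy search and those parameters would be ignored, which could also shift benchmark scores.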

Oct 17 '25 15:10 hanzifan

Is there any evaluation code for Qwen3-VL? I tried searching GitHub, but it is difficult to find runnable code that we can use directly. The official evaluation code in the Qwen3-VL repo is not complete enough to reproduce the results in the report.

Oct 19 '25 07:10 sglucas

I just used the latest VLMEvalKit code for the evaluation. And yes, that's true: I can't reproduce the results either.

Oct 20 '25 05:10 hanzifan

The results I get when testing the original qwen3vl-8B-Instruct weights are lower than those in the paper.

Oct 20 '25 05:10 hanzifan

Hello~ How many GPUs do you use to run qwen3vl-8B-Instruct? I use 8 A100 GPUs, but even qwen3vl-4B-Instruct fails with an OOM error.

Oct 21 '25 09:10 jmq2025

Hi, I use 8 A800 GPUs for deploying and SFT of qwen3vl-8B-Instruct, and it works well.

Oct 21 '25 09:10 hanzifan

Hello! Could you please specify which benchmark scores are inconsistent? How large is the gap between the results measured with VLMEvalKit and those officially reported for Qwen3-VL? We are working hard to align the evaluation settings. Thank you for your time!

Oct 23 '25 04:10 mjuicem

Thanks for your fast reply. Four of the results we produced are quite low: MMMU: 64.1 (-11.9), MMStar: 71.5 (-6.2), AI2D: 85.4 (-4.1), MathVista: 80.3 (-3.5). One result is much higher: Hallusion: 74.3 (+10.5).
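For reference, a small sketch of the arithmetic behind these numbers, assuming each value in parentheses is our measured score minus the officially reported score. The function name `implied_reported` is just illustrative, and the reported scores it derives are implied by that assumption rather than copied from the report.

```python
# Measured score and gap as posted above.
# Assumption: gap = measured - officially reported.
results = {
    "MMMU": (64.1, -11.9),
    "MMStar": (71.5, -6.2),
    "AI2D": (85.4, -4.1),
    "MathVista": (80.3, -3.5),
    "Hallusion": (74.3, 10.5),
}


def implied_reported(measured, gap):
    """Recover the reported score implied by a measured score and its gap."""
    # Round to one decimal to avoid floating-point noise in the subtraction.
    return round(measured - gap, 1)


for name, (measured, gap) in results.items():
    reported = implied_reported(measured, gap)
    print(f"{name}: measured {measured}, implied reported {reported}, gap {gap:+.1f}")
```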

Oct 24 '25 07:10 hanzifan