A bug in qwen3vl
Thanks to the authors for their great work. I found that the new version of the qwen3vl code seems to have a bug: in qwen3_vl/model.py line 81, the generation config is set without the "do_sample" parameter, which seems to cause a conflict.
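To illustrate what I mean, here is a minimal sketch (not the repo's code, and assuming the model wraps a Hugging Face transformers model): when sampling parameters such as temperature/top_p are set, do_sample should be set explicitly as well, otherwise transformers warns that those parameters conflict with greedy decoding and ignores them.

```python
from transformers import GenerationConfig

# Minimal sketch, not the repo's code: if temperature/top_p are set while
# do_sample stays at its default (False), transformers warns that the
# sampling parameters are ignored because decoding is greedy.
gen_config = GenerationConfig(
    do_sample=True,   # the flag that appears to be missing around qwen3_vl/model.py l81
    temperature=0.7,  # illustrative values, not necessarily the repo's defaults
    top_p=0.8,
)
# then passed to generation, e.g. model.generate(**inputs, generation_config=gen_config)
```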
Is there any evaluation code for Qwen3-VL? I tried searching on GitHub, but it's hard to find runnable code that can be used directly. The official evaluation code in the Qwen3-VL repo is not complete enough to reproduce the results in the report.
I just used the latest VLMEvalKit code to evaluate. And yes, that's true: I can't reproduce the results either.
The results I reproduced by evaluating the original qwen3vl-8B-Instruct weights are lower than those in the paper.
Hello~ how many GPUs do you use to run qwen3vl-8B-Instruct? I use 8 A100 GPUs, but even qwen3vl-4B-Instruct gives me an OOM error.
Hi, I use 8 A800 GPUs to deploy and SFT qwen3vl-8B-Instruct, and it works well.
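For reference, here is a minimal loading sketch (the Hugging Face model id is an assumption, replace it with your local checkpoint path if it differs, and it requires a transformers version that supports Qwen3-VL through the Auto classes). Loading in bfloat16 with device_map="auto" shards the weights across all visible GPUs, so the 8B checkpoint should fit comfortably on 8 A100/A800 cards for inference; OOM usually comes from loading in the default fp32, placing the whole model on a single GPU, or feeding very high-resolution images.

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# A minimal sketch, assuming the checkpoint loads through the HF Auto classes.
# The model id below is an assumption; point it at your local path if needed.
model_id = "Qwen/Qwen3-VL-8B-Instruct"

model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # roughly half the memory of the default fp32 load
    device_map="auto",           # shard layers across all visible GPUs
)
processor = AutoProcessor.from_pretrained(model_id)
```

For SFT, enabling gradient checkpointing and lowering the per-GPU batch size usually helps if memory is still tight.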
Hello! Could you please specify which benchmark scores are inconsistent? How large is the gap between the results measured with VLMEvalKit and those officially reported for Qwen3-VL? We are working hard to align the evaluation settings. Thank you for your time!
Thanks for your fast reply. Four of our results are quite low: MMMU 64.1 (-11.9), MMStar 71.5 (-6.2), AI2D 85.4 (-4.1), MathVista 80.3 (-3.5). And one result is much higher than reported: Hallusion 74.3 (+10.5).