Qwen3-VL co-ordinate and bounding box errors (grounding errors)
Hi, Qwen3-VL bounding boxes and co-ordinates appear to be incorrect in both 4B (no co-ordinates at all) and 8B (poor localisation). This occurs even in the FP16 versions of these models, so it is not quantisation related.
I can see that when convert_hf_to_gguf.py is run, the non-vision layers of the vision tower are removed. I'm not sure if this is the cause of the problem.
This does not occur in Hugging Face transformers, even for the same base model quantised to 4 bits.
The problem is not isolated to the Python API; it also occurs in llama-mtmd-cli.exe.
See also: https://github.com/JamePeng/llama-cpp-python/issues/20
Coordinates in qwen3vl are relative to a 1000x1000 grid. You need to rescale them back to the original image size.
See #16880
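For reference, a minimal sketch of that rescaling, assuming the model emits boxes as (x1, y1, x2, y2) on a 1000x1000 grid. The function name `rescale_box` and the example image size are hypothetical and only for illustration.

```python
def rescale_box(box, width, height, grid=1000):
    """Map an (x1, y1, x2, y2) box from a grid x grid frame to pixel coordinates.

    Assumption: the box uses the 1000x1000 relative frame described above.
    """
    x1, y1, x2, y2 = box
    # Multiply before dividing so simple inputs stay exact in floating point.
    return (
        x1 * width / grid,
        y1 * height / grid,
        x2 * width / grid,
        y2 * height / grid,
    )


# Illustrative values only: a box on the 1000-grid mapped onto a 1920x1080 image.
print(rescale_box((250, 500, 750, 1000), width=1920, height=1080))
# -> (480.0, 540.0, 1440.0, 1080.0)
```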
@ayayakirara I'm already scaling the output coordinates from the 1000x1000 coordinate frame, and I already have the Hugging Face transformers model performing perfectly. Please look carefully at the coordinate results. The problem is that the results are of approximately the right magnitude but have poor accuracy in 8B (or there are no coordinates at all for 4B).
I did follow the breadcrumbs you gave and found https://github.com/ggml-org/llama.cpp/pull/16878 and https://github.com/ggml-org/llama.cpp/issues/13694
So it looks like the problem is the clip.cpp implementation, which has already been fixed in some branches. I can see they have used the same model I have and are getting correct coordinates. In particular, clip resizing to min only seems to be one issue which would cause the inaccuracies.
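One way to put a number on "right magnitude but poor accuracy" is to compare each llama.cpp box against the corresponding Hugging Face transformers box using intersection over union (IoU). The sketch below assumes both boxes are already in the same pixel frame as (x1, y1, x2, y2); the function name `iou` and the sample boxes are made up for illustration and are not results from this issue.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes in the same frame."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap, clamped to zero when the boxes are disjoint.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical boxes: a reference box and a box shifted sideways by half its width.
reference = (0, 0, 10, 10)
shifted = (5, 0, 15, 10)
print(iou(reference, reference))  # identical boxes give 1.0
print(iou(reference, shifted))    # partial overlap gives a value between 0 and 1
```

An IoU close to 1.0 means the two backends agree on the box; consistently lower values across many images would point at a systematic preprocessing difference rather than random noise.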
Do you have STR (steps to reproduce) and the exact commit / version of llama.cpp you are using? The instructions for how to get the llama.cpp version/commit are in the Eval Bug issue template.