InternVL Visual Grounding Results

I have tried visual grounding for InterVL2.5-8B vs. Qwen2.5-VL-7B not using refCOCO but another referring det dataset and I always found Qwen2.5-VL performance is almost 2x better. I am wondering if I am donig something wrong or did you close-source the original weights for InternVL2.5 that had better grounding ability.

I followed this issue here and I was able to get initial grounding results: https://github.com/OpenGVLab/InternVL/issues/359.

However, I faced two major issues: 1- The model always predicts one bounding box; it cant predict more than one box per image to capture all the instances and I am not sure why this is the case.

2- I have tried to modify it to predict the bounding box per block, where anyway the image gets divided into these blocks + the thumbnail image in the preprocessing of the input. I notice results get better, but it is quite sensitive to how the blocks are divided, which are quite tied to the aspect ratio of the original image in your code.

These issues don't occur with Qwe2.5-VL model so I am not sure how is it able to work better when is it only predicting one box per generation/image. Any suggestion on how to get InternVL2.5 to work correctly on grounding reflecting its true performance that shows on refCOCO better perf. than Qw2.5-VL would be appreciated. In fact if you can create a tutorial code for visual grounding akin to wha Qwen2.5-VL did on a public image beyond refCOCO that would be the best I can test it directly to ensure its not sth on my side. Thx

Note I also tried the demo it doesnt work like before, when InternVL3-78B was just released the grounding ability was quite top notch I dont know what happened but I am attaching here it keep giving me this: "I can't draw directly on images, but you can create bounding boxes using an image editing tool. Here's how you can do it: ..."

Jul 28 '25 08:07 MSiam

we need Visual Grounding demo

Aug 29 '25 08:08 ZanePoe

Thank you for your interest in our work. Please refer to our evaluation scripts for how to deploy our model in the visual grounding task. Since our training data for this task mainly follows a specific format, other general prompts may degrade model performance or fail to activate bounding box generation.

Aug 30 '25 03:08 Weiyun1025

Thanks for your response. I actually tried what was reported in the issue I linked to in my post:

"Please provide the bounding box coordinate of the region this sentence describes: <ref>XXX</ref>"

I still didn't get good results as I anticipated, but I am suspecting the model is quite sensitive to the image aspect ratio. Aside from the sensitivity to the prompt which is a universal issue across grounding models.

I can try a simple example image with simple inference code. I will add it here, and I can provide you InternVL2.5 & 3 output on it vs. Qwen2.5-VL + the code I used building on your own eval code. So you can let me know if there are any advice on how to improve the results. I will post these soon here. Thx

Aug 30 '25 05:08 MSiam

I am adding here a gist that I used to perform inference with InterVL2.5 and 3: https://gist.github.com/MSiam/735ab7bad8b4b232ed858e8ecec5a1f8. You will also find here, https://drive.google.com/drive/folders/1KkieYGk81EiReXwM3cxjYcFv-HpyZFoi?usp=sharing, the example I used with expression "moving elephants" and I also tried "elephants" the output from Qwen2.5VL, InternVL2.5 and InternVL3 are added as well. It also has other examples just to confirm the issue, like Cakes.png example with expression "cup cakes". The later didnt work at all it gave wrong response as "'cup cakes[[61, 346, 9 870, 820]]'". Any suggestion on why its always predicting one bounding box and sometimes even wrong format? Thanks.

Sep 12 '25 16:09 MSiam