How to use the grounding ability with multiple objects and categories?
I have tried to make InternVL2.5-8B output the bounding boxes of the elements in an image, but it doesn't work. My prompt is
<image>\nPlease detect and provide all the bounding box of <ref>car</ref>,<ref>truck</ref>,<ref>van</ref>,<ref>bus</ref>,
<ref>pedestrian</ref>,<ref>cyclist</ref>,<ref>tricyclist</ref>,<ref>motorcyclist</ref> in the following image.
The answer looks like this:
User: <image>
Please detect and label all <ref>car</ref>,<ref>truck</ref>,<ref>van</ref>,<ref>bus</ref>,
<ref>pedestrian</ref>,<ref>cyclist</ref>,<ref>tricyclist</ref>,<ref>motorcyclist</ref> in the following image and mark their positions.
Assistant: car[[75, 658, 250, 871], [286, 558, 444, 711]]
truck[[440, 208, 502, 255]]
van[[446, 240, 527, 311]]
bus[[500, 244, 616, 363]]
pedestrian[[0, 1000, 999, 998]]
So I tried the prompt shown in https://internvl.readthedocs.io/en/latest/get_started/chat_data_format.html#grounding-detection-data
<image>\nPlease provide the bounding box coordinate of <ref>car</ref> in the image.
Then the answer looks like this:
User: <image>
Please provide the bounding box coordinate of <ref>car</ref> in the image.
Assistant: car[[561, 660, 695, 1000]]
Questions/Concerns:
Is there a recommended prompt format or additional instructions for detecting multiple objects at once in the InternVL documentation? How should I structure the prompt, or do I have to fine-tune the model myself?
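In the meantime, since the single-category prompt does work, a workaround I'm considering is to query the model once per category and merge the results. Below is a sketch: the `model.chat` / `tokenizer` / `pixel_values` names are from the standard InternVL quickstart and are assumed here, and the parser is my own guess at the `name[[x1, y1, x2, y2], ...]` output format (coordinates normalized to 0-1000).

```python
import json
import re

def parse_grounding(answer: str) -> dict:
    """Parse output lines like 'car[[75, 658, 250, 871], [286, 558, 444, 711]]'
    into {category: [[x1, y1, x2, y2], ...]} (coordinates in the 0-1000 range)."""
    results = {}
    for line in answer.strip().splitlines():
        m = re.match(r"^\s*([A-Za-z_][\w ]*?)(\[\[.*\]\])\s*$", line)
        if m:
            # The box list is valid JSON, so json.loads can parse it directly.
            results[m.group(1)] = json.loads(m.group(2))
    return results

# Hypothetical per-category loop (model loading omitted; see the InternVL quickstart):
# categories = ["car", "truck", "van", "bus", "pedestrian", "cyclist", "tricyclist", "motorcyclist"]
# boxes = {}
# for cat in categories:
#     prompt = f"<image>\nPlease provide the bounding box coordinate of <ref>{cat}</ref> in the image."
#     answer = model.chat(tokenizer, pixel_values, prompt, generation_config)
#     boxes.update(parse_grounding(answer))
```

This trades one call for N calls, but it stays within the prompt format the model was actually trained on.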
Hi,
Sorry, our model's SFT data does not include a dataset that specifically outputs multiple bounding boxes at a time. If you need this capability, you may have to prepare a suitable amount of data to fine-tune the model and activate it.
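For reference, a single multi-object training sample might look like the sketch below, following the conversations-style JSONL with `<ref>`/`<box>` tags from the grounding section of the chat data format docs; the image path, boxes, and exact field values here are placeholders, so check them against the documentation before building a real dataset.

```python
import json

# One hypothetical multi-object grounding sample for SFT (coordinates
# normalized to the 0-1000 range, as in InternVL's grounding data format).
sample = {
    "id": 0,
    "image": "images/000001.jpg",
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nPlease detect and provide all the bounding boxes of "
                     "<ref>car</ref>, <ref>truck</ref> in the image.",
        },
        {
            "from": "gpt",
            "value": "<ref>car</ref><box>[[75, 658, 250, 871], [286, 558, 444, 711]]</box>\n"
                     "<ref>truck</ref><box>[[440, 208, 502, 255]]</box>",
        },
    ],
}

line = json.dumps(sample)  # one line of the training JSONL
```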