How to use the grounding ability with multiple objects and categories?
I have tried to make InternVL2.5-8B output the bounding boxes of the elements in an image, but it doesn't work. My prompt is
<image>\nPlease detect and provide all the bounding box of <ref>car</ref>,<ref>truck</ref>,<ref>van</ref>,<ref>bus</ref>,
<ref>pedestrian</ref>,<ref>cyclist</ref>,<ref>tricyclist</ref>,<ref>motorcyclist</ref> in the following image.
The answer looks like this:
User: <image>
Please detect and label all <ref>car</ref>,<ref>truck</ref>,<ref>van</ref>,<ref>bus</ref>,
<ref>pedestrian</ref>,<ref>cyclist</ref>,<ref>tricyclist</ref>,<ref>motorcyclist</ref> in the following image and mark their positions.
Assistant: car[[75, 658, 250, 871], [286, 558, 444, 711]]
truck[[440, 208, 502, 255]]
van[[446, 240, 527, 311]]
bus[[500, 244, 616, 363]]
pedestrian[[0, 1000, 999, 998]]
So I tried the prompt shown in https://internvl.readthedocs.io/en/latest/get_started/chat_data_format.html#grounding-detection-data
<image>\nPlease provide the bounding box coordinate of <ref>car</ref> in the image.
Then the answer looks like this:
User: <image>
Please provide the bounding box coordinate of <ref>car</ref> in the image.
Assistant: car[[561, 660, 695, 1000]]
Questions/Concerns:
Is there a recommended prompt format or additional instructions for detecting multiple objects at once in the InternVL documentation? How should I structure the prompt, or do I have to fine-tune the model myself?
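In the meantime, since the single-category prompt does work, a workaround I'm considering is to query the model once per category and merge the results. Below is a sketch: the `model.chat` / `tokenizer` / `pixel_values` names are from the standard InternVL quickstart and are assumed here, and the parser is my own guess at the `name[[x1, y1, x2, y2], ...]` output format (coordinates normalized to 0-1000).

```python
import json
import re

def parse_grounding(answer: str) -> dict:
    """Parse output lines like 'car[[75, 658, 250, 871], [286, 558, 444, 711]]'
    into {category: [[x1, y1, x2, y2], ...]} (coordinates in the 0-1000 range)."""
    results = {}
    for line in answer.strip().splitlines():
        m = re.match(r"^\s*([A-Za-z_][\w ]*?)(\[\[.*\]\])\s*$", line)
        if m:
            # The box list is valid JSON, so json.loads can parse it directly.
            results[m.group(1)] = json.loads(m.group(2))
    return results

# Hypothetical per-category loop (model loading omitted; see the InternVL quickstart):
# categories = ["car", "truck", "van", "bus", "pedestrian", "cyclist", "tricyclist", "motorcyclist"]
# boxes = {}
# for cat in categories:
#     prompt = f"<image>\nPlease provide the bounding box coordinate of <ref>{cat}</ref> in the image."
#     answer = model.chat(tokenizer, pixel_values, prompt, generation_config)
#     boxes.update(parse_grounding(answer))
```

This trades one call for N calls, but it stays within the prompt format the model was actually trained on.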
Hi,
Sorry, our model's SFT data does not include a dataset that specifically outputs multiple bounding boxes at a time. If you need this capability, you may have to prepare a suitable amount of data to fine-tune the model and activate it.
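For reference, a single multi-object training sample might look like the sketch below, following the conversations-style JSONL with `<ref>`/`<box>` tags from the grounding section of the chat data format docs; the image path, boxes, and exact field values here are placeholders, so check them against the documentation before building a real dataset.

```python
import json

# One hypothetical multi-object grounding sample for SFT (coordinates
# normalized to the 0-1000 range, as in InternVL's grounding data format).
sample = {
    "id": 0,
    "image": "images/000001.jpg",
    "conversations": [
        {
            "from": "human",
            "value": "<image>\nPlease detect and provide all the bounding boxes of "
                     "<ref>car</ref>, <ref>truck</ref> in the image.",
        },
        {
            "from": "gpt",
            "value": "<ref>car</ref><box>[[75, 658, 250, 871], [286, 558, 444, 711]]</box>\n"
                     "<ref>truck</ref><box>[[440, 208, 502, 255]]</box>",
        },
    ],
}

line = json.dumps(sample)  # one line of the training JSONL
```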