InternVL [Feature] I want to output the bounding boxes of two people fighting in an image. How should I design the prompt?

Motivation

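1. The prompt was "\nPlease provide the bounding box coordinate of the region this sentence describes: {the person who is fighting in this image}". The model did not pick up on the "fighting" part. It only output a bounding box for a person, and it did not output separate bounding boxes for the two people.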

2. The prompt was "\nPlease provide the bounding box coordinate of the region this sentence describes: {the person who is fighting on the left} and Please provide the bounding box coordinate of the region this sentence describes: {the person who is fighting on the right}". The model failed entirely to output bounding boxes in the [x1,y1,x2,y2] format.

3. The prompt was "\nPlease provide the bounding box coordinate of the region this sentence describes: {the person who is fighting on the left}". The result was the same as in 1.

4. The prompt was "Please detect all fighters in the following image and mark their positions". The model detected every object in the image and reported its position.

Is multi-turn dialogue really the only way to achieve what I need?
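For reference, the one-query-per-person approach could be sketched roughly as below. This is only a sketch under assumptions: `ask` is a hypothetical stand-in for whatever function sends a single prompt (together with the image) to the model and returns its text reply, and `build_prompts` and `query_each` are illustrative names, not part of InternVL. The template string is the one used in the prompts above.

```python
from typing import Callable, Dict, List

# Template copied from the prompts quoted in this issue.
TEMPLATE = (
    "Please provide the bounding box coordinate of the region "
    "this sentence describes: {description}"
)


def build_prompts(descriptions: List[str]) -> List[str]:
    """Return one grounding prompt per target description."""
    return [TEMPLATE.format(description=d) for d in descriptions]


def query_each(ask: Callable[[str], str], descriptions: List[str]) -> Dict[str, str]:
    """Send one prompt per description and collect the raw replies.

    `ask` is a hypothetical callable (assumption): it takes one prompt
    string and returns the model's text reply for the same image.
    """
    prompts = build_prompts(descriptions)
    return {d: ask(p) for d, p in zip(descriptions, prompts)}
```

Called with ["the person who is fighting on the left", "the person who is fighting on the right"], this issues two independent queries instead of joining both requests in one prompt as in attempt 2. Whether the model then separates the two people is something that would still need to be checked.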

Related resources

No response

Additional context

No response

Sep 24 '24 09:09 claraore

To achieve the desired output format, consider specifying it explicitly in the prompt or exploring the use of a larger language model with enhanced capabilities.
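As a rough illustration of specifying the format explicitly, the sketch below appends a format instruction to the prompt and then extracts every [x1, y1, x2, y2] box from the reply. The wording of `FORMAT_HINT` and the helper name `parse_boxes` are assumptions made for this example, not something InternVL defines, and the parser only handles integer coordinates.

```python
import re
from typing import List, Tuple

Box = Tuple[int, int, int, int]

# Hypothetical wording (assumption); append it to the prompt to pin down the format.
FORMAT_HINT = (
    "Answer with one bounding box per person, each on its own line, "
    "in the form [x1, y1, x2, y2]."
)

# Matches four comma-separated integers inside square brackets.
_BOX = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def parse_boxes(reply: str) -> List[Box]:
    """Extract all well-formed integer boxes from a model reply."""
    boxes: List[Box] = []
    for match in _BOX.finditer(reply):
        x1, y1, x2, y2 = map(int, match.groups())
        # Keep only boxes with positive width and height.
        if x1 < x2 and y1 < y2:
            boxes.append((x1, y1, x2, y2))
    return boxes
```

If `parse_boxes` returns fewer than two boxes for an image with two fighters, that is a signal the reply did not follow the requested format and the prompt needs another iteration.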

Sep 24 '24 11:09 qishisuren123

> To achieve the desired output format, consider specifying it explicitly in the prompt or exploring the use of a larger language model with enhanced capabilities.

Thanks for your advice, I'll start by changing the prompt. Is prompt 3, "the person who is fighting on the left", not specific enough? I could describe the person's clothes to locate them, but my requirement is to locate the fighters no matter what they are wearing.

Sep 25 '24 01:09 claraore