Does Yi-VL support grounding?

Open neuwangmq opened this issue 1 year ago • 5 comments

Reminder

  • [X] I have searched the Github Discussion and issues and have not found anything similar to this.

Motivation

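Hello, I see that the VL training data includes the RefCOCO dataset. Is this capability supported at inference time? That is, given an input image and a suitable prompt, can the model output the location of the desired bounding box? If so, could you provide an example prompt template for testing? Thank you!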

Solution

No response

Alternatives

No response

Anything Else?

No response

Are you willing to submit a PR?

  • [ ] I'm willing to submit a PR!

neuwangmq avatar Feb 01 '24 03:02 neuwangmq

Yi-VL is theoretically capable of providing the bounding box of a target in a given image. Its performance is not guaranteed, but it should work reasonably well. Please try using the following prompt as a template:

 'Given the image, provide the bounding box coordinate of the region this sentence describes:'

The full question can be built like this:

question = DEFAULT_IMAGE_TOKEN + '\nGiven the image, provide the bounding box coordinate of the region this sentence describes:\n' + data['sent'] + '\n'

Note that the output coordinates are probably normalized, so you may need to project them back onto the input image according to its dimensions: an output (x, y) should be transformed to (x * 999 / image_width, y * 999 / image_height).
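
A minimal sketch of assembling that question string in Python. The value assigned to `DEFAULT_IMAGE_TOKEN` is a stand-in assumption (in practice it would be imported from the Yi-VL code base), and `build_grounding_question` is a hypothetical helper name, not part of any library.

```python
# Assumption: stand-in value; normally imported from the Yi-VL inference code.
DEFAULT_IMAGE_TOKEN = "<image_placeholder>"

# Instruction text taken verbatim from the prompt template in this thread.
GROUNDING_INSTRUCTION = (
    "Given the image, provide the bounding box coordinate "
    "of the region this sentence describes:"
)


def build_grounding_question(sentence: str) -> str:
    """Return the image token, the instruction, and the referring sentence."""
    return f"{DEFAULT_IMAGE_TOKEN}\n{GROUNDING_INSTRUCTION}\n{sentence.strip()}\n"


if __name__ == "__main__":
    # Hypothetical referring expression, e.g. the 'sent' field of a RefCOCO-style record.
    print(build_grounding_question("the cat on the left"))
```

Keeping the instruction in one constant makes it easy to reuse the same wording across many referring sentences.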

markli404 avatar Feb 01 '24 03:02 markli404

[image] Thank you very much!

neuwangmq avatar Feb 02 '24 10:02 neuwangmq

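I tried a prompt like "Given the image, provide the bounding boxes coordinate of the regions this sentence describes: all the cats in the image" to get the model to box every cat in the image, but it does not seem able to box all of them. Do you know of any approach or prompting technique that would return all of the bounding boxes? @markli404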

neuwangmq avatar Feb 02 '24 11:02 neuwangmq

You could try first asking the model how many cats there are, and then asking it for the coordinates of each one separately.
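
A rough sketch of that two-step idea, under several assumptions: `ask` is a hypothetical callable that sends one text prompt (together with the image) to the model and returns its text reply, and both the counting question and the "number i, counting from left to right" phrasing are illustrative guesses rather than tested prompts.

```python
import re
from typing import Callable, List

# Instruction text taken verbatim from the prompt template in this thread.
GROUNDING_INSTRUCTION = (
    "Given the image, provide the bounding box coordinate "
    "of the region this sentence describes:"
)


def locate_each(ask: Callable[[str], str], noun: str) -> List[str]:
    """Ask for a count first, then issue one grounding query per instance."""
    # Step 1: counting question (wording is an assumption).
    count_reply = ask(f"How many {noun}s are in the image? Answer with a number.")
    match = re.search(r"\d+", count_reply)
    count = int(match.group()) if match else 0

    # Step 2: one grounding query per instance (referring phrase is an assumption).
    replies = []
    for i in range(1, count + 1):
        sentence = f"{noun} number {i}, counting from left to right"
        replies.append(ask(f"{GROUNDING_INSTRUCTION}\n{sentence}\n"))
    return replies
```

The function returns the raw replies so that the caller can decide how to parse the coordinates, since the exact output format is not specified here.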

markli404 avatar Feb 02 '24 15:02 markli404

It appears that the correct way to recover the coordinates is (x * image_width / 999, y * image_height / 999). You can verify this with the cat image above. The image size is only 599×400, yet the output coordinates are (564, 284) and (957, 845).
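
A small sketch of this conversion, checked against the numbers quoted above. The helper name `to_pixel_box` is hypothetical, and treating the two output points as the top-left and bottom-right corners of one box is an assumption.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]


def to_pixel_box(box: Box, image_width: int, image_height: int,
                 scale: float = 999.0) -> Box:
    """Map a box from the model's 0..999 range to pixel coordinates."""
    x1, y1, x2, y2 = box
    return (
        x1 * image_width / scale,
        y1 * image_height / scale,
        x2 * image_width / scale,
        y2 * image_height / scale,
    )


if __name__ == "__main__":
    # Values from the cat example: a 599x400 image, output (564, 284) and (957, 845).
    print(to_pixel_box((564, 284, 957, 845), image_width=599, image_height=400))
```

With this direction of scaling, every converted coordinate falls inside the 599×400 image, whereas multiplying by 999 and dividing by the image size would push them far outside it.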

zenjieli avatar Feb 13 '24 16:02 zenjieli

[image] The method described above worked, thanks!

TommyZihao avatar May 15 '24 08:05 TommyZihao