[Bug] InternVL3 object detection: coordinates drift

Open deepblacksky opened this issue 7 months ago • 6 comments

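When InternVL3 is used for object detection, the returned coordinates are not in the original image's coordinate space and are offset by a large amount. The 14B and 8B models give the same result.

Image: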

Image

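prompt: "You are an advanced visual analysis model. Follow these steps strictly:

  1. Detect all people in the image and generate bounding-box coordinates for each person (format: x1,y1,x2,y2, in pixel values).
  2. Output strict JSON containing the following fields: { "persons": [ { "bbox": [x1, y1, x2, y2] }, ... ] } "

Output: { "persons": [ { "bbox": [155, 180, 440, 700] } ] }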

Image

deepblacksky avatar Apr 25 '25 08:04 deepblacksky

The documentation says that the grounding output uses relative coordinates.

def normalize_coordinates(box, image_width, image_height):
    x1, y1, x2, y2 = box
    normalized_box = [
        round((x1 / image_width) * 1000),
        round((y1 / image_height) * 1000),
        round((x2 / image_width) * 1000),
        round((y2 / image_height) * 1000)
    ]
    return normalized_box
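
Going the other way, from the model's output back to pixels, might look like the sketch below. It assumes the model emits boxes on the same 0-1000 scale that normalize_coordinates produces; the function name denormalize_coordinates and the 1920x1080 image size are hypothetical and only serve the illustration.

```python
def denormalize_coordinates(box, image_width, image_height, scale=1000):
    """Map a box from an assumed 0..scale range back to pixel coordinates."""
    x1, y1, x2, y2 = box
    return [
        round(x1 * image_width / scale),
        round(y1 * image_height / scale),
        round(x2 * image_width / scale),
        round(y2 * image_height / scale),
    ]


# Box reported in the issue, applied to a hypothetical 1920x1080 image.
pixel_box = denormalize_coordinates([155, 180, 440, 700], 1920, 1080)
print(pixel_box)
```

If the box drawn from pixel_box still misses the person, the offset is probably not explained by the 0-1000 scaling alone.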

zliucz avatar Apr 26 '25 09:04 zliucz

@zliucz After normalization, the bboxes are still far off. I wonder if this is a model issue.

Image

lilyzhng avatar Jun 29 '25 23:06 lilyzhng

I see the same thing in my tests: the grounding results are a mess, and there is no official code that reproduces grounding with inference through the lmdeploy API.

ZanePoe avatar Jul 30 '25 03:07 ZanePoe

To me this looks like the bbox position in the image as it is fed to the model. Maybe try asking it to output normalized results, or try resizing your image:

import cv2

image_path = "437349580-75fdd40f-0cdc-46ac-81eb-74a24d22d873.png"

img = cv2.imread(image_path)
img = cv2.resize(img, (1024, 1024))
cv2.rectangle(img, (155, 180), (440, 700), (0, 0, 255), 2)
cv2.imshow("img", img)
cv2.waitKey()
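
If the box really is expressed in the resized frame, it could be mapped back onto the original image by undoing the resize scaling per axis. This is a minimal sketch, assuming a 1024x1024 model input as in the snippet above; the function name map_box_to_original and the 2048x1536 original size are hypothetical.

```python
def map_box_to_original(box, resized_size, original_size):
    """Scale a box from a resized frame back to the original image.

    Sizes are (width, height) tuples. The resize is assumed to stretch
    each axis independently, as a plain cv2.resize call does.
    """
    x1, y1, x2, y2 = box
    resized_w, resized_h = resized_size
    original_w, original_h = original_size
    sx = original_w / resized_w  # horizontal scale factor
    sy = original_h / resized_h  # vertical scale factor
    return [round(x1 * sx), round(y1 * sy), round(x2 * sx), round(y2 * sy)]


# Box from the issue, assumed to live in a 1024x1024 frame,
# mapped onto a hypothetical 2048x1536 original.
original_box = map_box_to_original([155, 180, 440, 700], (1024, 1024), (2048, 1536))
print(original_box)
```

This does not account for padding or tiling that a model's preprocessing might apply; it only reverses a direct stretch to a fixed size.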

Image

Xx46883339 avatar Aug 07 '25 05:08 Xx46883339

same problem

Lwen1243 avatar Aug 11 '25 03:08 Lwen1243

CogVLM grounds far better than InternVL does. https://huggingface.co/zai-org/cogvlm-grounding-generalist-hf

hwang136 avatar Sep 18 '25 16:09 hwang136