[Bug] Latest InternVL3.5 gives severely misaligned results for 2D grounding

Open FeiMa-REC opened this issue 4 months ago • 10 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.

Describe the bug

With the newly released InternVL3_5-8B, the results of 2D grounding are badly misaligned. What causes this?

My prompt: """ Outline the positions of the vehicles and pedestrians and output all the coordinates in JSON format.

    Target type:
        -car: Sedan, SUV, off-road vehicle
        -truck: Truck, freight car
        -bus: Public bus, coach
        -motorcycle: Motorcycles, electric vehicles
        -bicycle: bicycle
        -person: Pedestrian
    
    Testing standard
    1. Only detect clearly visible and complete targets
    2. Ignore partially occluded or blurred targets
    3. The bounding box should closely adhere to the outline of the target
    4. Each target is detected only once
    5. Focus on moving targets on the road

    output format:
        [
        {"bbox_2d": [139, 768, 315, 954], "label": "truck"},
        {"bbox_2d": [366, 679, 536, 849], "label": "car"}
        ]

        Please start the precise detection now

"""

The 2D results are as follows:

Image

Reproduction

The output coordinates have already been denormalized.

Raw output:

[ {"bbox_2d": [360, 470, 635, 650], "label": "car"}, {"bbox_2d": [206, 587, 277, 704], "label": "person"}, {"bbox_2d": [683, 548, 761, 670], "label": "motorcycle"}, {"bbox_2d": [811, 525, 887, 604], "label": "motorcycle"}, {"bbox_2d": [797, 495, 879, 589], "label": "person"} ]

Inverse transform:

    x1 = round((x1 * 3840) / 1000)
    y1 = round((y1 * 2160) / 1000)
    x2 = round((x2 * 3840) / 1000)
    y2 = round((y2 * 2160) / 1000)

Environment

model: InternVL3_5-8B

Error traceback


FeiMa-REC, Sep 02 '25 02:09

My understanding is that the inverse transform should be x1 = x1 / 1000 * w, not x1 = x1 / w * 1000.

Weiyun1025, Sep 02 '25 03:09

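> My understanding is that the inverse transform should be x1 = x1 / 1000 * w, not x1 = x1 / w * 1000.

I must have misunderstood this earlier. I have now corrected the inverse transform and rerun the task. The results seem to have improved, but the predicted boxes are still offset.

Latest model output: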

[
    {"bbox_2d": [420, 390, 615, 640], "label": "car"},
    {"bbox_2d": [692, 419, 760, 547], "label": "motorcycle"},
    {"bbox_2d": [808, 436, 838, 520], "label": "person"},
    {"bbox_2d": [755, 436, 783, 500], "label": "person"}
]

Inverse transform:

    x1 = round((x1 / 1000) * 3840)
    y1 = round((y1 / 1000) * 2160)
    x2 = round((x2 / 1000) * 3840)
    y2 = round((y2 / 1000) * 2160)
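
As a minimal sketch of the corrected inverse transform above, the four per-coordinate lines can be wrapped in one helper that parses the model's JSON reply and maps every box from the 0-1000 grid to pixels. The function name `denormalize_boxes`, the `bbox_px` key, and the `scale` parameter are illustrative and not part of InternVL; the 3840x2160 image size is taken from the formulas in this thread.

```python
import json


def denormalize_boxes(raw_output, width, height, scale=1000):
    """Map boxes from the model's 0..scale grid to pixel coordinates.

    Hypothetical helper: assumes the reply is a JSON list of objects
    with "bbox_2d" ([x1, y1, x2, y2]) and "label" keys, as in the thread.
    """
    boxes = []
    for item in json.loads(raw_output):
        x1, y1, x2, y2 = item["bbox_2d"]
        boxes.append({
            "label": item["label"],
            "bbox_px": [
                round(x1 / scale * width),   # x uses the image width
                round(y1 / scale * height),  # y uses the image height
                round(x2 / scale * width),
                round(y2 / scale * height),
            ],
        })
    return boxes


raw = '[{"bbox_2d": [420, 390, 615, 640], "label": "car"}]'
print(denormalize_boxes(raw, 3840, 2160))
```

The width and height must be those of the original image, not of any resized copy used for display.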

Image

FeiMa-REC, Sep 02 '25 03:09

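I suggest trying the prompt we used during training: Please provide the bounding box coordinate of the region this sentence describes: <ref>{}</ref>. See this script for details. Current multimodal models are indeed not very good at detection-style output of multiple bboxes at once, so try requesting only one bbox per query.

Weiyun1025, Sep 02 '25 05:09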
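
The one-box-per-query suggestion above could be sketched as follows: build one prompt per target description from the training template, then pull a single box out of each reply. The template string is the one quoted in the comment. The helper names and the reply format (four integers inside square brackets, as in the sample string below) are assumptions for illustration, and the actual model call is deliberately left out.

```python
import re

# Prompt template quoted by the maintainer in this thread.
REF_PROMPT = (
    "Please provide the bounding box coordinate of the region "
    "this sentence describes: <ref>{}</ref>"
)

# ASSUMPTION: a reply contains one box written as four integers in
# brackets, e.g. "[[420, 390, 615, 640]]". Adjust if the format differs.
BOX_RE = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def build_prompts(descriptions):
    """Return one grounding prompt per target description."""
    return [REF_PROMPT.format(d) for d in descriptions]


def parse_first_box(reply):
    """Return the first [x1, y1, x2, y2] found in a reply, or None."""
    match = BOX_RE.search(reply)
    return [int(g) for g in match.groups()] if match else None


for prompt in build_prompts(["the white car", "the person on the left"]):
    print(prompt)

# Hypothetical reply string, used only to exercise the parser.
print(parse_first_box("<ref>the white car</ref><box>[[420, 390, 615, 640]]</box>"))
```

Each parsed box is still on the 0-1000 grid and needs the same inverse transform as before.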

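> I suggest trying the prompt we used during training: Please provide the bounding box coordinate of the region this sentence describes: <ref>{}</ref>. See this script for details. Current multimodal models are indeed not very good at detection-style output of multiple bboxes at once, so try requesting only one bbox per query.

OK, thanks for your reply. I'll give it a try :)

FeiMa-REC, Sep 02 '25 05:09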

@FeiMa-REC May I ask whether the problem has been fixed?

whcjb, Sep 05 '25 01:09

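> @FeiMa-REC May I ask whether the problem has been fixed?

From my testing so far, for 2D grounding that outputs multiple targets in a single pass, Seed1.6VL performs best. If only a single bbox is output per query, InternVL3.5 is also good enough.

FeiMa-REC, Sep 05 '25 01:09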
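
The thread does not say how the models were compared. One common way to quantify how far a predicted box is from a reference box is intersection over union (IoU); the sketch below is an assumption about how such a comparison could be done, not the method used by the poster. Both boxes must be in the same coordinate system (both in pixels or both on the 0-1000 grid).

```python
def iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# A box shifted by half its width overlaps its original by one third.
print(iou([0, 0, 10, 10], [5, 0, 15, 10]))
```

A value near 1 means the boxes coincide, and a value near 0 means they barely overlap, which makes a systematic offset easy to spot across many images.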

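> > @FeiMa-REC May I ask whether the problem has been fixed?
>
> From my testing so far, for 2D grounding that outputs multiple targets in a single pass, Seed1.6VL performs best. If only a single bbox is output per query, InternVL3.5 is also good enough.

Thanks a lot. Is this conclusion based on fine-tuning on a custom dataset?

whcjb, Sep 05 '25 01:09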

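> > > @FeiMa-REC May I ask whether the problem has been fixed?
> >
> > From my testing so far, for 2D grounding that outputs multiple targets in a single pass, Seed1.6VL performs best. If only a single bbox is output per query, InternVL3.5 is also good enough.
>
> Thanks a lot. Is this conclusion based on fine-tuning on a custom dataset?

Yes.

FeiMa-REC, Sep 05 '25 01:09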

So the coordinates are in the 0-1000 range. No wonder my output was always wrong.

lin12058, Sep 09 '25 18:09
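
Given the 0-1000 convention discussed in this thread, pixel-space annotations would need the forward mapping before they can be compared with raw model output or used as targets. The helper below is a hypothetical sketch of that direction; the name `to_normalized` and the clamping step are illustrative choices, not something the thread or the InternVL code confirms.

```python
def to_normalized(box_px, width, height, scale=1000):
    """Map a pixel-space [x1, y1, x2, y2] box onto the 0..scale grid."""
    x1, y1, x2, y2 = box_px
    scaled = [
        round(x1 / width * scale),
        round(y1 / height * scale),
        round(x2 / width * scale),
        round(y2 / height * scale),
    ]
    # Clamp so boxes touching or exceeding the image edge stay in range.
    return [min(max(v, 0), scale) for v in scaled]


# Pixel box on a 3840x2160 image mapped back to the 0-1000 grid.
print(to_normalized([1613, 842, 2362, 1382], 3840, 2160))
```

Because both directions round to integers, a round trip can differ from the original by a pixel or so.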

Roughly how many parameters (B) does InternVL3.5 need for acceptable 2D grounding? I tried the 4B model and found its accuracy very poor (by comparison, Qwen's Omni 30B-A3B is very accurate).

lywy233, Sep 28 '25 03:09