InternVL Grounding微调的prompt模版

请问面向Grounding任务的V1.2-plus的微调，prompt的模版是否是以下的格式： 'Please provide the bounding box coordinate of the region this sentence describes: XXX'， (这个prompt是参考自refcoco的评测脚本https://github.com/OpenGVLab/InternVL/blob/main/internvl_chat/eval/refcoco/evaluate_grounding.py#L248C18-L248C104 ) 即数据的格式是否如下：

{
  "id": 0,
  "image": "images/5.png",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nPlease provide the bounding box coordinate of the region this sentence describes: XXX"
    },
    {
      "from": "gpt",
      "value": "XXX[[253, 231, 733, 787]]"
    }
}

Mar 07 '24 09:03 Pro-flynn

在InternVL-Chat-V1-2-Plus中我们用的格式是这样：

{
  "id": 0,
  "image": "images/5.png",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nPlease provide the bounding box coordinate of the region this sentence describes: <ref>XXX</ref>"
    },
    {
      "from": "gpt",
      "value": "<ref>XXX</ref><box>[[253, 231, 733, 787]]</box>"
    }
}

Mar 07 '24 11:03 czczup

在InternVL-Chat-V1-2-Plus中我们用的格式是这样：

{
  "id": 0,
  "image": "images/5.png",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nPlease provide the bounding box coordinate of the region this sentence describes: <ref>XXX</ref>"
    },
    {
      "from": "gpt",
      "value": "<ref>XXX</ref><box>[[253, 231, 733, 787]]</box>"
    }
}

@czczup 请问负样本(即图中没有待检测的目标)的“gpt”的value 该如何写

Mar 07 '24 12:03 Pro-flynn

我们的v1.2-plus使用的中文数据中有一些构造的负样本，是这样的格式：

{"image": "train2017/000000179725.jpg", "conversations": [{"from": "human", "value": "<image>\n这是哪个旅游景点？"}, {"from": "gpt", "value": "这张图片可能并非拍摄的旅游景点，而是其他场景。"}], "id": 6076}
{"image": "train2017/000000000384.jpg", "conversations": [{"from": "human", "value": "<image>\n图片中的人物是谁？"}, {"from": "gpt", "value": "图片中似乎没有人物，因此我无法做出回答。"}], "id": 399}
{"image": "train2017/000000070125.jpg", "conversations": [{"from": "human", "value": "<image>\n能否就图片中人物的服饰给出详尽的描述？"}, {"from": "gpt", "value": "这张图片里没有人物，所以我无法提供关于服饰的描述。"}], "id": 26}

我们没有给Grounding任务造负样本，所以现在如果向模型提问：Please provide the bounding box coordinate of the region this sentence describes: xxx (一个图中不存在的东西)，模型还是会随机生成一个坐标框。

你如果想造负样本的话，类似这样的格式即可：

{
  "id": 0,
  "image": "images/5.png",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nPlease provide the bounding box coordinate of the region this sentence describes: <ref>XXX</ref>"
    },
    {
      "from": "gpt",
      "value": "Sorry, this image doesn't show any XXX."
    }
}

可以使用ChatGPT丰富一下回答模版的种类，在构造数据时随机选择。

Mar 07 '24 13:03 czczup

我们的v1.2-plus使用的中文数据中有一些构造的负样本，是这样的格式：
{"image": "train2017/000000179725.jpg", "conversations": [{"from": "human", "value": "<image>\n这是哪个旅游景点？"}, {"from": "gpt", "value": "这张图片可能并非拍摄的旅游景点，而是其他场景。"}], "id": 6076}
{"image": "train2017/000000000384.jpg", "conversations": [{"from": "human", "value": "<image>\n图片中的人物是谁？"}, {"from": "gpt", "value": "图片中似乎没有人物，因此我无法做出回答。"}], "id": 399}
{"image": "train2017/000000070125.jpg", "conversations": [{"from": "human", "value": "<image>\n能否就图片中人物的服饰给出详尽的描述？"}, {"from": "gpt", "value": "这张图片里没有人物，所以我无法提供关于服饰的描述。"}], "id": 26}
我们没有给Grounding任务造负样本，所以现在如果向模型提问：Please provide the bounding box coordinate of the region this sentence describes: xxx (一个图中不存在的东西)，模型还是会随机生成一个坐标框。

你如果想造负样本的话，类似这样的格式即可：
{
  "id": 0,
  "image": "images/5.png",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nPlease provide the bounding box coordinate of the region this sentence describes: <ref>XXX</ref>"
    },
    {
      "from": "gpt",
      "value": "Sorry, this image doesn't show any XXX."
    }
}
可以使用ChatGPT丰富一下回答模版的种类，在构造数据时随机选择。

@czczup 感谢您快速的回复！再咨询一个问题，在v1.2-plus的预训练或指令微调的阶段 Grounding任务的数据中有中文的prompt吗，即我基于v1.2-plus做Grounding的微调的话能否使用使用中文的prompt

Mar 07 '24 13:03 Pro-flynn

我们的v1.2-plus使用的中文数据中有一些构造的负样本，是这样的格式：
{"image": "train2017/000000179725.jpg", "conversations": [{"from": "human", "value": "<image>\n这是哪个旅游景点？"}, {"from": "gpt", "value": "这张图片可能并非拍摄的旅游景点，而是其他场景。"}], "id": 6076}
{"image": "train2017/000000000384.jpg", "conversations": [{"from": "human", "value": "<image>\n图片中的人物是谁？"}, {"from": "gpt", "value": "图片中似乎没有人物，因此我无法做出回答。"}], "id": 399}
{"image": "train2017/000000070125.jpg", "conversations": [{"from": "human", "value": "<image>\n能否就图片中人物的服饰给出详尽的描述？"}, {"from": "gpt", "value": "这张图片里没有人物，所以我无法提供关于服饰的描述。"}], "id": 26}
我们没有给Grounding任务造负样本，所以现在如果向模型提问：Please provide the bounding box coordinate of the region this sentence describes: xxx (一个图中不存在的东西)，模型还是会随机生成一个坐标框。你如果想造负样本的话，类似这样的格式即可：
{
  "id": 0,
  "image": "images/5.png",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nPlease provide the bounding box coordinate of the region this sentence describes: <ref>XXX</ref>"
    },
    {
      "from": "gpt",
      "value": "Sorry, this image doesn't show any XXX."
    }
}
可以使用ChatGPT丰富一下回答模版的种类，在构造数据时随机选择。
@czczup 感谢您快速的回复！再咨询一个问题，在v1.2-plus的预训练或指令微调的阶段 Grounding任务的数据中有中文的prompt吗，即我基于v1.2-plus做Grounding的微调的话能否使用使用中文的prompt

我们没在Grounding任务使用中文的prompt，不过检测任务中（Objects365数据集）用了一些，prompt是gpt生成的一些模板，类似这样：

在Grounding微调中使用中文的prompt是可以的。

Mar 07 '24 13:03 czczup

@czczup 感谢请问可以在训练过程中，验证一下验证集上的指标吗

Mar 07 '24 15:03 Pro-flynn

感觉想要finetune coco caption涨点太难了，咋tune都掉点...

Mar 12 '24 10:03 czczup

我们的v1.2-plus使用的中文数据中有一些构造的负样本，是这样的格式：
{"image": "train2017/000000179725.jpg", "conversations": [{"from": "human", "value": "<image>\n这是哪个旅游景点？"}, {"from": "gpt", "value": "这张图片可能并非拍摄的旅游景点，而是其他场景。"}], "id": 6076}
{"image": "train2017/000000000384.jpg", "conversations": [{"from": "human", "value": "<image>\n图片中的人物是谁？"}, {"from": "gpt", "value": "图片中似乎没有人物，因此我无法做出回答。"}], "id": 399}
{"image": "train2017/000000070125.jpg", "conversations": [{"from": "human", "value": "<image>\n能否就图片中人物的服饰给出详尽的描述？"}, {"from": "gpt", "value": "这张图片里没有人物，所以我无法提供关于服饰的描述。"}], "id": 26}
我们没有给Grounding任务造负样本，所以现在如果向模型提问：Please provide the bounding box coordinate of the region this sentence describes: xxx (一个图中不存在的东西)，模型还是会随机生成一个坐标框。

你如果想造负样本的话，类似这样的格式即可：
{
  "id": 0,
  "image": "images/5.png",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\nPlease provide the bounding box coordinate of the region this sentence describes: <ref>XXX</ref>"
    },
    {
      "from": "gpt",
      "value": "Sorry, this image doesn't show any XXX."
    }
}
可以使用ChatGPT丰富一下回答模版的种类，在构造数据时随机选择。

1、我发现在Grounding微调时造负样本(如以上造gt是"Sorry, this image doesn't show any XXX.")无效，我做了好多实验, 发现无论怎么调，不是矫枉过正(正负样本都输出"Sorry,this image doesn't show any XXX")就是训练不成功(负样本无法输出"this image doesn't show any XXX")；而我做VQA的相关微调实验结果就很正常，请问这是不是由于模型在预训练阶段没有造负样本，所以在微调阶段怎么都掰不过来 2、负样本推理时不是给随机框而是给了一个相近语义的框，比如要检XXX的人，如果图中没有就会给人的框以上2个问题想跟您讨论一下 @czczup

Mar 15 '24 09:03 Pro-flynn

现在这两个问题还有吗

Mar 25 '24 13:03 czczup

请问下正负样本的构造必须要加<image>\n 前缀吗？关于这块儿有没有其他说明呢 @czczup

Mar 27 '24 08:03 Ying-Kang

请问下正负样本的构造必须要加<image>\n 前缀吗？关于这块儿有没有其他说明呢 @czczup

所有有图像的数据都需要在第一轮对话的question里加上<image>\n，如果是纯文本数据不要加这个前缀。

Mar 27 '24 08:03 czczup

现在这两个问题还有吗

@czczup 请问这2个问题您有跟进吗或者现在有啥建议的解决方案吗

Mar 28 '24 06:03 Pro-flynn

现在这两个问题还有吗

@czczup 请问这2个问题您有跟进吗或者现在有啥建议的解决方案吗

@czczup 大佬这2个问题会在您最新的模型版本中解决吗

Apr 17 '24 12:04 Pro-flynn

想问下目前1.5版本的VG任务微调，prompt版本还是一样么？在上传的playground中，sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.jsonl文件里面涉及的coco定位任务prompt似乎并没有的tag？

May 11 '24 07:05 MVP-D77

想问下目前1.5版本的VG任务微调，prompt版本还是一样么？在上传的playground中，sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.jsonl文件里面涉及的coco定位任务prompt似乎并没有的tag？

能否请问一下，要对其进行微调数据集标注格式应该用什么呀，还是prompt呢

Jun 28 '24 07:06 Gav1n-is-here

想问下目前1.5版本的VG任务微调，prompt版本还是一样么？在上传的playground中，sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.jsonl文件里面涉及的coco定位任务prompt似乎并没有的tag？

prompt基本都是一样的，XXX[[253, 231, 733, 787]]这样的结构也是一直保持的

Jul 11 '24 13:07 ErfeiCui