How can I determine which region of the image the model is focusing on when answering a specific question?

Open phongkhanh opened this issue 10 months ago • 2 comments

How can I determine which region of the image the model is focusing on when answering a specific question? Does InternVL use cross-attention between images and text? If so, how can I extract the attention between a specific region of the image and a specific word? Thanks.

phongkhanh • Feb 21 '25 08:02

Thank you for your interest in our work. You can try to achieve this effect by adding an extra prompt: "Answer the question according to the <ref>region</ref><box>[[x1, y1, x2, y2]]</box>." Our models are trained on region recognition tasks, so they can understand box coordinates. For more details about region recognition, you can refer to our papers (e.g., The All-Seeing Project and The All-Seeing Project V2).

Weiyun1025 • Aug 31 '25 04:08
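
For anyone trying the suggestion above, here is a minimal sketch of how the extra prompt could be assembled. It only does string formatting; `build_region_prompt` and its parameters are hypothetical names chosen for illustration and are not part of the InternVL codebase. The coordinate convention the model expects is not stated in this thread, so the box values are passed through unchanged.

```python
def build_region_prompt(question, box, region_label="region"):
    """Append the region hint quoted in the reply above to a question.

    `box` is (x1, y1, x2, y2). The values are inserted as given; whether
    they should be pixels or normalized values is NOT settled in this
    thread, so the caller must supply them in the convention the model expects.
    """
    x1, y1, x2, y2 = box
    if x1 > x2 or y1 > y2:
        raise ValueError("expected a box with x1 <= x2 and y1 <= y2")
    hint = (
        "Answer the question according to the "
        f"<ref>{region_label}</ref><box>[[{x1}, {y1}, {x2}, {y2}]]</box>."
    )
    # Put the hint on its own line after the user's question.
    return f"{question}\n{hint}"


if __name__ == "__main__":
    print(build_region_prompt("What is the person holding?", (120, 80, 480, 400)))
```

The resulting string would then be passed as the text prompt alongside the image, in whatever way you already call the model.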

Are the box coordinates normalized to [0, 1] relative to the input image size?

cch2016 • Nov 14 '25 05:11
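
This question is not answered in the thread. To make concrete what each convention would mean, the sketch below rescales a pixel-space box to a chosen range. The helper name `rescale_box` and the `scale` parameter are assumptions made for illustration: `scale=1.0` corresponds to the [0, 1] convention asked about here, and other values (for example an integer grid) can be tried if the model's documentation specifies a different range.

```python
def rescale_box(box, image_width, image_height, scale=1.0, as_int=False):
    """Rescale a pixel-space box (x1, y1, x2, y2) to the range [0, scale].

    Which `scale` the model expects is an open question in this thread;
    it is a parameter so that either convention can be tested.
    """
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image dimensions must be positive")
    x1, y1, x2, y2 = box
    # x coordinates are divided by the width, y coordinates by the height.
    scaled = (
        x1 / image_width * scale,
        y1 / image_height * scale,
        x2 / image_width * scale,
        y2 / image_height * scale,
    )
    if as_int:
        return tuple(int(round(v)) for v in scaled)
    return scaled


if __name__ == "__main__":
    # A 400x200 image with a box given in pixels.
    print(rescale_box((100, 50, 300, 200), 400, 200))
    print(rescale_box((100, 50, 300, 200), 400, 200, scale=1000, as_int=True))
```

Comparing the model's answers under each convention on a box whose contents are known is one way to check which range it was trained with.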