How can I determine which region of the image the model is focusing on when answering a specific question?
Does InternVL use cross-attention between images and text? If so, how can I extract the attention between a specific region of the image and a specific word? Thanks.
Thank you for your interest in our work. You can try to achieve this effect by adding an extra prompt:
Answer the question according to the <ref>region</ref><box>[[x1, y1, x2, y2]]</box>.
We train our models on region recognition tasks, so the model is able to understand box coordinates. For more details about region recognition, you can refer to our papers (e.g., The All-Seeing Project and The All-Seeing Project V2).
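As a minimal sketch of the suggestion above, the region hint can be assembled with a small helper. The function name `build_region_prompt`, the choice to append the hint after the question, and the example values are assumptions for illustration only, not part of the InternVL API. The coordinates are inserted exactly as given, because the coordinate convention is not stated in this reply.

```python
def build_region_prompt(question, box, label="region"):
    """Append the region hint from the reply above to a question.

    `box` is (x1, y1, x2, y2). The values are inserted unchanged; the
    coordinate convention (pixels or normalized) is not stated in this
    thread, so no conversion is attempted here.
    """
    x1, y1, x2, y2 = box
    # Basic sanity check: the first corner should not lie past the second.
    if x1 > x2 or y1 > y2:
        raise ValueError("expected x1 <= x2 and y1 <= y2")
    hint = (
        "Answer the question according to the "
        f"<ref>{label}</ref><box>[[{x1}, {y1}, {x2}, {y2}]]</box>."
    )
    # Assumption: the hint is placed after the question text.
    return f"{question} {hint}"


if __name__ == "__main__":
    # Hypothetical question and box values, for illustration only.
    print(build_region_prompt("What is the person holding?", (10, 20, 30, 40)))
```

The resulting string would then be passed to the model as the text part of the query, alongside the image.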
Are the box coordinates normalized to [0, 1] according to the input image size?