OFA
Question about the Visual Grounding inference result
Hi OFA team,
Thanks for this amazing work! I'm deeply impressed by it. I just have a question about the Visual Grounding result.
When the text description of an object can be found in the image, the model usually gives the correct bounding box. For example, the text "a blue turtle-like pokemon with round head" gives a perfect result:
However, when I ask the model to find an object that is not in the image, it still outputs a (wrong) bounding box. For example, the text "a black bird" gives:
In my use case, I want the model to answer only image-related queries and reject unrelated ones. Do you have any suggestions or ideas for this? Would it help to finetune the model on a new curated dataset containing both answerable and unanswerable queries? Or is it possible to fix this issue with postprocessing on top of the current model?
The model I'm using is ofa_visual-grounding_refcoco_large_en. Looking forward to your reply.
Thanks again for your time!
Yes, this is a significant problem with the current model. One way to tackle it is to compute the average probability of the generated tokens from the output logits and set a rejection threshold. We may add this option to our code in the near future.
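For reference, here is a minimal sketch of that rejection idea, assuming you can extract the decoder logits for the generated (location) tokens during inference. The `should_reject` helper and the 0.5 threshold are hypothetical illustrations, not part of the released code:

```python
import torch

def should_reject(logits: torch.Tensor, token_ids: torch.Tensor,
                  threshold: float = 0.5) -> bool:
    """Reject a grounding result when the average token confidence is low.

    logits:    (seq_len, vocab_size) decoder logits for the generated sequence.
    token_ids: (seq_len,) ids of the generated tokens (e.g. location tokens).
    """
    probs = torch.softmax(logits, dim=-1)                    # per-step distributions
    token_probs = probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)  # prob of each emitted token
    avg_prob = token_probs.mean().item()                     # average confidence over the sequence
    return avg_prob < threshold
```

The threshold would have to be tuned on a small validation set containing both answerable and unanswerable queries, since the right value depends on the model and the data.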
Hello, dear Qin. I am wondering how to run inference on the refcoco task. Do I have to write a script using fairseq? I'd appreciate it if you could help me!