OFA
Question about the Visual Grounding inference result
Hi OFA team,
Thanks for this amazing work! I'm deeply impressed by it. I just have a question about the Visual Grounding result.
When the text description of an object can be found in the image, the model usually gives the correct bounding box. For example, the text "a blue turtle-like pokemon with round head" gives a perfect result:
However, when I ask the model to find an object that is not in the image, it still outputs a (wrong) bounding box. For example, the text "a black bird" gives:
In my use case, I want the model to answer only image-related queries and reject unrelated ones. Do you have any suggestions or ideas for this? Would it help to finetune the model on a new curated dataset containing both answerable and unanswerable queries? Or is it possible to fix this issue with postprocessing on top of the current model?
The model I'm using is ofa_visual-grounding_refcoco_large_en. Looking forward to your reply.
Thanks again for your time!
Yes, this is a significant problem with the current model. One way to tackle it is to compute the average probability of the generated tokens from the output logits and set a rejection threshold. We may add this option to our code in the near future.
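For reference, here is a minimal sketch of that rejection idea, assuming you can extract the decoder logits for the generated (location) tokens during inference. The `should_reject` helper and the 0.5 threshold are hypothetical illustrations, not part of the released code:

```python
import torch

def should_reject(logits: torch.Tensor, token_ids: torch.Tensor,
                  threshold: float = 0.5) -> bool:
    """Reject a grounding result when the average token confidence is low.

    logits:    (seq_len, vocab_size) decoder logits for the generated sequence.
    token_ids: (seq_len,) ids of the generated tokens (e.g. location tokens).
    """
    probs = torch.softmax(logits, dim=-1)                    # per-step distributions
    token_probs = probs.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)  # prob of each emitted token
    avg_prob = token_probs.mean().item()                     # average confidence over the sequence
    return avg_prob < threshold
```

The threshold would have to be tuned on a small validation set containing both answerable and unanswerable queries, since the right value depends on the model and the data.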
Hello, dear Qin. I am wondering how to run inference on the refcoco task. Do I have to write a script using fairseq? I'd appreciate it if you could help me!