Inference on visual grounding task
Hello, I tried your Hugging Face version, and my code follows your toy example:
>>> from PIL import Image
>>> from torchvision import transforms
>>> from transformers import OFATokenizer, OFAModel
>>> from generate import sequence_generator  # from your repo; unused here since I call model.generate below
>>> mean, std = [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]
>>> resolution = 480
>>> patch_resize_transform = transforms.Compose([
...     lambda image: image.convert("RGB"),
...     transforms.Resize((resolution, resolution), interpolation=Image.BICUBIC),
...     transforms.ToTensor(),
...     transforms.Normalize(mean=mean, std=std)
... ])
>>> tokenizer = OFATokenizer.from_pretrained(ckpt_dir)
>>> txt = "which region does the text "{refexp}" describe?"
>>> inputs = tokenizer([txt], return_tensors="pt").input_ids
>>> img = Image.open(path_to_image)
>>> patch_img = patch_resize_transform(img).unsqueeze(0)
>>> # using the generator of the Hugging Face version
>>> model = OFAModel.from_pretrained(ckpt_dir, use_cache=False)
>>> gen = model.generate(inputs, patch_images=patch_img, num_beams=5, no_repeat_ngram_size=3)
>>> print(tokenizer.batch_decode(gen, skip_special_tokens=True))
I then used the bin2coord method from your repo to convert the output tokens to real coordinates, but the results are unreasonable and the accuracy is pretty low. Is there anything I have missed to reproduce the strong grounding performance of OFA? Could it be the resolution? BTW, I am using OFA-Huge on a single image from RefCOCO, with the text prompt composed in the visual grounding format.
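For completeness, my post-processing is essentially the following (a minimal sketch, not the repo's exact bin2coord; num_bins=1000 is my assumption about the default and should be checked against the repo):

import re

def bins_to_coords(decoded, w, h, num_bins=1000, resolution=480):
    # Extract bin indices from tokens like "<bin_123>" in the decoded string.
    bins = [int(b) for b in re.findall(r"<bin_(\d+)>", decoded)]
    # Each bin indexes a uniform grid over the resized input image...
    coords = [b / (num_bins - 1) * resolution for b in bins]
    # ...so rescale x's by w/resolution and y's by h/resolution to get back
    # to the original image size.
    return [c * (w if i % 2 == 0 else h) / resolution
            for i, c in enumerate(coords)]  # [x0, y0, x1, y1]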
Seemingly there are still some problems. I guess I'll provide Colab notebooks for you to check it out, perhaps after a week or so...
I have noticed that OFA is fine-tuned at a resolution of 512, so I modified patch_resize_transform accordingly and tried again, but it still didn't work. By the way, everything other than visual grounding worked as expected (captioning and VQA), so the problem seems specific to the grounding task. The modified transform is shown below.
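Concretely, the only change was the resolution; the transform is otherwise the same as in the snippet above:

resolution = 512  # was 480; match the fine-tuning resolution
patch_resize_transform = transforms.Compose([
    lambda image: image.convert("RGB"),
    transforms.Resize((resolution, resolution), interpolation=Image.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize(mean=mean, std=std)
])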
And it is unlikely to be a post-processing problem: the raw output tokens of the form <bin_x> are already unreasonable.
Wait... I don't think I have open-sourced the huge checkpoint fine-tuned for RefCOCO, so which one are you using? The pretrained one?
I just followed the instructions and used the pre-trained weights at https://huggingface.co/OFA-Sys/OFA-huge.
@JustinLin610 I wanted to clarify: is it possible to use just a pretrained model (OFA-Huge) for visual grounding on my own data, or do I have to fine-tune the model first anyway?
I also saw this Colab notebook, where you use OFA-Large for several tasks, but I couldn't tell whether it is a special model version or just the pretrained checkpoint.