OFA
Does OFA support multi-object visual grounding?
Currently, I find that all hypos point to the same car in the Generic Interface Colab. If I ask which region does the text " a car " describe? with the following image, is it possible to output the positions of all three cars?

Testing code:
# Requires the model, task, generator and helper functions
# (construct_sample, decode_fn, bin2coord, apply_half, use_cuda, use_fp16)
# defined in the earlier cells of the Generic Interface Colab.
import cv2
import numpy
import torch
from PIL import Image
from fairseq import utils  # provides move_to_cuda / apply_to_sample
from google.colab.patches import cv2_imshow

# download image
! wget https://ofa-beijing.oss-cn-beijing.aliyuncs.com/datasets/show_case/test_grounded_qa.jpeg -O test.jpeg

# construct instruction
image = Image.open('./test.jpeg')
instruction = 'which region does the text " a car " describe?'

# construct input sample & preprocess for GPU if cuda available
sample = construct_sample(image, instruction)
sample = utils.move_to_cuda(sample) if use_cuda else sample
sample = utils.apply_to_sample(apply_half, sample) if use_fp16 else sample

# generate result
with torch.no_grad():
    hypos = task.inference_step(generator, models, sample)

# draw every returned hypothesis as a bounding box on the original image
w_resize_ratio = task.cfg.patch_image_size / image.width
h_resize_ratio = task.cfg.patch_image_size / image.height
img = cv2.cvtColor(numpy.asarray(image), cv2.COLOR_RGB2BGR)
for hypo in hypos[0]:
    tokens, bins, imgs = decode_fn(hypo["tokens"], task.tgt_dict, task.bpe, generator)
    coord_list = bin2coord(bins, w_resize_ratio, h_resize_ratio)
    cv2.rectangle(
        img,
        (int(coord_list[0]), int(coord_list[1])),
        (int(coord_list[2]), int(coord_list[3])),
        (0, 255, 0),
        3
    )
cv2_imshow(img)
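If distinct regions ever do come back across the beam hypotheses, a simple post-processing step would be to keep only boxes that don't heavily overlap. Below is a minimal sketch, not part of OFA: the boxes argument is assumed to be the list of decoded coord_list values from the loop above (ordered by hypothesis score), and the iou helper and the 0.5 threshold are placeholders I picked.

def iou(a, b):
    # a, b are [x1, y1, x2, y2] boxes in pixel coordinates
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def keep_distinct(boxes, threshold=0.5):
    # greedy NMS-style filter: keep a box only if it does not overlap
    # too much with any box already kept
    kept = []
    for box in boxes:
        if all(iou(box, k) < threshold for k in kept):
            kept.append(box)
    return kept

Of course, if every hypothesis points at the same car, this filter cannot recover the other two, which is why I'm asking whether multi-object grounding is supported at all.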
This is a next step for us, as the visual grounding data only contains one target bounding box per sample. I think what you want is more like open-vocabulary object detection, and we'll later try to figure out how to construct such data to reach this objective.
Sounds good, I will also take a look :)