OFA
Does OFA support multi-object visual grounding?
Currently, I find that all hypos point to the same car in the Generic Interface Colab. If I ask which region does the text " a car " describe? with the following image, is it possible to output the positions of all three cars?

Testing code:
# Requires the model, task, generator and helper functions
# (construct_sample, decode_fn, bin2coord, apply_half, use_cuda, use_fp16)
# defined in the earlier cells of the Generic Interface Colab.
import cv2
import numpy
import torch
from PIL import Image
from fairseq import utils  # provides move_to_cuda / apply_to_sample
from google.colab.patches import cv2_imshow

# download image
! wget https://ofa-beijing.oss-cn-beijing.aliyuncs.com/datasets/show_case/test_grounded_qa.jpeg -O test.jpeg

# construct instruction
image = Image.open('./test.jpeg')
instruction = 'which region does the text " a car " describe?'

# construct input sample & preprocess for GPU if cuda available
sample = construct_sample(image, instruction)
sample = utils.move_to_cuda(sample) if use_cuda else sample
sample = utils.apply_to_sample(apply_half, sample) if use_fp16 else sample

# generate result
with torch.no_grad():
    hypos = task.inference_step(generator, models, sample)

# draw every returned hypothesis as a bounding box on the original image
w_resize_ratio = task.cfg.patch_image_size / image.width
h_resize_ratio = task.cfg.patch_image_size / image.height
img = cv2.cvtColor(numpy.asarray(image), cv2.COLOR_RGB2BGR)
for hypo in hypos[0]:
    tokens, bins, imgs = decode_fn(hypo["tokens"], task.tgt_dict, task.bpe, generator)
    coord_list = bin2coord(bins, w_resize_ratio, h_resize_ratio)
    cv2.rectangle(
        img,
        (int(coord_list[0]), int(coord_list[1])),
        (int(coord_list[2]), int(coord_list[3])),
        (0, 255, 0),
        3
    )
cv2_imshow(img)
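If distinct regions ever do come back across the beam hypotheses, a simple post-processing step would be to keep only boxes that don't heavily overlap. Below is a minimal sketch, not part of OFA: the boxes argument is assumed to be the list of decoded coord_list values from the loop above (ordered by hypothesis score), and the iou helper and the 0.5 threshold are placeholders I picked.

def iou(a, b):
    # a, b are [x1, y1, x2, y2] boxes in pixel coordinates
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def keep_distinct(boxes, threshold=0.5):
    # greedy NMS-style filter: keep a box only if it does not overlap
    # too much with any box already kept
    kept = []
    for box in boxes:
        if all(iou(box, k) < threshold for k in kept):
            kept.append(box)
    return kept

Of course, if every hypothesis points at the same car, this filter cannot recover the other two, which is why I'm asking whether multi-object grounding is supported at all.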
This is a next step for us, as the visual grounding data only contains one target bounding box per sample. I think what you want is more like open-vocabulary object detection, and we'll later try to figure out how to construct such data to reach this objective.
Sounds good, I will also take a look :)