
[Bug] Owlv2 Zero-Shot Object Detection

Open nisyad-ms opened this issue 10 months ago • 9 comments

System Info

transformers==4.39.3

python==3.10.14

Who can help?

No response

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
from PIL import Image, ImageDraw
import requests
import torch

checkpoint = "google/owlv2-base-patch16-ensemble"
model = AutoModelForZeroShotObjectDetection.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

url = "https://unsplash.com/photos/oj0zeY2Ltk4/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MTR8fHBpY25pY3xlbnwwfHx8fDE2Nzc0OTE1NDk&force=true&w=640"
im = Image.open(requests.get(url, stream=True).raw)

text_queries = ["hat", "book", "sunglasses", "camera"]
inputs = processor(text=text_queries, images=im, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    target_sizes = torch.tensor([im.size[::-1]])
    results = processor.post_process_object_detection(outputs, threshold=0.1, target_sizes=target_sizes)[0]

draw = ImageDraw.Draw(im)

scores = results["scores"].tolist()
labels = results["labels"].tolist()
boxes = results["boxes"].tolist()

for box, score, label in zip(boxes, scores, labels):
    xmin, ymin, xmax, ymax = box
    draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1)
    draw.text((xmin, ymin), f"{text_queries[label]}: {round(score,2)}", fill="white")

im

Expected behavior

The expected behavior is as shown in the second official example here: https://huggingface.co/docs/transformers/main/en/tasks/zero_shot_object_detection

However, the final bounding boxes are still shifted; please refer to the code above (taken from the official example).

nisyad-ms avatar Apr 08 '24 22:04 nisyad-ms

Hi,

Thanks for your interest in OWLv2. As shown in my demo notebook, you need to visualize the bounding boxes on the padded image rather than the original image.

This is also shown here: https://huggingface.co/docs/transformers/en/model_doc/owlv2#transformers.Owlv2ForObjectDetection.forward.example
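In other words, a minimal sketch of the idea (just an illustration, not copied from the linked docs; it assumes inputs.pixel_values holds the padded, resized image) is to pass the padded size rather than the original image size as target_sizes:

# illustration: rescale the predicted boxes to the padded image the model actually saw
padded_height, padded_width = inputs.pixel_values.shape[-2:]  # e.g. (960, 960) for this checkpoint

with torch.no_grad():
    outputs = model(**inputs)
    target_sizes = torch.tensor([[padded_height, padded_width]])
    results = processor.post_process_object_detection(outputs, threshold=0.1, target_sizes=target_sizes)[0]

# the boxes in `results` now line up with the padded image, which can be
# reconstructed from pixel_values as shown later in this thread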

NielsRogge avatar Apr 09 '24 06:04 NielsRogge

Thanks, Niels. I saw your demo notebook and it works fine there. But if I try to reproduce the example below as it is, I don't see the expected results. Can you re-confirm? https://huggingface.co/docs/transformers/en/tasks/zero_shot_object_detection#text-prompted-zero-shot-object-detection-by-hand

nisyad-ms avatar Apr 09 '24 16:04 nisyad-ms

Did you visualize results on the unnormalized image?

NielsRogge avatar Apr 10 '24 14:04 NielsRogge

Yes. If you run the example I mentioned as-is, the final bounding boxes are not what is shown in the result image.

nisyad-ms avatar Apr 10 '24 20:04 nisyad-ms

Yes, that example only works as-is for OWLv1. Perhaps we could add a disclaimer for OWLv2 that results need to be shown on the preprocessed image. Would you be up for opening a PR for that?

The docs are here: https://github.com/huggingface/transformers/blob/main/docs/source/en/tasks/zero_shot_object_detection.md

NielsRogge avatar Apr 11 '24 06:04 NielsRogge

@NielsRogge @nisyad-ms I managed to show the preprocessed image with the correct boxes. Below is the full code.

import torch
import requests
import numpy as np
from PIL import Image, ImageDraw
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
from transformers.utils.constants import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD

checkpoint="google/owlv2-base-patch16-ensemble"
model = AutoModelForZeroShotObjectDetection.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

url = "https://unsplash.com/photos/oj0zeY2Ltk4/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MTR8fHBpY25pY3xlbnwwfHx8fDE2Nzc0OTE1NDk&force=true&w=640"
im = Image.open(requests.get(url, stream=True).raw)

text_queries = ["hat", "book", "sunglasses", "camera"]
inputs = processor(text=text_queries, images=im, return_tensors="pt")


def get_preprocessed_image(pixel_values):
    # Undo the CLIP-style normalization applied by the processor so we can
    # visualize the padded, resized image that the model actually saw
    pixel_values = pixel_values.squeeze().numpy()
    unnormalized_image = (pixel_values * np.array(OPENAI_CLIP_STD)[:, None, None]) + np.array(OPENAI_CLIP_MEAN)[:, None, None]
    unnormalized_image = (unnormalized_image * 255).astype(np.uint8)
    unnormalized_image = np.moveaxis(unnormalized_image, 0, -1)  # CHW -> HWC for PIL
    unnormalized_image = Image.fromarray(unnormalized_image)
    return unnormalized_image


unnormalized_image = get_preprocessed_image(inputs.pixel_values)

with torch.no_grad():
    outputs = model(**inputs)
    target_sizes = torch.tensor([unnormalized_image.size[::-1]])
    results = processor.post_process_object_detection(outputs, threshold=0.2, target_sizes=target_sizes)[0]

draw = ImageDraw.Draw(unnormalized_image)

scores = results["scores"].tolist()
labels = results["labels"].tolist()
boxes = results["boxes"].tolist()

for box, score, label in zip(boxes, scores, labels):
    xmin, ymin, xmax, ymax = box
    draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1)
    draw.text((xmin, ymin), f"{text_queries[label]}: {round(score,2)}", fill="white")

unnormalized_image.show()

Is there an easy way to remove the grey area?

[image: detection results drawn on the padded image, grey padding visible]

jla524 avatar May 04 '24 08:05 jla524

Thanks @jla524 for the example. @NielsRogge also pointed to this.

  • +1 for removing the gray area.

nisyad-ms avatar May 06 '24 15:05 nisyad-ms

Yes, there's an easy way to remove the padding; see https://discuss.huggingface.co/t/owl-v2-bounding-box-misalignment-problem/66181/6?u=nielsr

NielsRogge avatar May 06 '24 21:05 NielsRogge

Thanks @NielsRogge! This worked for me:

import torch
import requests
from PIL import Image, ImageDraw
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

checkpoint="google/owlv2-base-patch16-ensemble"
model = AutoModelForZeroShotObjectDetection.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

url = "https://unsplash.com/photos/oj0zeY2Ltk4/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MTR8fHBpY25pY3xlbnwwfHx8fDE2Nzc0OTE1NDk&force=true&w=640"
image = Image.open(requests.get(url, stream=True).raw)
width, height = image.size

text_queries = ["hat", "book", "sunglasses", "camera"]
inputs = processor(text=text_queries, images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])
    results = processor.post_process_object_detection(outputs, threshold=0.2, target_sizes=target_sizes)[0]

draw = ImageDraw.Draw(image)

scores = results["scores"].tolist()
labels = results["labels"].tolist()
boxes = results["boxes"].tolist()

# OWLv2's processor pads the shorter side of the image to make it square, so the
# model's normalized boxes are relative to that square. Post-processing with the
# original (unpadded) size compresses the coordinates along the shorter dimension;
# dividing by the short/long ratio below maps them back to original-image pixels.
width_ratio = 1
height_ratio = 1

if width < height:
    width_ratio = width / height
elif height < width:
    height_ratio = height / width

for box, score, label in zip(boxes, scores, labels):
    xmin, ymin, xmax, ymax = box
    xmin /= width_ratio
    ymin /= height_ratio
    xmax /= width_ratio
    ymax /= height_ratio
    draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1)
    draw.text((xmin, ymin), f"{text_queries[label]}: {round(score,2)}", fill="white")

image.show()

[image: detection results drawn on the original image]

jla524 avatar May 07 '24 05:05 jla524