[Bug] Owlv2 Zero-Shot Object Detection
System Info
transformers==4.39.3
python==3.10.14
Who can help?
No response
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
from PIL import Image, ImageDraw
import requests
import torch

checkpoint = "google/owlv2-base-patch16-ensemble"
model = AutoModelForZeroShotObjectDetection.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

url = "https://unsplash.com/photos/oj0zeY2Ltk4/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MTR8fHBpY25pY3xlbnwwfHx8fDE2Nzc0OTE1NDk&force=true&w=640"
im = Image.open(requests.get(url, stream=True).raw)

text_queries = ["hat", "book", "sunglasses", "camera"]
inputs = processor(text=text_queries, images=im, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

target_sizes = torch.tensor([im.size[::-1]])
results = processor.post_process_object_detection(outputs, threshold=0.1, target_sizes=target_sizes)[0]

draw = ImageDraw.Draw(im)

scores = results["scores"].tolist()
labels = results["labels"].tolist()
boxes = results["boxes"].tolist()

for box, score, label in zip(boxes, scores, labels):
    xmin, ymin, xmax, ymax = box
    draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1)
    draw.text((xmin, ymin), f"{text_queries[label]}: {round(score, 2)}", fill="white")

im
Expected behavior
The expected behavior is shown in the second official example here: https://huggingface.co/docs/transformers/main/en/tasks/zero_shot_object_detection
However, the final bounding boxes come out shifted. Please refer to the code above (taken from the official example).
Hi,
Thanks for your interest in OWLv2. As shown in my demo notebook, you need to visualize the bounding boxes on the padded image rather than the original image.
This is also shown here: https://huggingface.co/docs/transformers/en/model_doc/owlv2#transformers.Owlv2ForObjectDetection.forward.example
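In short, the idea looks roughly like this (a condensed sketch of the linked example, assuming the processor normalized the padded input with the CLIP mean/std; inputs, outputs and processor come from the reproduction above):

import torch
import numpy as np
from PIL import Image
from transformers.utils.constants import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD

# Undo the CLIP normalization to recover the padded image the model actually saw
pixel_values = inputs.pixel_values.squeeze().numpy()
padded = pixel_values * np.array(OPENAI_CLIP_STD)[:, None, None] + np.array(OPENAI_CLIP_MEAN)[:, None, None]
padded_image = Image.fromarray(np.moveaxis((padded * 255).astype(np.uint8), 0, -1))

# Post-process against the padded image size and draw on padded_image, not on im
target_sizes = torch.tensor([padded_image.size[::-1]])
results = processor.post_process_object_detection(outputs, threshold=0.1, target_sizes=target_sizes)[0]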
Thanks, Niels. I saw your demo notebook and it works fine there. But if I try to reproduce the example linked below as-is, I don't see the expected results. Can you re-confirm? https://huggingface.co/docs/transformers/en/tasks/zero_shot_object_detection#text-prompted-zero-shot-object-detection-by-hand
Did you visualize results on the unnormalized image?
Yes. If you run the example I mentioned as-is, the final bounding boxes are not what is shown in the result image.
Yes, that example only works as-is for OWLv1. Perhaps we could add a disclaimer for OWLv2 that results need to be shown on the preprocessed image. Would you be up for opening a PR for that?
The docs are here: https://github.com/huggingface/transformers/blob/main/docs/source/en/tasks/zero_shot_object_detection.md
@NielsRogge @nisyad-ms I managed to show the preprocessed image with the correct boxes. Below is the full code.
import torch
import requests
import numpy as np
from PIL import Image, ImageDraw
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection
from transformers.utils.constants import OPENAI_CLIP_MEAN, OPENAI_CLIP_STD

checkpoint = "google/owlv2-base-patch16-ensemble"
model = AutoModelForZeroShotObjectDetection.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

url = "https://unsplash.com/photos/oj0zeY2Ltk4/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MTR8fHBpY25pY3xlbnwwfHx8fDE2Nzc0OTE1NDk&force=true&w=640"
im = Image.open(requests.get(url, stream=True).raw)

text_queries = ["hat", "book", "sunglasses", "camera"]
inputs = processor(text=text_queries, images=im, return_tensors="pt")

def get_preprocessed_image(pixel_values):
    # Undo the CLIP normalization so the padded model input can be visualized
    pixel_values = pixel_values.squeeze().numpy()
    unnormalized_image = (pixel_values * np.array(OPENAI_CLIP_STD)[:, None, None]) + np.array(OPENAI_CLIP_MEAN)[:, None, None]
    unnormalized_image = (unnormalized_image * 255).astype(np.uint8)
    unnormalized_image = np.moveaxis(unnormalized_image, 0, -1)
    return Image.fromarray(unnormalized_image)

unnormalized_image = get_preprocessed_image(inputs.pixel_values)

with torch.no_grad():
    outputs = model(**inputs)

# Post-process against the padded (preprocessed) image size, not the original one
target_sizes = torch.tensor([unnormalized_image.size[::-1]])
results = processor.post_process_object_detection(outputs, threshold=0.2, target_sizes=target_sizes)[0]

draw = ImageDraw.Draw(unnormalized_image)

scores = results["scores"].tolist()
labels = results["labels"].tolist()
boxes = results["boxes"].tolist()

for box, score, label in zip(boxes, scores, labels):
    xmin, ymin, xmax, ymax = box
    draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1)
    draw.text((xmin, ymin), f"{text_queries[label]}: {round(score, 2)}", fill="white")

unnormalized_image.show()
Is there an easy way to remove the grey area?
Thanks @jla524 for the example. @NielsRogge also pointed to this.
- +1 for removing the gray area.
Yes, there's an easy way to remove the padding; see https://discuss.huggingface.co/t/owl-v2-bounding-box-misalignment-problem/66181/6?u=nielsr
Thanks @NielsRogge! This worked for me:
import torch
import requests
from PIL import Image, ImageDraw
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

checkpoint = "google/owlv2-base-patch16-ensemble"
model = AutoModelForZeroShotObjectDetection.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

url = "https://unsplash.com/photos/oj0zeY2Ltk4/download?ixid=MnwxMjA3fDB8MXxzZWFyY2h8MTR8fHBpY25pY3xlbnwwfHx8fDE2Nzc0OTE1NDk&force=true&w=640"
image = Image.open(requests.get(url, stream=True).raw)
width, height = image.size

text_queries = ["hat", "book", "sunglasses", "camera"]
inputs = processor(text=text_queries, images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs, threshold=0.2, target_sizes=target_sizes)[0]

draw = ImageDraw.Draw(image)

scores = results["scores"].tolist()
labels = results["labels"].tolist()
boxes = results["boxes"].tolist()

# The boxes are predicted relative to the square, padded input, so stretch the
# coordinates along the shorter image side back onto the original image.
width_ratio = 1
height_ratio = 1
if width < height:
    width_ratio = width / height
elif height < width:
    height_ratio = height / width

for box, score, label in zip(boxes, scores, labels):
    xmin, ymin, xmax, ymax = box
    xmin /= width_ratio
    ymin /= height_ratio
    xmax /= width_ratio
    ymax /= height_ratio
    draw.rectangle((xmin, ymin, xmax, ymax), outline="red", width=1)
    draw.text((xmin, ymin), f"{text_queries[label]}: {round(score, 2)}", fill="white")

image.show()
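For what it's worth, the per-box division above should be equivalent to post-processing against the padded square size directly. Below is a minimal sketch, assuming OWLv2 pads the shorter side up to a square and that post_process_object_detection simply scales the normalized boxes by the height/width given in target_sizes; image, outputs and processor are reused from the snippet above.

# Sketch: reuse image, outputs and processor from the example above.
# The model predicts boxes relative to the square, padded input, so use the
# padded side length for both dimensions when rescaling.
padded_size = max(image.size)  # image.size is (width, height)
target_sizes = torch.tensor([[padded_size, padded_size]])
results = processor.post_process_object_detection(
    outputs, threshold=0.2, target_sizes=target_sizes
)[0]
# results["boxes"] should now line up with the original, unpadded image.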