feat(detections): ✨ OWLv2 and OWL-ViT inference detection 'from_owl' added
Description
This commit introduces a new method, from_owl, on the Detections class. It creates a Detections instance from the inference results of the OWLv2 and OWL-ViT models.
The from_owl method takes a list of post-processed results from OWLv2 or OWL-ViT inference and returns a new Detections object. It checks whether the first element of the input list contains any bounding box predictions; if not, it returns an empty Detections instance. Otherwise, it builds the Detections instance from the bounding boxes, confidence scores, and class labels in the results.
This addition lets users create a Detections instance directly from OWLv2 and OWL-ViT inference results, making it easier to work with these models.
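A minimal sketch of the method as described above, assuming the empty-result branch uses Detections.empty() (the actual implementation lives in core.py and may differ):

@classmethod
def from_owl(cls, owl_result: list) -> Detections:
    # No bounding boxes predicted for the first image: return an empty instance.
    if len(owl_result[0]["boxes"]) == 0:
        return cls.empty()
    # Otherwise build Detections from the boxes, scores, and labels.
    return cls(
        xyxy=owl_result[0]["boxes"].detach().numpy(),
        confidence=owl_result[0]["scores"].detach().numpy(),
        class_id=owl_result[0]["labels"].detach().numpy().astype(int),
    )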
Changes:
- Added from_owl method in the Detections class in core.py.
This feature is a step forward in expanding the capabilities of our software to work seamlessly with OWLv2 and OWL-ViT models.
Type of change
- [X] New feature (non-breaking change which adds functionality)
- [X] This change requires a documentation update
How has this change been tested? Please provide a test case or example of how you tested the change.
Docs
- [X] Docs updated? What were the changes:
Google Colab link for test
- https://colab.research.google.com/drive/1RhouO-Et4u_03SU4qURiH5woepEBTJsx?usp=sharing
Test Case with OWL-ViT
import requests
from PIL import Image
import torch
import supervision as sv
import numpy as np
import cv2
from transformers import OwlViTProcessor, OwlViTForObjectDetection
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch16")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch16")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a dog"]]
inputs = processor(text=texts, images=image, return_tensors="pt")
outputs = model(**inputs)
# Target image sizes (height, width) to rescale box predictions [batch_size, 2]
target_sizes = torch.Tensor([image.size[::-1]])
# Convert outputs (bounding boxes and class logits) to COCO API
results = processor.post_process_object_detection(outputs=outputs, threshold=0.1, target_sizes=target_sizes)
i = 0 # Retrieve predictions for the first image for the corresponding text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]
# Print detected objects and rescaled box coordinates
for box, score, label in zip(boxes, scores, labels):
box = [round(coord, 2) for coord in box.tolist()]
print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
detections = sv.Detections.from_owl(results)
box_annotator = sv.BoundingBoxAnnotator()
cv2_image = np.array(image.convert("RGB"))[:, :, ::-1].copy()
img = box_annotator.annotate(cv2_image, detections=detections)
cv2.imwrite("owl-vit-test.jpg", img)
Test Case with OWLv2
import requests
from PIL import Image
import torch
from transformers import Owlv2Processor, Owlv2ForObjectDetection
import supervision as sv
import numpy as np
import cv2
processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a dog"]]
inputs = processor(text=texts, images=image, return_tensors="pt")
outputs = model(**inputs)
# Target image sizes (height, width) to rescale box predictions [batch_size, 2]
target_sizes = torch.Tensor([image.size[::-1]])
# Convert outputs (bounding boxes and class logits) to Pascal VOC Format (xmin, ymin, xmax, ymax)
results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes, threshold=0.1)
i = 0 # Retrieve predictions for the first image for the corresponding text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]
print(boxes.detach().numpy())
for box, score, label in zip(boxes, scores, labels):
box = [round(coord, 2) for coord in box.tolist()]
print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
detections = sv.Detections.from_owl(results)
box_annotator = sv.BoundingBoxAnnotator()
cv2_image = np.array(image.convert("RGB"))[:, :, ::-1].copy()
img = box_annotator.annotate(cv2_image, detections=detections)
cv2.imwrite("owlv2test.jpg", img)
Can you use from_transformers for this? https://supervision.roboflow.com/detection/core/#supervision.detection.core.Detections.from_transformers
No, it does not work.
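For what it's worth, a hedged guess at why the direct call fails, based on the snippets compared below: post_process_object_detection returns a list of per-image dicts, while from_transformers indexes string keys directly; and from_transformers calls .cpu().numpy(), which raises a RuntimeError on tensors that still require grad (the test cases above run the model without torch.no_grad()), whereas from_owl calls .detach() first. Under those assumptions, something like this might work with from_transformers as-is:

with torch.no_grad():  # post-processed tensors then no longer require grad
    outputs = model(**inputs)
results = processor.post_process_object_detection(
    outputs=outputs, target_sizes=target_sizes, threshold=0.1
)
# Pass the per-image dict, not the whole list:
detections = sv.Detections.from_transformers(results[0])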
It feels confusing that we have two functions that have almost exactly the same code and are loading from HuggingFace Transformers models. Is there a broader pattern we are missing for loading them? I would expect from_transformers to work with any Transformers model.
from_owl
return cls(
xyxy=owl_result[0]["boxes"].detach().numpy(),
confidence=owl_result[0]["scores"].detach().numpy(),
class_id=owl_result[0]["labels"].detach().numpy().astype(int),
)
from_transformers
Source: https://supervision.roboflow.com/detection/core/#supervision.detection.core.Detections.from_transformers
return cls(
xyxy=transformers_results["boxes"].cpu().numpy(),
confidence=transformers_results["scores"].cpu().numpy(),
class_id=transformers_results["labels"].cpu().numpy().astype(int),
)
I thought they were the same, but when I tried from_transformers it did not work, so I created from_owl. Based on what you said, I will check the transformers side.