feat(detections): ✨ OWLv2 and OWL-ViT inference detection 'from_owl' added
Description
This commit introduces a new method, from_owl, on the Detections class. It creates a Detections instance from the inference results of the OWLv2 and OWL-ViT models.
The from_owl method takes a list of post-processed results from OWLv2 or OWL-ViT inference and returns a new Detections object. It checks whether the first element of the input list contains any bounding box predictions; if not, it returns an empty Detections instance. Otherwise, it builds the Detections instance from the bounding boxes, confidence scores, and class labels in the results.
This addition lets users create a Detections instance directly from OWLv2 and OWL-ViT inference results, making it easier to work with these models.
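A minimal sketch of the method as described above, assuming the empty-result branch uses Detections.empty() (the actual implementation lives in core.py and may differ):

@classmethod
def from_owl(cls, owl_result: list) -> Detections:
    # No bounding boxes predicted for the first image: return an empty instance.
    if len(owl_result[0]["boxes"]) == 0:
        return cls.empty()
    # Otherwise build Detections from the boxes, scores, and labels.
    return cls(
        xyxy=owl_result[0]["boxes"].detach().numpy(),
        confidence=owl_result[0]["scores"].detach().numpy(),
        class_id=owl_result[0]["labels"].detach().numpy().astype(int),
    )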
Changes:
- Added from_owl method in the Detections class in core.py.
This feature is a step forward in expanding the capabilities of our software to work seamlessly with OWLv2 and OWL-ViT models.
Type of change
- [X] New feature (non-breaking change which adds functionality)
- [X] This change requires a documentation update
How has this change been tested? Please provide a test case or example of how you tested the change.
Docs
- [X] Docs updated? What were the changes:
Google Colab link for test
- https://colab.research.google.com/drive/1RhouO-Et4u_03SU4qURiH5woepEBTJsx?usp=sharing
Test Case with OWL-ViT
import requests
from PIL import Image
import torch
import supervision as sv
import numpy as np
import cv2
from transformers import OwlViTProcessor, OwlViTForObjectDetection
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch16")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch16")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a dog"]]
inputs = processor(text=texts, images=image, return_tensors="pt")
outputs = model(**inputs)
# Target image sizes (height, width) to rescale box predictions [batch_size, 2]
target_sizes = torch.Tensor([image.size[::-1]])
# Convert outputs (bounding boxes and class logits) to COCO API
results = processor.post_process_object_detection(outputs=outputs, threshold=0.1, target_sizes=target_sizes)
i = 0 # Retrieve predictions for the first image for the corresponding text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]
# Print detected objects and rescaled box coordinates
for box, score, label in zip(boxes, scores, labels):
box = [round(coord, 2) for coord in box.tolist()]
print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
detections = sv.Detections.from_owl(results)
box_annotator = sv.BoundingBoxAnnotator()
cv2_image = np.array(image.convert("RGB"))[:, :, ::-1].copy()
img = box_annotator.annotate(cv2_image, detections=detections)
cv2.imwrite("owl-vit-test.jpg", img)
Test Case with OWLv2
import requests
from PIL import Image
import torch
from transformers import Owlv2Processor, Owlv2ForObjectDetection
import supervision as sv
import numpy as np
import cv2
processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a dog"]]
inputs = processor(text=texts, images=image, return_tensors="pt")
outputs = model(**inputs)
# Target image sizes (height, width) to rescale box predictions [batch_size, 2]
target_sizes = torch.Tensor([image.size[::-1]])
# Convert outputs (bounding boxes and class logits) to Pascal VOC Format (xmin, ymin, xmax, ymax)
results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes, threshold=0.1)
i = 0 # Retrieve predictions for the first image for the corresponding text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]
print(boxes.detach().numpy())
for box, score, label in zip(boxes, scores, labels):
box = [round(coord, 2) for coord in box.tolist()]
print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")
detections = sv.Detections.from_owl(results)
box_annotator = sv.BoundingBoxAnnotator()
cv2_image = np.array(image.convert("RGB"))[:, :, ::-1].copy()
img = box_annotator.annotate(cv2_image, detections=detections)
cv2.imwrite("owlv2test.jpg", img)
Can you use from_transformers for this? https://supervision.roboflow.com/detection/core/#supervision.detection.core.Detections.from_transformers
No, it does not work.
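For what it's worth, a hedged guess at why the direct call fails, based on the snippets compared below: post_process_object_detection returns a list of per-image dicts, while from_transformers indexes string keys directly; and from_transformers calls .cpu().numpy(), which raises a RuntimeError on tensors that still require grad (the test cases above run the model without torch.no_grad()), whereas from_owl calls .detach() first. Under those assumptions, something like this might work with from_transformers as-is:

with torch.no_grad():  # post-processed tensors then no longer require grad
    outputs = model(**inputs)
results = processor.post_process_object_detection(
    outputs=outputs, target_sizes=target_sizes, threshold=0.1
)
# Pass the per-image dict, not the whole list:
detections = sv.Detections.from_transformers(results[0])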
It feels confusing that we have two functions that have almost exactly the same code and are loading from HuggingFace Transformers models. Is there a broader pattern we are missing for loading them? I would expect from_transformers to work with any Transformers model.
from_owl
return cls(
xyxy=owl_result[0]["boxes"].detach().numpy(),
confidence=owl_result[0]["scores"].detach().numpy(),
class_id=owl_result[0]["labels"].detach().numpy().astype(int),
)
from_transformers
Source: https://supervision.roboflow.com/detection/core/#supervision.detection.core.Detections.from_transformers
return cls(
xyxy=transformers_results["boxes"].cpu().numpy(),
confidence=transformers_results["scores"].cpu().numpy(),
class_id=transformers_results["labels"].cpu().numpy().astype(int),
)
I thought they were the same, but when I tried from_transformers it did not work, so I created from_owl. Based on what you said, I will check the transformers side.