
feat(detections): ✨ OWLv2 and OWL-ViT inference detection 'from_owl' added


Description

This PR introduces a new method in the Detections class, from_owl, which creates a Detections instance from the inference results of OWLv2 and OWL-ViT models.

The from_owl method takes a list of post-processed OWLv2 or OWL-ViT results and returns a new Detections object. It checks whether the first element of the list contains any bounding box predictions; if there are none, it returns an empty Detections instance. Otherwise, it builds the Detections instance from the bounding boxes, confidence scores, and class labels in the results.
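
A minimal sketch of what this method can look like, based on the description above and the snippet shared later in this thread; it assumes the post-processed results are a list of per-image dicts with "boxes", "scores" and "labels" tensors, and that Detections.empty() is available in core.py:

@classmethod
def from_owl(cls, owl_result) -> "Detections":
    # Post-processed OWLv2 / OWL-ViT results arrive as a list with one
    # dict per image, each holding "boxes", "scores" and "labels" tensors.
    if len(owl_result[0]["boxes"]) == 0:
        # No predictions for the first image: return an empty Detections.
        return cls.empty()

    return cls(
        xyxy=owl_result[0]["boxes"].detach().numpy(),
        confidence=owl_result[0]["scores"].detach().numpy(),
        class_id=owl_result[0]["labels"].detach().numpy().astype(int),
    )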

This feature enhances the functionality of the Detections class by providing a way to directly create a Detections instance from OWLv2 and OWL-ViT inference results, making it easier for users to work with these models.

Changes:

  • Added from_owl method in Detections class in core.py.

This feature is a step forward in expanding the capabilities of our software to work seamlessly with OWLv2 and OWL-ViT models.

List any dependencies that are required for this change.

Type of change

  • [X] New feature (non-breaking change which adds functionality)
  • [X] This change requires a documentation update

How has this change been tested, please provide a testcase or example of how you tested the change?

Docs

  • [X] Docs updated? What were the changes:

Google Colab link for testing

  • https://colab.research.google.com/drive/1RhouO-Et4u_03SU4qURiH5woepEBTJsx?usp=sharing

onuralpszr · Dec 19 '23 09:12

Test Case with OWL-ViT

import requests
from PIL import Image
import torch

import supervision as sv
import numpy as np
import cv2

from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch16")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch16")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a dog"]]
inputs = processor(text=texts, images=image, return_tensors="pt")
outputs = model(**inputs)

# Target image sizes (height, width) to rescale box predictions [batch_size, 2]
target_sizes = torch.Tensor([image.size[::-1]])
# Convert outputs (bounding boxes and class logits) to COCO API
results = processor.post_process_object_detection(outputs=outputs, threshold=0.1, target_sizes=target_sizes)

i = 0  # Retrieve predictions for the first image for the corresponding text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]

# Print detected objects and rescaled box coordinates
for box, score, label in zip(boxes, scores, labels):
    box = [round(i, 2) for i in box.tolist()]
    print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")


detections = sv.Detections.from_owl(results)
box_annotator = sv.BoundingBoxAnnotator()
cv2_image = np.array(image.convert("RGB"))[:, :, ::-1].copy()
img = box_annotator.annotate(cv2_image, detections=detections)
cv2.imwrite("owl-vit-test.jpg", img)

onuralpszr · Dec 19 '23 09:12

Test Case with OWLv2

import requests
from PIL import Image
import torch
from transformers import Owlv2Processor, Owlv2ForObjectDetection

import supervision as sv
import numpy as np
import cv2

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a dog"]]
inputs = processor(text=texts, images=image, return_tensors="pt")
outputs = model(**inputs)

# Target image sizes (height, width) to rescale box predictions [batch_size, 2]
target_sizes = torch.Tensor([image.size[::-1]])
# Convert outputs (bounding boxes and class logits) to Pascal VOC Format (xmin, ymin, xmax, ymax)
results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes, threshold=0.1)

i = 0  # Retrieve predictions for the first image for the corresponding text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]
print(boxes.detach().numpy())
for box, score, label in zip(boxes, scores, labels):
    box = [round(i, 2) for i in box.tolist()]
    print(f"Detected {text[label]} with confidence {round(score.item(), 3)} at location {box}")


detections = sv.Detections.from_owl(results)
box_annotator = sv.BoundingBoxAnnotator()
cv2_image = np.array(image.convert("RGB"))[:, :, ::-1].copy()
img = box_annotator.annotate(cv2_image, detections=detections)
cv2.imwrite("owlv2test.jpg", img)

onuralpszr · Dec 19 '23 10:12

Can you use from_transformers for this? https://supervision.roboflow.com/detection/core/#supervision.detection.core.Detections.from_transformers

capjamesg · Dec 21 '23 15:12

> Can you use from_transformers for this? https://supervision.roboflow.com/detection/core/#supervision.detection.core.Detections.from_transformers

No, it does not work

onuralpszr · Dec 21 '23 18:12

It feels confusing that we have two functions that have almost exactly the same code and are loading from HuggingFace Transformers models. Is there a broader pattern we are missing for loading them? I would expect from_transformers to work with any Transformers model.

from_owl

        return cls(
            xyxy=owl_result[0]["boxes"].detach().numpy(),
            confidence=owl_result[0]["scores"].detach().numpy(),
            class_id=owl_result[0]["labels"].detach().numpy().astype(int),
        )

from_transformers

Source: https://supervision.roboflow.com/detection/core/#supervision.detection.core.Detections.from_transformers

    return cls(
        xyxy=transformers_results["boxes"].cpu().numpy(),
        confidence=transformers_results["scores"].cpu().numpy(),
        class_id=transformers_results["labels"].cpu().numpy().astype(int),
    )

capjamesg · Dec 21 '23 19:12
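
For context, a hedged sketch of the pattern being discussed, reusing the variables from the test scripts above. It assumes the only differences between the two paths are the list wrapper around the per-image dicts and the gradient tracking on the output tensors (without torch.no_grad(), calling .cpu().numpy() on them raises an error). The reply below reports that a direct call did not work, so treat this as an illustration rather than a confirmed fix:

import torch

# Run the forward pass without gradient tracking so the post-processed
# tensors can be converted with .cpu().numpy() inside from_transformers.
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_object_detection(
    outputs=outputs, threshold=0.1, target_sizes=target_sizes
)

# post_process_object_detection returns a list with one dict per image,
# while from_transformers expects a single dict, so index the image first.
detections = sv.Detections.from_transformers(results[0])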

> It feels confusing that we have two functions that have almost exactly the same code and are loading from HuggingFace Transformers models. Is there a broader pattern we are missing for loading them? I would expect from_transformers to work with any Transformers model.

I thought they were the same, but when I tried loading the results that way it did not work, so I added this method. Based on what you said, I will check the transformers side.

onuralpszr · Dec 21 '23 20:12