GroundingDINO
Grounding DINO is now available in 🤗 Transformers!
Hi folks!
Grounding DINO is now available in the Transformers library, enabling easy inference in a few lines of code.
Here's how to use it:
from transformers import AutoProcessor, GroundingDinoForObjectDetection
from PIL import Image
import requests
import torch
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
text = "a cat."
processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-tiny")
model = GroundingDinoForObjectDetection.from_pretrained("IDEA-Research/grounding-dino-tiny")
inputs = processor(images=image, text=text, return_tensors="pt")
outputs = model(**inputs)
# convert outputs (bounding boxes and class logits) to COCO API
target_sizes = torch.tensor([image.size[::-1]])
results = processor.image_processor.post_process_object_detection(
outputs, threshold=0.35, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
box = [round(i, 2) for i in box.tolist()]
print(f"Detected {label.item()} with confidence " f"{round(score.item(), 3)} at location {box}")
Demo notebook: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Grounding%20DINO/Inference_with_Grounding_DINO_for_zero_shot_object_detection.ipynb
Checkpoints are on the hub: https://huggingface.co/models?other=grounding-dino
Relevant for https://github.com/IDEA-Research/GroundingDINO/issues/88
Thanks for your work! I found an issue where the same model produces different results. Take the image '000000039769.jpg' as an example: the results of the official code are significantly better than those of the transformers library.
The results you report are
Detected 1 with confidence 0.45 at location [344.8, 23.2, 637.4, 373.8]
Detected 1 with confidence 0.41 at location [11.9, 51.6, 316.6, 472.9]
My results based on the transformers package.
Code:
from transformers import AutoProcessor, GroundingDinoForObjectDetection
from PIL import Image
import requests
import torch
import matplotlib.pyplot as plt
import matplotlib.patches as patches

image = Image.open('000000039769.jpg')
text = "cat"
device = 'cpu'
processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-tiny")
model = GroundingDinoForObjectDetection.from_pretrained("IDEA-Research/grounding-dino-tiny").to(device)
inputs = processor(images=image, text=text, return_tensors="pt").to(device)
outputs = model(**inputs)

target_sizes = torch.tensor([image.size[::-1]])
results = processor.image_processor.post_process_object_detection(
    outputs, threshold=0.2, target_sizes=target_sizes
)[0]

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(f"Detected {label.item()} with confidence {round(score.item(), 3)} at location {box}")
Results:
Detected 1 with confidence 0.26 at location [40.29, 72.75, 175.84, 117.19]
The results based on the official code and Colab:
import os
CONFIG_PATH = os.path.join(HOME, "GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py")
print(CONFIG_PATH, "; exist:", os.path.isfile(CONFIG_PATH))
WEIGHTS_NAME = "groundingdino_swint_ogc.pth"
WEIGHTS_PATH = os.path.join(HOME, "weights", WEIGHTS_NAME)
print(WEIGHTS_PATH, "; exist:", os.path.isfile(WEIGHTS_PATH))

from groundingdino.util.inference import load_model, load_image, predict, annotate

model = load_model(CONFIG_PATH, WEIGHTS_PATH)

import supervision as sv

IMAGE_NAME = "000000039769.jpg"
IMAGE_PATH = os.path.join(HOME, "data", IMAGE_NAME)

TEXT_PROMPT = "cat"
BOX_TRESHOLD = 0.35
TEXT_TRESHOLD = 0.25

image_source, image = load_image(IMAGE_PATH)

boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption=TEXT_PROMPT,
    box_threshold=BOX_TRESHOLD,
    text_threshold=TEXT_TRESHOLD
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)

%matplotlib inline
sv.plot_image(annotated_frame, (16, 16))
Pinging @eduardopach here
Actually, the output is the same :D, there's only one catch in your example. The original implementation takes your text prompt and appends a "." at the end. So in the transformers example you're passing "cat", while in the original you're effectively passing "cat.", and that causes the difference you're seeing.
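To illustrate, here is the same transformers snippet from above with only the prompt changed to end in a period (same checkpoint and image, threshold 0.35 as in the demo notebook); with that one change it should reproduce the official results:

from transformers import AutoProcessor, GroundingDinoForObjectDetection
from PIL import Image
import torch

image = Image.open("000000039769.jpg")
# Add the trailing "." that the original implementation appends automatically.
text = "cat."

processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-tiny")
model = GroundingDinoForObjectDetection.from_pretrained("IDEA-Research/grounding-dino-tiny")

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

target_sizes = torch.tensor([image.size[::-1]])
results = processor.image_processor.post_process_object_detection(
    outputs, threshold=0.35, target_sizes=target_sizes
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    print(f"Detected {label.item()} with confidence {round(score.item(), 3)} at location {box}")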
Is it possible to train it?
Thank you!
In theory, yes, but it wasn't extensively tested. If you find any problems, open an issue in the transformers repo and tag me there.
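For anyone who wants to experiment, a rough fine-tuning sketch. It assumes the model returns a loss when DETR-style `labels` are passed (a list of dicts per image with "class_labels" and normalized "boxes"), and `train_dataloader` is a hypothetical DataLoader you would write yourself; since training wasn't extensively tested, treat this as a starting point rather than a verified recipe.

import torch
from transformers import AutoProcessor, GroundingDinoForObjectDetection

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-tiny")
model = GroundingDinoForObjectDetection.from_pretrained("IDEA-Research/grounding-dino-tiny").to(device)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# `train_dataloader` is a hypothetical loader yielding (images, texts, targets), where
# `targets` is a list of dicts with "class_labels" (LongTensor) and "boxes"
# (normalized cx, cy, w, h FloatTensor), mirroring other DETR-style models in the library.
for images, texts, targets in train_dataloader:
    inputs = processor(images=images, text=texts, padding=True, return_tensors="pt").to(device)
    labels = [{k: v.to(device) for k, v in t.items()} for t in targets]
    # Assumption: a bipartite-matching loss is computed and returned when `labels` is provided.
    outputs = model(**inputs, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()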