
OwlViT gives different results compared to the original Colab version

darwinharianto opened this issue 2 years ago · 17 comments

System Info

Using huggingface space and google colab

Who can help?

@adirik

Information

  • [x] The official example scripts
  • [x] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Cat picture from http://images.cocodataset.org/val2017/000000039769.jpg; remote control image from https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSRUGcH7a3DO5Iz1sknxU5oauEq9T_q4hyU3nuTFHiO0NMSg37x

Expected behavior

Excited by the results of OWL-ViT, I tried feeding it some random images to see the results. Having no experience with JAX, my first option was to look for a Hugging Face Space.

Given a query image of a remote control and the cat picture, I wanted to get boxes around the remote controls: https://huggingface.co/spaces/adirik/image-guided-owlvit (Screenshot 2023-01-20 at 14 13 13). The result is not really what I expected (no box on the remotes).

Then I checked the results of the Colab version to see if it behaves the same way: https://colab.research.google.com/github/google-research/scenic/blob/main/scenic/projects/owl_vit/notebooks/OWL_ViT_inference_playground.ipynb#scrollTo=AQGAM16fReow (Screenshot 2023-01-20 at 14 14 02). It correctly draws boxes on the remotes.

I am not sure what is happening. Which part should I look at to determine what causes this difference?

darwinharianto avatar Jan 20 '23 05:01 darwinharianto

Yes, we had a hard time making the Space output the same bounding boxes as in Colab (eventually it worked on the cats image). It had to do with the Pillow version.

So I'm guessing there might be a difference in Pillow versions here as well.
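If it helps narrow it down, a quick sanity check is to print the Pillow version and the resampling setting of the image processor in both environments (a minimal sketch, assuming a transformers version that ships OwlViTImageProcessor):

import PIL
from transformers import OwlViTImageProcessor

# Compare these two values between the Space and the Colab environment; a different
# Pillow version or resampling filter changes the resized pixel values slightly,
# which can move low-confidence boxes above or below the detection threshold.
print("Pillow:", PIL.__version__)
image_processor = OwlViTImageProcessor.from_pretrained("google/owlvit-base-patch32")
print("resample:", image_processor.resample)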

Cc @alaradirik

NielsRogge avatar Jan 20 '23 18:01 NielsRogge

Do you mean Pillow changes the input values? I tried another image (Screenshot 2023-01-23 at 9 41 05). The Space model can't detect the cat inside this image, but the Colab version can (Screenshot 2023-01-23 at 9 42 07).

darwinharianto avatar Jan 23 '23 00:01 darwinharianto

@darwinharianto thanks for bringing the issue up, I'm looking into it!

alaradirik avatar Jan 24 '23 08:01 alaradirik

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Feb 19 '23 15:02 github-actions[bot]

Kindly bumping

darwinharianto avatar Feb 21 '23 06:02 darwinharianto

Kind reminder

MaslikovEgor avatar Mar 15 '23 09:03 MaslikovEgor

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Apr 08 '23 15:04 github-actions[bot]

cc @alaradirik and @amyeroberts

sgugger avatar Apr 10 '23 13:04 sgugger

I got the same issue. These are the results from the original repo.

image

And this is the Hugging Face demo.

# model, processor, device, img (a NumPy image) and text_queries are assumed to be
# set up earlier in the demo code
text_queries = text_queries.split(",")
target_sizes = torch.Tensor([img.shape[:2]])
inputs = processor(text=text_queries, images=img, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model(**inputs)

outputs.logits = outputs.logits.cpu()
outputs.pred_boxes = outputs.pred_boxes.cpu()
results = processor.post_process(outputs=outputs, target_sizes=target_sizes)
image

The rocket bounding box score is different. (0.15 vs more than 0.21)

With lvis-api, the performance is not reproduced. (mAP = 0.095)

RRoundTable avatar Apr 22 '23 09:04 RRoundTable

It seems the problem still exists. I mentioned the problem here:

https://github.com/huggingface/transformers/pull/23157#issuecomment-1540056705

Maybe the best way is to cover the model predictions with end-to-end tests on a batch of images. This approach would help us be sure about changes.

MaslikovEgor avatar May 09 '23 12:05 MaslikovEgor

@MaslikovEgor I agree with you. I have an end-to-end test with lvis-api (for both the Hugging Face OWL-ViT and the google/scenic OWL-ViT), but the Hugging Face OWL-ViT results are not reproduced (mAP = 0.095).

RRoundTable avatar May 09 '23 14:05 RRoundTable

I want to fix this problem, but it would be more efficient if I knew where to start. Can you give me a suggestion? @alaradirik

RRoundTable avatar May 09 '23 14:05 RRoundTable

Hi @MaslikovEgor,

The demo didn't work before this fix either (see https://github.com/huggingface/transformers/pull/20136). Try running COCO evaluation with image conditioning before/after this fix: the AP increases from 6 to 37. This is still below the expected 44, but closer to the reported/expected performance. I am still trying to figure out why. Best, Orr

orrzohar avatar May 09 '23 14:05 orrzohar

@RRoundTable, the issues you are reporting seem to have to do with the text-conditioned evaluation. This means that the issues probably stem from the forward pass/post-processing.

In your LVIS eval, did you make sure to implement a new post-processor that incorporates all the changes needed for eval? If helpful, I can add my function to 'processor' or something; please note there are a few changes compared with normal inference.
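For illustration only (this is not the exact function I mean, and the names are hypothetical), an eval-style post-processor typically keeps a fixed number of top-scoring boxes per image instead of applying a score threshold, roughly like:

import torch

def post_process_for_eval(outputs, target_sizes, top_k=300):
    # outputs.logits: (batch, num_patches, num_text_queries)
    # outputs.pred_boxes: (batch, num_patches, 4), normalized cxcywh
    scores, labels = torch.sigmoid(outputs.logits).max(-1)
    results = []
    for score, label, box, size in zip(scores, labels, outputs.pred_boxes, target_sizes):
        img_h, img_w = float(size[0]), float(size[1])
        keep = score.topk(min(top_k, score.numel())).indices
        # convert cxcywh -> xyxy and rescale to the original image size
        cx, cy, w, h = box[keep].unbind(-1)
        xyxy = torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)
        xyxy = xyxy * torch.tensor([img_w, img_h, img_w, img_h], device=xyxy.device)
        results.append({"scores": score[keep], "labels": label[keep], "boxes": xyxy})
    return results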

orrzohar avatar May 09 '23 15:05 orrzohar

@orrzohar, Yes. I tested with text-conditioned evaluation.

In my LVIS eval, I just used huggingface's postprocessor and preprocessor. It would be helpful if you contribute some functions.

# transformers[torch] == 4.28.1
# example script
import requests
from PIL import Image
import torch
import glob
import os
import argparse
import json
from tqdm import tqdm

from transformers import OwlViTProcessor, OwlViTForObjectDetection

parser = argparse.ArgumentParser()
parser.add_argument("--dataset-path", type=str, required=True)
parser.add_argument("--text-query-path", type=str required=True)
parser.add_argument("--save-path", default="owl-vit-result.json", type=str)
parser.add_argument("--batch-size", default=64, type=int)
args = parser.parse_args()

model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
model.to(device)


with open(args.text_query_path, "r") as f:
    text_query = f.read()

images = glob.glob(os.path.join(args.dataset_path, "*"))
image_ids = [img_path.split("/")[-1].split(".")[0] for img_path in images]

instances = []
N = len(images)

with torch.no_grad():
    for i in tqdm(range(N // args.batch_size + 1)):
        image_ids = []
        batch_images = []
        target_sizes = []
        for img_path in images[i * args.batch_size: (i+1) * args.batch_size]:
            image_ids.append(int(img_path.split("/")[-1].split(".")[0]))
            image = Image.open(img_path).convert("RGB")
            batch_images.append(image)
            target_sizes.append((image.size[1], image.size[0]))
        target_sizes = torch.Tensor(target_sizes)
        target_sizes = target_sizes.to(device)
        texts = [text_query.split(",")] * len(batch_images)
        inputs = processor(text=texts, images=batch_images, return_tensors="pt")
        inputs = inputs.to(device)
        outputs = model(**inputs)
        # Target image sizes (height, width) to rescale box predictions [batch_size, 2]

        # Convert outputs (bounding boxes and class logits) to COCO API
        results = processor.post_process(outputs=outputs, target_sizes=target_sizes)
        for image_id, res in zip(image_ids, results):
            for bbox, score, label in zip(res["boxes"], res["scores"], res["labels"]):
                # tensor to numpy
                bbox = bbox.cpu().detach().numpy()
                score = score.cpu().detach().numpy()
                label = label.cpu().detach().numpy()
                # bbox format: xyxy -> xywh
                x1, y1, x2, y2 = bbox
                bbox = [int(x1), int(y1), int(x2-x1), int(y2-y1)]
                instance = {}
                instance["image_id"] = image_id
                instance["bbox"] = bbox # TODO
                instance["score"] = float(score)
                instance["category_id"] = int(label) + 1 # TODO
                instances.append(instance)
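(Side note on the script above: it collects instances but never writes them to --save-path. A minimal ending, assuming the standard lvis-api results format, could be:)

# Not part of the original script: dump the detections so lvis-api can load them
with open(args.save_path, "w") as f:
    json.dump(instances, f)

# Evaluation would then look roughly like:
# from lvis import LVISEval
# lvis_eval = LVISEval("lvis_v1_val.json", args.save_path, "bbox")
# lvis_eval.run()
# lvis_eval.print_results()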

RRoundTable avatar May 09 '23 23:05 RRoundTable

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Jun 03 '23 15:06 github-actions[bot]

Hi @RRoundTable ,

I added a PR with the appropriate evaluation protocol

https://github.com/huggingface/transformers/pull/23982

Best, Orr

orrzohar avatar Jun 04 '23 03:06 orrzohar

Hi! @alaradirik, I'm using transformers==4.30.2 but still encounter the same issue. Any thoughts on this?

Query image: image

Result from colab: image

Result from huggingface: image

haizadtarik avatar Jul 19 '23 03:07 haizadtarik

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Sep 10 '23 08:09 github-actions[bot]

cc @rafaelpadilla

amyeroberts avatar Sep 13 '23 18:09 amyeroberts

Hi folks, I've investigated the difference; it will be solved in the PR below. TL;DR: image preprocessing is done differently in the original Colab (it involves padding the image to a square), whereas the HF implementation used center cropping. The model itself is fine; the logits are exactly the same as in the original implementation on the same inputs.
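In the meantime, if you want to approximate the original preprocessing manually, a rough sketch of pad-to-square before resizing (the grey fill value here is an assumption, not taken from the original code) looks like:

from PIL import Image

def pad_to_square(image, fill=(128, 128, 128)):
    # Pad on the bottom/right so the image becomes square, keeping the original
    # content in the top-left corner; the model's 768x768 resize then happens on that.
    side = max(image.size)
    padded = Image.new("RGB", (side, side), fill)
    padded.paste(image, (0, 0))
    return padded

image = Image.open("cats.jpg").convert("RGB")
square = pad_to_square(image).resize((768, 768), Image.BICUBIC)

Note that boxes predicted on the padded image then have to be mapped back to the original (unpadded) image coordinates.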

NielsRogge avatar Sep 25 '23 08:09 NielsRogge

Hi folks, now that OWLv2 has been added in #26668, you will see that the results match one-to-one with the original Google Colab notebook provided by the authors.

If you also want to get one-to-one matching results for OWLv1, then you will need to use Owlv2Processor (which internally uses Owlv2ImageProcessor) instead of OwlViTProcessor, as it uses the exact same image preprocessing settings as the Colab notebook. We cannot change this for v1 due to backwards compatibility.
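A minimal sketch of that combination (following the pattern used elsewhere in this thread; the size override is my assumption, to keep the 768x768 input OWLv1 expects):

import torch
import requests
from PIL import Image
from transformers import Owlv2Processor, OwlViTForObjectDetection

# OWLv2-style preprocessing (pad to square) combined with the OWLv1 model
processor = Owlv2Processor.from_pretrained("google/owlvit-base-patch32", size={"height": 768, "width": 768})
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a remote control"]]
inputs = processor(text=texts, images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# The image was padded to a square, so boxes come back relative to the padded image;
# using the longer side for both height and width maps them onto original pixel coordinates.
side = max(image.size)
results = processor.post_process_object_detection(
    outputs=outputs, target_sizes=torch.tensor([[side, side]]), threshold=0.1
)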

NielsRogge avatar Oct 14 '23 09:10 NielsRogge

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Nov 08 '23 08:11 github-actions[bot]

@RRoundTable I have been trying to reproduce the results (AP values) on the LVIS dataset using the example script that you provided. Did you manage to reproduce the results?

rishabh-akridata avatar Mar 01 '24 13:03 rishabh-akridata

@NielsRogge I am using the Owlv2 processor, but I'm still not able to get the same results.

rishabh-akridata avatar Mar 06 '24 06:03 rishabh-akridata

@rishabh-akridata please provide a script that reproduces your issue

NielsRogge avatar Mar 06 '24 06:03 NielsRogge

@NielsRogge Please find the script below.


import skimage
import os
import matplotlib.pyplot as plt
from copy import deepcopy
import numpy as np
import cv2
import requests
from PIL import Image
import torch
from transformers import OwlViTProcessor, OwlViTForObjectDetection, Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlvit-base-patch32", size={"height":768, "width":768})
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

filename = os.path.join(skimage.data_dir, 'astronaut.png')
image = Image.open(filename)
texts = ['face', 'rocket', 'nasa badge', 'star-spangled banner']
inputs = processor(text=texts, images=image, return_tensors="pt")
# Target image sizes (height, width) to rescale box predictions [batch_size, 2]
target_sizes = torch.Tensor([image.size[::-1]])
outputs = model(**inputs)
# Convert outputs (bounding boxes and class logits) to Pascal VOC format (xmin, ymin, xmax, ymax)
results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes, threshold=0.2)
# results = post_process_object_detection_evaluation(outputs, target_sizes=target_sizes, pred_per_im=10)
# font
font = cv2.FONT_HERSHEY_SIMPLEX
# fontScale
fontScale = 0.5
# Green color in BGR
color = (0, 255, 0)
# Line thickness of 2 px
thickness = 2

boxes, scores, labels = results[0]["boxes"], results[0]["scores"], results[0]["labels"]  # single image in the batch
image_to_plot = deepcopy(np.array(image))
image_to_plot = image_to_plot.astype(np.uint8)
for box, score, label in zip(boxes, scores, labels):
    box = [round(i, 2) for i in box.tolist()]
    xmin, ymin, xmax, ymax = int(box[0]), int(box[1]), int(box[2]), int(box[3])
    cv2.rectangle(image_to_plot, (xmin, ymin), (xmax, ymax), (255, 0, 0), 2)
    rounded_score = round(float(score), 2)
    # Using cv2.putText() method
    cv2.putText(image_to_plot, f"{texts[label]}:{rounded_score}", (xmin, ymax), font, fontScale,
                    color, thickness, cv2.LINE_AA, False)

plt.imshow(image_to_plot)
plt.show()

download (1)

rishabh-akridata avatar Mar 06 '24 06:03 rishabh-akridata

@NielsRogge I also tried to use the processor below, but I'm facing the same issue. processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble", size={"height":768, "width":768})

rishabh-akridata avatar Mar 06 '24 06:03 rishabh-akridata

@NielsRogge When I reduce the confidence threshold to 0.1, I do get some detections, but with very low confidence, and the boxes are not the same as in the official Colab notebook. download

rishabh-akridata avatar Mar 06 '24 06:03 rishabh-akridata

@NielsRogge Please ignore this one, I was looking at the results of a different model variant. I am able to get the same results as in the Colab notebook. Sorry for the inconvenience caused.

Thanks.

rishabh-akridata avatar Mar 06 '24 07:03 rishabh-akridata