HF CLIP image features different from OpenAI CLIP image features
System Info
Python 3.8, CUDA 12.1, Ubuntu 20.04, latest clip, transformers==4.26.1
Who can help?
@amyeroberts
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
```python
import numpy as np
import requests
import torch
from PIL import Image

url = "https://canary.contestimg.wish.com/api/webimage/61b241a3a4ee2ecaf2f63c77-large.jpg?cache_buster=bbeee1fdb460a1d12bc266824914e030"

# get HF image features
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model.get_image_features(**inputs)
pooled_output_hf = outputs.detach().cpu().numpy()

# get OpenAI image features
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
image = preprocess(Image.open(requests.get(url, stream=True).raw)).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
pooled_output_clip = image_features.detach().cpu().numpy()

# check difference
assert np.allclose(pooled_output_hf, pooled_output_clip, atol=0.1), "hf and clip too different"
```
Expected behavior
HF CLIP image features should be close to OpenAI CLIP image features, but they differ by more than 0.1.
Hi @junwang-wish, thanks for reporting this issue and the detailed reproduction script. I'll dig into this to find where the differences are coming from.
Thanks @amyeroberts. Given the significant difference, would you recommend using HF CLIP or OpenAI CLIP, based on your domain expertise?
@junwang-wish I managed to track the difference in values down to a slight difference in how the images are cropped during preprocessing. The cropping in the feature extractor changed with #17628, which resulted in the position of the crop occasionally being 1 pixel to the left of, or above, the OpenAI implementation's. PR #22608 aims to address this. Checking this update on the repro example in this issue, I can confirm the OpenAI and HF CLIP models return equivalent outputs again.
In terms of which to use, it depends on what you want to use the model for. As the difference arises from preprocessing rather than from the models themselves, there shouldn't be any significant difference in outputs provided the same preprocessed image is passed in, so I'd recommend whichever fits best within your workflow.
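For anyone who wants to verify this locally, here is a minimal sketch (assuming the same `transformers` and `clip` packages as in the repro above) that compares the preprocessed pixel values directly, before either model runs, so a crop shift shows up without any model noise:

```python
import requests
import torch
from PIL import Image

import clip
from transformers import CLIPProcessor

url = "https://canary.contestimg.wish.com/api/webimage/61b241a3a4ee2ecaf2f63c77-large.jpg?cache_buster=bbeee1fdb460a1d12bc266824914e030"
image = Image.open(requests.get(url, stream=True).raw)

# HF preprocessing
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
hf_pixels = processor(images=image, return_tensors="pt")["pixel_values"]

# OpenAI preprocessing (the model weights are not needed for this check,
# but clip.load is the documented way to obtain the preprocess transform)
_, preprocess = clip.load("ViT-B/32", device="cpu")
openai_pixels = preprocess(image).unsqueeze(0)

# With matching preprocessing this is ~0; a 1-pixel crop shift makes it nonzero.
print("max |pixel diff|:", (hf_pixels - openai_pixels).abs().max().item())
```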
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@amyeroberts @junwang-wish Hi, I have the same issue with transformers==4.30.2.
I found that the preprocessing makes the difference. I tried 3 different ways to do the preprocessing, and only the third one, from OpenAI's implementation, keeps the correct results.
1. Use `CLIPFeatureExtractor`.
2. A torchvision pipeline:
```python
tform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize(
        (224, 224),
        interpolation=transforms.InterpolationMode.BICUBIC,
        antialias=False,
    ),
    transforms.Normalize(
        [0.48145466, 0.4578275, 0.40821073],
        [0.26862954, 0.26130258, 0.27577711],
    ),
])
```
3. From OpenAI's original preprocessing:
```python
x = kornia.geometry.resize(x, (224, 224), interpolation='bicubic', align_corners=True, antialias=False)
x = (x + 1.) / 2.
x = kornia.enhance.normalize(
    x,
    torch.Tensor([0.48145466, 0.4578275, 0.40821073]),
    torch.Tensor([0.26862954, 0.26130258, 0.27577711]),
)
```
I'm wondering if this will be fixed in a newer version, or whether the repo isn't trying to keep exactly the same results as OpenAI's CLIP. Thanks.
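For reference, OpenAI's `preprocess` (the `_transform` helper in `clip/clip.py`) resizes the shorter side and then center-crops, so a direct `(224, 224)` resize as in option 2 changes the aspect ratio and cannot reproduce it exactly. Here is a rough torchvision equivalent (a sketch, not the package's verbatim code):

```python
from torchvision import transforms

# Approximately what clip.load(...) returns as `preprocess`:
# resize the *shorter* side to 224, then center-crop to 224x224.
openai_like_preprocess = transforms.Compose([
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    lambda im: im.convert("RGB"),  # OpenAI converts to RGB before ToTensor
    transforms.ToTensor(),
    transforms.Normalize(
        (0.48145466, 0.4578275, 0.40821073),
        (0.26862954, 0.26130258, 0.27577711),
    ),
])
```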
@rafaelpadilla if you have time to look into this, it would be awesome!
Investigating this issue and the proposed example, I found that the resulting image produced by HF is shifted up by 1 pixel in comparison to the transformation used by OpenAI (`torchvision.transforms.CenterCrop`), as presented here.
This happens because our `center_crop` function does not behave like `torchvision.transforms.CenterCrop` if `orig_height - crop_height` is odd or if `orig_width - crop_width` is odd.
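As an illustration of how two center-crop conventions can drift apart (a sketch with simplified offset formulas, not the exact code from either library):

```python
# torchvision's CenterCrop rounds half the difference to pick the top/left
# offset; an implementation that floors instead agrees whenever the
# difference is even, but can land one pixel away when it is odd.
def top_by_rounding(orig: int, crop: int) -> int:
    return int(round((orig - crop) / 2.0))

def top_by_flooring(orig: int, crop: int) -> int:
    return (orig - crop) // 2

for orig in (226, 227, 228, 229):
    print(orig, top_by_rounding(orig, 224), top_by_flooring(orig, 224))
# 227 - 224 = 3 is odd: rounding gives top=2, flooring gives top=1 -> the
# whole crop shifts by one pixel.
```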
I have worked on a solution in PR #26238. This is a general solution, making our `center_crop` behave like `torchvision.transforms.CenterCrop`, and impacting all other modules that call `center_crop`.
However, an older PR #22608 seems to address the same issue with a new `crop` transformation in `image_transforms.py`, allowing changes to the `center_crop` in CLIP's image processing only. I'm working out how that PR may impact other models and will leave my review there.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.