HF CLIP image features different from OpenAI CLIP image features
System Info
Python 3.8, CUDA 12.1, Ubuntu 20.04, latest clip, transformers==4.26.1
Who can help?
@amyeroberts
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
```python
import numpy as np
import requests
import torch
from PIL import Image

url = "https://canary.contestimg.wish.com/api/webimage/61b241a3a4ee2ecaf2f63c77-large.jpg?cache_buster=bbeee1fdb460a1d12bc266824914e030"

# get HF image features
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model.get_image_features(**inputs)
pooled_output_hf = outputs.detach().cpu().numpy()

# get OpenAI image features
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
image = preprocess(Image.open(requests.get(url, stream=True).raw)).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
pooled_output_clip = image_features.detach().cpu().numpy()

# check difference
assert np.allclose(pooled_output_hf, pooled_output_clip, atol=0.1), "hf and clip too different"
```
Expected behavior
HF CLIP image features should be close to OpenAI CLIP image features, but they differ by more than 0.1.
Hi @junwang-wish, thanks for reporting this issue and the detailed reproduction script. I'll dig into this to find where the differences are coming from.
Thanks @amyeroberts. Given the significant difference, would you recommend using HF CLIP or OpenAI CLIP, based on your domain expertise?
@junwang-wish I managed to track the difference in values down to a slight difference in how the images are cropped during preprocessing. The cropping in the feature extractor changed with #17628, which resulted in the position of the crop occasionally being 1 pixel to the left of, or above, the OpenAI implementation's. PR #22608 aims to address this. Checking this update on the repro example in this issue, I can confirm the OpenAI and HF CLIP models return equivalent outputs again.
In terms of which to use, it depends on what you want to use the model for. As the difference arises from preprocessing rather than from the models themselves, there shouldn't be any significant difference in outputs provided the same preprocessed image is passed in, so I'd recommend whichever fits best within your workflow.
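For anyone who wants to verify this locally, here is a minimal sketch (assuming the same `transformers` and `clip` packages as in the repro above) that compares the preprocessed pixel values directly, before either model runs, so a crop shift shows up without any model noise:

```python
import requests
import torch
from PIL import Image

import clip
from transformers import CLIPProcessor

url = "https://canary.contestimg.wish.com/api/webimage/61b241a3a4ee2ecaf2f63c77-large.jpg?cache_buster=bbeee1fdb460a1d12bc266824914e030"
image = Image.open(requests.get(url, stream=True).raw)

# HF preprocessing
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
hf_pixels = processor(images=image, return_tensors="pt")["pixel_values"]

# OpenAI preprocessing (the model weights are not needed for this check,
# but clip.load is the documented way to obtain the preprocess transform)
_, preprocess = clip.load("ViT-B/32", device="cpu")
openai_pixels = preprocess(image).unsqueeze(0)

# With matching preprocessing this is ~0; a 1-pixel crop shift makes it nonzero.
print("max |pixel diff|:", (hf_pixels - openai_pixels).abs().max().item())
```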
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@amyeroberts @junwang-wish Hi, I have the same issue with transformers==4.30.2.
I found that the preprocessing makes the difference. I tried 3 different ways to do the preprocessing, and only the third one, from OpenAI's implementation, keeps the correct results.
1. Use `CLIPFeatureExtractor`.
2. A torchvision pipeline:
```python
tform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Resize(
        (224, 224),
        interpolation=transforms.InterpolationMode.BICUBIC,
        antialias=False,
    ),
    transforms.Normalize(
        [0.48145466, 0.4578275, 0.40821073],
        [0.26862954, 0.26130258, 0.27577711],
    ),
])
```
3. From OpenAI's original preprocessing:
```python
x = kornia.geometry.resize(x, (224, 224), interpolation='bicubic', align_corners=True, antialias=False)
x = (x + 1.) / 2.
x = kornia.enhance.normalize(
    x,
    torch.Tensor([0.48145466, 0.4578275, 0.40821073]),
    torch.Tensor([0.26862954, 0.26130258, 0.27577711]),
)
```
I'm wondering if this will be fixed in a newer version, or whether the repo isn't trying to keep exactly the same results as OpenAI's CLIP. Thanks.
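For reference, OpenAI's `preprocess` (the `_transform` helper in `clip/clip.py`) resizes the shorter side and then center-crops, so a direct `(224, 224)` resize as in option 2 changes the aspect ratio and cannot reproduce it exactly. Here is a rough torchvision equivalent (a sketch, not the package's verbatim code):

```python
from torchvision import transforms

# Approximately what clip.load(...) returns as `preprocess`:
# resize the *shorter* side to 224, then center-crop to 224x224.
openai_like_preprocess = transforms.Compose([
    transforms.Resize(224, interpolation=transforms.InterpolationMode.BICUBIC),
    transforms.CenterCrop(224),
    lambda im: im.convert("RGB"),  # OpenAI converts to RGB before ToTensor
    transforms.ToTensor(),
    transforms.Normalize(
        (0.48145466, 0.4578275, 0.40821073),
        (0.26862954, 0.26130258, 0.27577711),
    ),
])
```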
@rafaelpadilla if you have time to look into this, it would be awesome!
Investigating this issue and the proposed example, I found that the resulting image produced by HF is shifted up by 1 pixel in comparison to the transformation used by OpenAI (`torchvision.transforms.CenterCrop`), as presented here.
This happens because our `center_crop` function does not behave like `torchvision.transforms.CenterCrop` if `orig_height - crop_height` is odd or if `orig_width - crop_width` is odd.
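As an illustration of how two center-crop conventions can drift apart (a sketch with simplified offset formulas, not the exact code from either library):

```python
# torchvision's CenterCrop rounds half the difference to pick the top/left
# offset; an implementation that floors instead agrees whenever the
# difference is even, but can land one pixel away when it is odd.
def top_by_rounding(orig: int, crop: int) -> int:
    return int(round((orig - crop) / 2.0))

def top_by_flooring(orig: int, crop: int) -> int:
    return (orig - crop) // 2

for orig in (226, 227, 228, 229):
    print(orig, top_by_rounding(orig, 224), top_by_flooring(orig, 224))
# 227 - 224 = 3 is odd: rounding gives top=2, flooring gives top=1 -> the
# whole crop shifts by one pixel.
```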
I have worked on a solution in PR #26238. This is a general solution, making our `center_crop` behave like `torchvision.transforms.CenterCrop`, and impacting all other modules that call `center_crop`.
However, an older PR #22608 seems to address the same issue with a new `crop` transformation in `image_transforms.py`, allowing changes to the `center_crop` in CLIP's image processing only. I'm working out how that PR may impact other models and will leave my review there.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.