LAION embeddings and CLIP output for the same image do not match exactly
The LAION website provides embedding arrays and parquet files that tie the embedding at a given index in the array to its associated metadata. In theory, the CLIP output and the LAION embedding for the same image should be identical, but they're not. Here's how to reproduce:
from io import BytesIO
import clip
import numpy as np
import pandas as pd
import requests
import torch
from PIL import Image
from sklearn.metrics.pairwise import cosine_similarity
# embeddings: https://deploy.laion.ai/8f83b608504d46bb81708ec86e912220/embeddings/img_emb/img_emb_0.npy
# parquet: https://deploy.laion.ai/8f83b608504d46bb81708ec86e912220/embeddings/metadata/metadata_0.parquet
INDEX = 0
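# Load the precomputed LAION embedding and the image URL stored at the same index.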
laion_embedding = np.load("img_emb_0.npy")[INDEX]
laion_embedding = np.expand_dims(laion_embedding, 0)
url = pd.read_parquet("metadata_0.parquet")["url"][INDEX]
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
response = requests.get(url)
image = Image.open(BytesIO(response.content))
image = preprocess(image).unsqueeze(0).to(device)
with torch.no_grad():
    clip_embedding = model.encode_image(image).cpu().numpy()
print("Pre-norm difference:", (laion_embedding - clip_embedding).sum())
clip_embedding = clip_embedding / np.linalg.norm(clip_embedding)
print("Post-norm difference:", (laion_embedding - clip_embedding).sum())
print("Cosine sim:", cosine_similarity(laion_embedding, clip_embedding))
This outputs:
Pre-norm difference: -4.79
Post-norm difference: 0.1592
Cosine sim: [[0.91325109]]
I've verified that LAION uses the ViT-B/32 backbone as well. I'm wondering what might be causing the discrepancy here. Any ideas?
This is related to https://github.com/rom1504/clip-retrieval/discussions/100, which didn't seem to be resolved. I was able to replicate the results from that discussion, but it's not a perfect solution.
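For reference, the workaround from that discussion amounts to resizing the downloaded image before CLIP's own preprocessing, to approximate what the dataset-creation pipeline (img2dataset) did. Here is a minimal sketch of that idea, continuing from the script above; the 256x256 target and bicubic filter are assumptions on my part, not confirmed LAION settings:
# Hedged sketch: approximate the dataset pipeline by resizing the raw image
# before CLIP preprocessing. 256x256 and BICUBIC are assumed values.
raw_image = Image.open(BytesIO(requests.get(url).content))
resized = raw_image.resize((256, 256), Image.BICUBIC)
tensor = preprocess(resized).unsqueeze(0).to(device)
with torch.no_grad():
    resized_embedding = model.encode_image(tensor).cpu().numpy()
resized_embedding = resized_embedding / np.linalg.norm(resized_embedding)
print("Cosine sim after resize:", cosine_similarity(laion_embedding, resized_embedding))
Even with a resize in place, exact equality seems unlikely: clip.load runs the model in float16 on CUDA and float32 on CPU, so precision and environment differences alone can leave small numerical gaps.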
Another comparable problem is that the text embeddings in LAION-400M do not match the output of the CLIP ViT-B/32 text encoder.
@stevebottos The text embedding should not be affected by image resizing...
@TianRui-Song717 That's strange, any luck discovering why?