LAION embeddings and CLIP output for the same image do not match exactly
The LAION website provides embedding arrays and parquet files that tie the embedding at a given index in the array to its associated metadata. In theory, the CLIP output and the LAION embedding for the same image should be identical, but they're not. Here's how to reproduce:
from io import BytesIO
import clip
import numpy as np
import pandas as pd
import requests
import torch
from PIL import Image
from sklearn.metrics.pairwise import cosine_similarity
# embeddings: https://deploy.laion.ai/8f83b608504d46bb81708ec86e912220/embeddings/img_emb/img_emb_0.npy
# parquet: https://deploy.laion.ai/8f83b608504d46bb81708ec86e912220/embeddings/metadata/metadata_0.parquet
INDEX = 0
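# Load the precomputed LAION embedding and the image URL stored at the same index.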
laion_embedding = np.load("img_emb_0.npy")[INDEX]
laion_embedding = np.expand_dims(laion_embedding, 0)
url = pd.read_parquet("metadata_0.parquet")["url"][INDEX]
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
response = requests.get(url)
image = Image.open(BytesIO(response.content))
image = preprocess(image).unsqueeze(0).to(device)
with torch.no_grad():
    clip_embedding = model.encode_image(image).cpu().numpy()
print("Pre-norm difference:", (laion_embedding - clip_embedding).sum())
clip_embedding = clip_embedding / np.linalg.norm(clip_embedding)
print("Post-norm difference:", (laion_embedding - clip_embedding).sum())
print("Cosine sim:", cosine_similarity(laion_embedding, clip_embedding))
This outputs:
Pre-norm difference: -4.79
Post-norm difference: 0.1592
Cosine sim: [[0.91325109]]
I've verified that LAION uses the ViT-B/32 backbone as well. I'm wondering what might be causing the discrepancy here. Any ideas?
This is related to https://github.com/rom1504/clip-retrieval/discussions/100, which didn't seem to be resolved. I was able to replicate the results from that discussion, but it's not a perfect solution.
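For reference, the workaround from that discussion amounts to resizing the downloaded image before CLIP's own preprocessing, to approximate what the dataset-creation pipeline (img2dataset) did. Here is a minimal sketch of that idea, continuing from the script above; the 256x256 target and bicubic filter are assumptions on my part, not confirmed LAION settings:
# Hedged sketch: approximate the dataset pipeline by resizing the raw image
# before CLIP preprocessing. 256x256 and BICUBIC are assumed values.
raw_image = Image.open(BytesIO(requests.get(url).content))
resized = raw_image.resize((256, 256), Image.BICUBIC)
tensor = preprocess(resized).unsqueeze(0).to(device)
with torch.no_grad():
    resized_embedding = model.encode_image(tensor).cpu().numpy()
resized_embedding = resized_embedding / np.linalg.norm(resized_embedding)
print("Cosine sim after resize:", cosine_similarity(laion_embedding, resized_embedding))
Even with a resize in place, exact equality seems unlikely: clip.load runs the model in float16 on CUDA and float32 on CPU, so precision and environment differences alone can leave small numerical gaps.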
Another comparable problem is that the text embeddings in LAION-400M do not match the output of the CLIP ViT-B/32 text encoder.
@stevebottos The text embedding should not be affected by image resizing...
@TianRui-Song717 That's strange, any luck discovering why?