BLIP
How to use BLIP for duplicate or near-duplicate images?
Given a pair of images, my use case is to detect whether they are duplicates or not.
(imageX, imageY) -> verdict/score, where verdict = duplicate / not duplicate / near duplicate
How can I use BLIP for this use case?
You can compute the cosine similarity of their image embeddings.
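For example, assuming emb_x and emb_y are the two 1-D embedding tensors you have extracted (the variable names here are just for illustration), a minimal sketch:

import torch.nn.functional as F

# emb_x, emb_y: 1-D image embeddings extracted for the two images
score = F.cosine_similarity(emb_x, emb_y, dim=0)  # 0-dim tensor in [-1, 1]
print(score.item())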
For reference, there are basic tools to find duplicates: https://github.com/idealo/imagededup
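If you don't specifically need BLIP, imagededup exposes a perceptual-hash workflow roughly like this (written from memory of its README, so treat the exact API as an assumption):

from imagededup.methods import PHash

phasher = PHash()
# hash every image in a directory, then group near-identical ones
encodings = phasher.encode_images(image_dir='path/to/images')
duplicates = phasher.find_duplicates(encoding_map=encodings)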
@LiJunnan1992 do you have an example of how to extract image embeddings?
I don't see one in https://github.com/salesforce/BLIP/blob/main/demo.ipynb
Please refer to this code in the demo: image_feature = model(image, caption, mode='image')[0,0]
@LiJunnan1992 as I mentioned, my use case is: given two images, detect whether they are duplicates or not.
For this I have to get the embeddings of the two images and then compute their cosine similarity.
But the code sample in the demo also has a caption involved:
image_feature = model(image, caption, mode='image')[0,0]
As I mentioned, I only want to get an embedding given an image, with no caption. Is that possible with this model?
@smith-co I tested with the following:
import torch
from torch import nn
from models.blip import blip_feature_extractor

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

image_size = 224
# load_demo_image is defined in demo.ipynb; it downloads the demo image and
# returns a preprocessed tensor of shape (1, 3, image_size, image_size)
image = load_demo_image(image_size=image_size, device=device)

model_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth'
model = blip_feature_extractor(pretrained=model_url, image_size=image_size, vit='base')
model.eval()
model = model.to(device)

caption = 'a woman sitting on the beach with a dog'

# [0,0] selects the first (CLS) token feature of the first batch element
multimodal_feature = model(image, caption, mode='multimodal')[0,0]
image_feature = model(image, '', mode='image')[0,0]
text_feature = model(image, caption, mode='text')[0,0]

# same image in image-only mode, once with a caption and once without
image_with_caption = model(image, caption, mode='image')[0,0]
image_without_caption = model(image, '', mode='image')[0,0]

cos = nn.CosineSimilarity(dim=0)
score = cos(image_with_caption, image_without_caption)
print(score)
Output:
tensor(1., grad_fn=<DivBackward0>)
As you can see, the cosine similarity comes out as 1. So when you query the model in image mode, i.e. model(image, caption, mode='image')[0,0], the caption is ignored and you get an embedding for the image alone. At least that's what I observe in the snippet above.
But @LiJunnan1992 could provide more authoritative feedback in case I am missing something.
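Putting it together for the original use case, something like the following should work. The two thresholds are illustrative assumptions, not validated values, so you would want to tune them on your own data:

import torch
from torch import nn

def blip_image_embedding(model, image):
    # image-only mode with an empty caption; [0,0] is the CLS token feature
    return model(image, '', mode='image')[0,0]

def duplicate_verdict(model, image_x, image_y, dup_thresh=0.99, near_thresh=0.95):
    # dup_thresh / near_thresh are illustrative assumptions, not validated
    with torch.no_grad():
        emb_x = blip_image_embedding(model, image_x)
        emb_y = blip_image_embedding(model, image_y)
    score = nn.CosineSimilarity(dim=0)(emb_x, emb_y).item()
    if score >= dup_thresh:
        return 'duplicate', score
    if score >= near_thresh:
        return 'near duplicate', score
    return 'not duplicate', score

# usage, reusing the model and preprocessing from the snippet above
# (image_x and image_y are hypothetical preprocessed image tensors):
# verdict, score = duplicate_verdict(model, image_x, image_y)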