
How to use BLIP for duplicate or near-duplicate images?

smith-co opened this issue 2 years ago • 6 comments

Given a pair of images, my use case is to detect whether they are duplicates or not.

(imageX, imageY) = verdict/score
verdict = duplicate / not duplicate / near duplicate

How can I use BLIP for this use case?

smith-co avatar Jun 29 '22 02:06 smith-co

You can compute the cosine similarity of their image embeddings

LiJunnan1992 avatar Jun 29 '22 03:06 LiJunnan1992
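
As a rough illustration of that suggestion (a minimal sketch, not code from the BLIP repo; embedding_similarity is a hypothetical helper, and the two embeddings are assumed to be 1-D torch tensors extracted as discussed further down the thread):

import torch
import torch.nn.functional as F

def embedding_similarity(emb_x: torch.Tensor, emb_y: torch.Tensor) -> float:
    # F.cosine_similarity expects a batch dimension, hence unsqueeze(0)
    return F.cosine_similarity(emb_x.unsqueeze(0), emb_y.unsqueeze(0)).item()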

For reference, there are basic tools to find duplicates: https://github.com/idealo/imagededup

woctezuma avatar Jun 29 '22 06:06 woctezuma

You can compute the cosine similarity of their image embeddings

@LiJunnan1992 do you have an example of how to extract image embeddings?

I can't find any example here: https://github.com/salesforce/BLIP/blob/main/demo.ipynb

smith-co avatar Jun 29 '22 06:06 smith-co

Please refer to this code in the demo: image_feature = model(image, caption, mode='image')[0,0]

LiJunnan1992 avatar Jun 29 '22 08:06 LiJunnan1992
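
For the two-image case, putting that call together might look roughly like the following (a sketch under assumptions: model is a loaded blip_feature_extractor, and image_x / image_y are tensors preprocessed the same way as in the demo notebook; neither variable name comes from the repo):

import torch
from torch import nn

with torch.no_grad():
    # mode='image' uses only the vision encoder, so the caption can be empty
    emb_x = model(image_x, '', mode='image')[0, 0]
    emb_y = model(image_y, '', mode='image')[0, 0]

cos = nn.CosineSimilarity(dim=0)
print(cos(emb_x, emb_y).item())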

@LiJunnan1992 as I mentioned, in my case I am given two images and have to detect whether they are duplicates or not.

For this I have to get the embeddings of the two images and then compute the cosine similarity.

But the code sample in the demo also has a caption involved:

image_feature = model(image, caption, mode='image')[0,0]

As I mentioned, I only want to get an embedding given an image. Is that possible with this model?

smith-co avatar Jun 29 '22 19:06 smith-co

@smith-co I tested with the following:

import torch
from torch import nn
from models.blip import blip_feature_extractor

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# load_demo_image is the helper defined in the BLIP demo notebook (demo.ipynb)
image_size = 224
image = load_demo_image(image_size=image_size, device=device)

model_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth'

model = blip_feature_extractor(pretrained=model_url, image_size=image_size, vit='base')
model.eval()
model = model.to(device)

caption = 'a woman sitting on the beach with a dog'

# Features from the three modes supported by the feature extractor
multimodal_feature = model(image, caption, mode='multimodal')[0,0]
image_feature = model(image, '', mode='image')[0,0]
text_feature = model(image, caption, mode='text')[0,0]

# Image embedding with and without a caption passed in
image_with_caption = model(image, caption, mode='image')[0,0]
image_without_caption = model(image, '', mode='image')[0,0]

cos = nn.CosineSimilarity(dim=0)
score = cos(image_with_caption, image_without_caption)
print(score)

Output:

tensor(1., grad_fn=<DivBackward0>)

As you can see, the cosine similarity comes out as 1. So when you query the model with model(image, caption, mode='image')[0,0], it only returns the image embedding and the caption has no effect. At least that's what I observe in the code snippet above.

But sure, @LiJunnan1992 could provide more authoritative feedback in case I am missing something.

nashid avatar Jun 29 '22 20:06 nashid
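
To tie this back to the original verdict/score question, one way to map the cosine similarity score to a verdict is a simple threshold rule (a sketch only; the 0.95 and 0.85 cut-offs are illustrative assumptions, not values from BLIP, and would need tuning on real duplicate/near-duplicate data):

def verdict(score: float) -> str:
    # Thresholds below are illustrative guesses and should be tuned
    if score >= 0.95:
        return 'duplicate'
    if score >= 0.85:
        return 'near duplicate'
    return 'not duplicate'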