Help with embedding arithmetic and image retrieval

Open bakachan19 opened this issue 1 year ago • 3 comments

Hi, thanks for your great work. I am interested in the embedding arithmetic and image retrieval example shown in Figure 4 of the paper.

In the paper, the embedding arithmetic is described as follows:

For arithmetic, we again use the embedding features after temperature scaling. We ℓ2 normalize the features and sum the embeddings after scaling them by 0.5. We use the combined feature to perform nearest neighbor retrieval using cosine distance, as described above.

To obtain the embedding features after temperature scaling, can I just use the following code?

########## - step 1 - ########## 
# Load data
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}

with torch.no_grad():
    embeddings = model(inputs)

This code applies normalization and temperature scaling for each modality (except for the image modality, where it applies only normalization). Or should I modify the way the embeddings are returned by removing the normalization and applying only temperature scaling? https://github.com/facebookresearch/ImageBind/blob/38a9132636f6ca2acdd6bb3d3c10be5859488f59/models/imagebind_model.py#LL422C1-L424C10

After obtaining the embedding features after temperature scaling, do I need to apply another ℓ2 normalization, something like:

########## - step 2 - ########## 
img_embedding = embeddings[ModalityType.VISION]
txt_embedding = embeddings[ModalityType.TEXT]

img_embedding = img_embedding / torch.norm(img_embedding, dim=-1, keepdim=True)
txt_embedding = txt_embedding / torch.norm(txt_embedding, dim=-1, keepdim=True)

and then combine the embeddings of the two modalities, like this?

combined_embs = 0.5 * img_embedding + 0.5 * txt_embedding
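
On the question of a second ℓ2 normalization, one property that may be relevant: ℓ2 normalization is invariant to multiplication by a positive scalar. If temperature scaling amounts to a single positive factor applied to an already normalized embedding, normalizing again simply removes that factor. The sketch below illustrates this in plain Python with made-up 3-dimensional vectors. The scale value 20.0 and the helper `l2_normalize` are assumptions for illustration only, not values or functions taken from the model.

```python
import math


def l2_normalize(vec):
    """Return vec divided by its Euclidean (l2) norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


# Hypothetical stand-in for a text embedding: a unit vector times a scale.
unit_txt = l2_normalize([1.0, 2.0, 2.0])
logit_scale = 20.0  # arbitrary positive scale, chosen only for illustration
scaled_txt = [logit_scale * x for x in unit_txt]

# Normalizing the scaled vector recovers the original unit vector.
renormalized_txt = l2_normalize(scaled_txt)
scale_cancels = all(
    math.isclose(a, b, rel_tol=1e-9) for a, b in zip(renormalized_txt, unit_txt)
)

# Hypothetical stand-in for an image embedding that is only normalized.
unit_img = l2_normalize([2.0, 1.0, 2.0])

# Weighted sum of the two unit vectors, as in the snippet above.
combined = [0.5 * a + 0.5 * b for a, b in zip(unit_img, renormalized_txt)]
combined_norm = math.sqrt(sum(x * x for x in combined))

print("scale cancels after re-normalization:", scale_cancels)
print("norm of combined embedding:", combined_norm)
```

Under that assumption, the division by `torch.norm(..., dim=-1, keepdim=True)` in step 2 would play the role of `l2_normalize` here. The combined vector is generally shorter than unit length, which does not affect a ranking based on cosine similarity.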

Then, do I just use combined_embs to compute the cosine similarity against the embeddings of the set of images (extracted as in step 1) that I want to retrieve from?
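
The retrieval step asked about here can be sketched as a nearest neighbor search by cosine similarity. This is a minimal illustration in plain Python, not the authors' implementation: the functions `l2_normalize`, `cosine_similarity`, and `retrieve`, and the toy 3-dimensional vectors standing in for real embeddings, are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Return vec divided by its Euclidean (l2) norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity: dot product of the two l2-normalized vectors."""
    a_hat, b_hat = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_hat, b_hat))


def retrieve(query, gallery, top_k=1):
    """Return (indices of the top_k most similar gallery items, all scores)."""
    scores = [cosine_similarity(query, item) for item in gallery]
    ranked = sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k], scores


# Toy stand-ins for the normalized image and text embeddings.
img_embedding = l2_normalize([1.0, 0.0, 0.0])
txt_embedding = l2_normalize([0.0, 1.0, 0.0])
combined_embs = [0.5 * a + 0.5 * b for a, b in zip(img_embedding, txt_embedding)]

# Toy stand-ins for the gallery image embeddings to retrieve from.
gallery = [
    [1.0, 0.0, 0.0],  # matches only the image direction
    [1.0, 1.0, 0.0],  # matches both the image and text directions
    [0.0, 0.0, 1.0],  # unrelated to the query
]

top, scores = retrieve(combined_embs, gallery, top_k=2)
print("top indices:", top)  # [1, 0]
print("scores:", scores)
```

With PyTorch tensors, the same steps would typically be `torch.nn.functional.normalize` on the query and gallery followed by a matrix product and a top-k selection.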

I apologize for the long post. I greatly appreciate any tips and advice on how to approach this issue.

Many thanks!

bakachan19 avatar May 23 '23 16:05 bakachan19

I would also like to hear the authors' opinion on this.

gorjanradevski avatar Dec 22 '23 23:12 gorjanradevski

Same here

SenmiaoORZ avatar Dec 25 '23 07:12 SenmiaoORZ

@gorjanradevski, @SenmiaoORZ, have you had any new insights regarding this? I'm still curious about it. Thank you.

bakachan19 avatar May 22 '24 15:05 bakachan19