ImageBind
Help with embedding arithmetic and image retrieval
Hi, thanks for your great work. I am interested in the embedding arithmetic and image retrieval shown in Figure 4 of the paper.
In the paper, the embedding arithmetic is described as follows:
For arithmetic, we again use the embedding features after temperature scaling. We ℓ2 normalize the features and sum the embeddings after scaling them by 0.5. We use the combined feature to perform nearest neighbor retrieval using cosine distance, as described above.
To obtain the embedding features after temperature scaling, can I just use the following code?
########## - step 1 - ##########
# Load data
inputs = {
    ModalityType.TEXT: data.load_and_transform_text(text_list, device),
    ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
    ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
}
with torch.no_grad():
    embeddings = model(inputs)
This applies normalization and temperature scaling for each modality (except for the image modality, where it applies only normalization). Or should I modify the way the embeddings are returned by removing the normalization and applying only temperature scaling? https://github.com/facebookresearch/ImageBind/blob/38a9132636f6ca2acdd6bb3d3c10be5859488f59/models/imagebind_model.py#LL422C1-L424C10
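One observation that may help with this question: assuming the temperature scaling is a multiplication by a single positive scalar per modality (which is how the linked postprocessor appears to work, but treat that as an assumption), ℓ2 normalizing a scaled embedding gives back the same unit vector as before scaling. The sketch below checks this on a toy vector in plain Python. The names `raw`, `logit_scale`, and `l2_normalize` are illustrative only and are not part of ImageBind.

```python
import math


def l2_normalize(vec):
    """Return vec rescaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


# Toy stand-in for a raw encoder output (not a real ImageBind embedding).
raw = [3.0, 4.0, 12.0]
# Hypothetical temperature factor; the real value depends on the modality.
logit_scale = 20.0

unit = l2_normalize(raw)                    # step A: normalize
scaled = [logit_scale * x for x in unit]    # step B: temperature-scale
renormalized = l2_normalize(scaled)         # step C: normalize again

# After step B the vector has length logit_scale; step C removes that factor.
scaled_norm = math.sqrt(sum(x * x for x in scaled))
same_direction = all(math.isclose(a, b) for a, b in zip(unit, renormalized))
print(scaled_norm, same_direction)
```

If that assumption holds, a second ℓ2 normalization simply cancels the temperature factor, so the order only matters when the scaled magnitudes are meant to be kept.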
After obtaining the embedding features after temperature scaling, do I need to apply another ℓ2 normalization, something like:
########## - step 2 - ##########
img_embedding = embeddings[ModalityType.VISION]
txt_embedding = embeddings[ModalityType.TEXT]
img_embedding = img_embedding / torch.norm(img_embedding, dim=-1, keepdim=True)
txt_embedding = txt_embedding / torch.norm(txt_embedding, dim=-1, keepdim=True)
and then combine the embeddings of the two modalities, like this?
combined_embs = 0.5 * img_embedding + 0.5 * txt_embedding
Then, do I just take combined_embs and compute its cosine similarity with the embeddings of the set of images (extracted as in step 1) that I want to retrieve from?
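To make the intended pipeline concrete, the procedure quoted from the paper (ℓ2 normalize, scale each embedding by 0.5, sum, then nearest-neighbor retrieval by cosine similarity) can be sketched on toy vectors as follows. This is a plain-Python illustration under my own assumptions, not the authors' implementation: the helper names `combine` and `retrieve` and the 3-dimensional vectors are made up, and with real ImageBind outputs the same operations would be done on torch tensors.

```python
import math


def l2_normalize(vec):
    """Return vec rescaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity: dot product of the two unit-length vectors."""
    a_unit, b_unit = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_unit, b_unit))


def combine(img_emb, txt_emb, weight=0.5):
    """Normalize each embedding, scale both by `weight`, and sum them."""
    img_unit = l2_normalize(img_emb)
    txt_unit = l2_normalize(txt_emb)
    return [weight * i + weight * t for i, t in zip(img_unit, txt_unit)]


def retrieve(query, gallery, top_k=1):
    """Rank gallery entries by cosine similarity to the query, best first."""
    scores = [cosine_similarity(query, g) for g in gallery]
    ranked = sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k], scores


# Toy embeddings with different magnitudes, to show normalization matters.
img_emb = [2.0, 0.0, 0.0]
txt_emb = [0.0, 5.0, 0.0]

# Toy gallery of image embeddings to search over.
gallery = [
    [1.0, 0.0, 0.0],  # aligned with the image query only
    [0.0, 1.0, 0.0],  # aligned with the text query only
    [1.0, 1.0, 0.0],  # aligned with both
    [0.0, 0.0, 1.0],  # aligned with neither
]

query = combine(img_emb, txt_emb)
top, scores = retrieve(query, gallery, top_k=2)
print(top, scores)
```

Because both inputs are normalized before summing, neither modality dominates the combined query just by having a larger norm, and the gallery entry aligned with both inputs ranks first in this toy case.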
I apologize for the long post. I greatly appreciate any tips and advice on how to approach this issue.
Many thanks!
I would also like to hear the authors' opinion on this.
Same here
@gorjanradevski, @SenmiaoORZ do you perhaps have any new insights on this? I'm still curious about it. Thank you.