ImageBind
Is it OK to use cosine_similarity instead of softmax for VISION x TEXT?
Hey,
I just want to know if cosine_similarity from sklearn can replace the softmax.
Thanks
I'm also a bit confused about the example code, but I think it's fine. It looks like softmax here is just used to demonstrate the similarity between the different modalities provided by ImageBind.
Yeah, I agree with @DaNious. FWIW, I ran the same comparison with cosine_similarity and also got reasonable results.
The example with "softmax" could be misleading (maybe I'm wrong here): people might confuse it with the "activation" applied during the NN forward pass, which represents a probability.
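A minimal sketch of the two scorings side by side (not from the thread; it assumes embeddings is the dict returned by the ImageBind model, with VISION of shape (num_images, D) and TEXT of shape (num_texts, D), and ModalityType imported from ImageBind). Assuming the embeddings come out L2-normalized, possibly up to a positive scalar logit scale, softmax over the dot products only rescales each column, so the per-column ranking matches plain cosine similarity.

# A minimal sketch (not from the thread). It assumes `embeddings` is the dict
# returned by the ImageBind model, with VISION of shape (num_images, D) and
# TEXT of shape (num_texts, D), and ModalityType imported from ImageBind.
import torch
import torch.nn.functional as F

def vision_text_scores(embeddings):
    vision = embeddings[ModalityType.VISION]
    text = embeddings[ModalityType.TEXT]

    # Explicit cosine similarity: one value in [-1, 1] per image-text pair.
    cosine = F.normalize(vision, dim=-1) @ F.normalize(text, dim=-1).T

    # Softmax over the raw dot products: each column becomes a distribution over
    # images. Softmax is monotone, so the per-column ranking (and argmax) agrees
    # with the cosine scores as long as the embeddings are L2-normalized up to a
    # positive scalar.
    probs = torch.softmax(vision @ text.T, dim=0)

    return cosine, probs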
@XinyueZ, I am interested in computing the text x image similarity, but I am confused about how to do this. Could you please share some example code? Much appreciated. Thank you!
import torch
from sklearn.metrics.pairwise import cosine_similarity

from imagebind.models.imagebind_model import ModalityType  # adjust to your ImageBind install


class SimilarityCalculator:
    """Base class: turns a dict of modality embeddings into a vision x text score matrix."""

    def __init__(self, device):
        self.device = device

    def __call__(self, embeddings):
        raise NotImplementedError

    @staticmethod
    def create_instance(similarity_type, device):
        # SimilarityCalculatorType is an enum defined elsewhere with COSINE and SOFTMAX members.
        if similarity_type == SimilarityCalculatorType.COSINE:
            return CosineSimilarity(device)
        elif similarity_type == SimilarityCalculatorType.SOFTMAX:
            return SoftmaxSimilarity(device)
        else:
            raise ValueError(f"Unknown similarity type: {similarity_type}")


class CosineSimilarity(SimilarityCalculator):
    def __init__(self, device):
        super().__init__(device)

    def __call__(self, embeddings):
        # sklearn's cosine_similarity works on numpy arrays, so move the embeddings to CPU first.
        preds = cosine_similarity(
            embeddings[ModalityType.VISION].cpu().numpy(),
            embeddings[ModalityType.TEXT].cpu().numpy(),
        )
        preds = torch.from_numpy(preds).to(self.device)  # (num_images, num_texts), values in [-1, 1]
        print(
            preds.shape,
            "\n",
            preds,
            "max: ",
            preds.max(dim=0),
            "min: ",
            preds.min(dim=0),
        )
        return preds


class SoftmaxSimilarity(SimilarityCalculator):
    def __init__(self, device):
        super().__init__(device)

    def __call__(self, embeddings):
        # Softmax over the vision-text dot products; dim=0 normalizes over images,
        # so each column sums to 1 and can be read as a distribution over images
        # for a given text.
        preds = torch.softmax(
            embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T,
            dim=0,
        )
        print(
            preds.shape,
            "\n",
            preds,
            "sum: ",
            preds.sum(dim=0),
            "max: ",
            preds.max(dim=0),
            "min: ",
            preds.min(dim=0),
        )
        return preds
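For reference, here is how the factory above might be wired up. SimilarityCalculatorType is not shown in the snippet, so the enum below is an assumption, and embeddings stands for whatever dict the ImageBind model returns.

from enum import Enum

class SimilarityCalculatorType(Enum):
    COSINE = "cosine"
    SOFTMAX = "softmax"

# Hypothetical usage: `embeddings` is the dict returned by the ImageBind model.
calculator = SimilarityCalculator.create_instance(SimilarityCalculatorType.COSINE, device="cpu")
preds = calculator(embeddings)              # shape (num_images, num_texts)
best_text_per_image = preds.argmax(dim=-1)  # index of the best matching text per image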
    def _infer(self, vision, text, threshold=0.5):
        inputs = {
            # On GPU, vision preprocessing is placed on cuda:1 and text on cuda:0;
            # on CPU both stay on the same device.
            ModalityType.VISION: ImageBindClassifier._load_and_transform_vision_data_np(
                vision,
                self.device + ":1" if self.device == "cuda" else self.device,
            ),
            ModalityType.TEXT: ImageBindClassifier._load_and_transform_text(
                text,
                self.bpe_path,
                self.device + ":0" if self.device == "cuda" else self.device,
            ),
        }
        with torch.no_grad():
            embeddings = self.model(inputs)
        # self.similarity_calculator is either SoftmaxSimilarity or CosineSimilarity
        # (the latter built on sklearn.metrics.pairwise.cosine_similarity).
        preds = self.similarity_calculator(embeddings)
        return preds
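And a hypothetical call to _infer; only the method above appears in the thread, so the ImageBindClassifier constructor arguments here are assumptions.

# Hypothetical usage; the constructor signature is an assumption, only _infer is shown above.
classifier = ImageBindClassifier(
    device="cuda",
    bpe_path="bpe/bpe_simple_vocab_16e6.txt.gz",
    similarity_type=SimilarityCalculatorType.COSINE,
)
preds = classifier._infer(
    vision=["dog.jpg", "cat.jpg"],     # image paths
    text=["a dog", "a cat", "a car"],  # candidate captions
)
print(preds.argmax(dim=-1))  # index of the best matching caption per image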
@bakachan19
Thank you so so much @XinyueZ.