Is it OK to use cosine_similarity instead of softmax for VISION x TEXT?

Open XinyueZ opened this issue 1 year ago • 5 comments

Hey,

I just want to know if sklearn's cosine_similarity can replace the softmax.

Thanks

XinyueZ avatar Jun 19 '23 11:06 XinyueZ

I'm also confused about the example code, but I guess it's OK. It looks like the softmax here is just used to demonstrate the similarity between the different modalities provided by ImageBind.

DaNious avatar Jul 05 '23 07:07 DaNious

Yeah, agree with @DaNious. Tbh, I did the similarity comparison with cosine_similarity and also got a proper result. With the "softmax" example, maybe I'm wrong, but people could confuse it with the "activation" concept in the NN forward pass, which represents a probability.
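For intuition, here is a small sketch (with random stand-in embeddings, not real ImageBind outputs) of why the two approaches agree on rankings: softmax is a monotonic transform of each row of cosine scores, so the best-matching text per image is the same either way:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# hypothetical unit-norm embeddings: 3 images, 4 texts, dim 8
vision = F.normalize(torch.randn(3, 8), dim=-1)
text = F.normalize(torch.randn(4, 8), dim=-1)

cos = vision @ text.T               # cosine similarity (norms are all 1)
probs = torch.softmax(cos, dim=-1)  # one distribution over texts per image

# softmax preserves the ordering within each row,
# so the top-ranked text is identical for both scores
assert torch.equal(cos.argmax(dim=-1), probs.argmax(dim=-1))
print(probs.sum(dim=-1))  # each row sums to 1
```

Note the demo in the ImageBind README applies softmax along a fixed dimension of the `vision @ text.T` matrix; only the values change (similarities vs. a normalized distribution), not the ranking.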

XinyueZ avatar Jul 05 '23 08:07 XinyueZ

@XinyueZ, I am interested in computing the text x image similarity, but I am confused about how to do this. Could you please share some example code? Much appreciated. Thank you!

bakachan19 avatar Jul 05 '23 16:07 bakachan19

import torch
from enum import Enum
from sklearn.metrics.pairwise import cosine_similarity
from imagebind.models.imagebind_model import ModalityType  # from the ImageBind repo


class SimilarityCalculatorType(Enum):
    COSINE = "cosine"
    SOFTMAX = "softmax"


class SimilarityCalculator:
    def __init__(self, device):
        self.device = device

    def __call__(self, embeddings):
        raise NotImplementedError

    @staticmethod
    def create_instance(similarity_type, device):
        if similarity_type == SimilarityCalculatorType.COSINE:
            return CosineSimilarity(device)
        elif similarity_type == SimilarityCalculatorType.SOFTMAX:
            return SoftmaxSimilarity(device)
        else:
            raise ValueError(f"Unknown similarity type: {similarity_type}")


class CosineSimilarity(SimilarityCalculator):
    def __init__(self, device):
        super().__init__(device)

    def __call__(self, embeddings):
        preds = cosine_similarity(
            embeddings[ModalityType.VISION].cpu().numpy(),
            embeddings[ModalityType.TEXT].cpu().numpy(),
        )
        preds = torch.from_numpy(preds).to(self.device)
        print(
            preds.shape,
            "\n",
            preds,
            "max: ",
            preds.max(dim=0),
            "min: ",
            preds.min(dim=0),
        )
        return preds


class SoftmaxSimilarity(SimilarityCalculator):
    def __init__(self, device):
        super().__init__(device)

    def __call__(self, embeddings):
        preds = torch.softmax(
            embeddings[ModalityType.VISION] @ embeddings[ModalityType.TEXT].T,
            dim=0,
        )
        print(
            preds.shape,
            "\n",
            preds,
            "sum: ",
            preds.sum(dim=0),
            "max: ",
            preds.max(dim=0),
            "min: ",
            preds.min(dim=0),
        )
        return preds


# The method below belongs to an ImageBindClassifier class (not shown in full):
    def _infer(self, vision, text, threshold=0.5):
        inputs = {
            ModalityType.VISION: ImageBindClassifier._load_and_transform_vision_data_np(
                vision,
                self.device + ":1" if self.device == "cuda" else self.device,
            ),
            ModalityType.TEXT: ImageBindClassifier._load_and_transform_text(
                text,
                self.bpe_path,
                self.device + ":0" if self.device == "cuda" else self.device,
            ),
        }

        with torch.no_grad():
            embeddings = self.model(inputs)

        preds = self.similarity_calculator(embeddings)  # SoftmaxSimilarity or sklearn-based CosineSimilarity
        return preds
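And here is a self-contained sketch of the core of both calculators, with random stand-in tensors instead of real ImageBind embeddings: the sklearn `cosine_similarity` path is equivalent to normalizing the embeddings in torch and taking a matrix product.

```python
import torch
import torch.nn.functional as F
from sklearn.metrics.pairwise import cosine_similarity

torch.manual_seed(0)
vision = torch.randn(2, 16)  # stand-in for embeddings[ModalityType.VISION]
text = torch.randn(3, 16)    # stand-in for embeddings[ModalityType.TEXT]

# sklearn path, as in CosineSimilarity.__call__
sk = torch.from_numpy(cosine_similarity(vision.numpy(), text.numpy())).float()

# equivalent pure-torch path: L2-normalize, then matrix product
pt = F.normalize(vision, dim=-1) @ F.normalize(text, dim=-1).T

assert torch.allclose(sk, pt, atol=1e-6)
print(sk.shape)  # (2, 3): one score in [-1, 1] per (image, text) pair
```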

@bakachan19

XinyueZ avatar Jul 05 '23 19:07 XinyueZ

Thank you so so much @XinyueZ.

bakachan19 avatar Jul 06 '23 11:07 bakachan19