Wrong similarity score for identical embeddings
Testing the bge-m3 embedding model, I wanted to see how it behaves under varying scenarios. After generating sparse embeddings and storing them as JSON, I wanted to calculate their similarity using the _compute_single_lexical_matching_score method, which is defined in FlagEmbedding/inference/embedder/encoder_only/m3.py. However, I got a score of only 0.23 when comparing identical sparse embeddings.
Here is the output from my terminal (log messages translated from German):

Testing sparse similarity computation with converted embeddings...
Sparse Similarity Score: 0.23759149310728778
Similarity computation successful!
Sparse 1: {35542: 0.16986805200576782, 443: 0.1528966724872589, 599: 0.0936431884765625, 8647: 0.30713802576065063, 9: 0.04344563186168671, 174379: 0.2834935784339905}
Sparse 2: {35542: 0.16986805200576782, 443: 0.1528966724872589, 599: 0.0936431884765625, 8647: 0.30713802576065063, 9: 0.04344563186168671, 174379: 0.2834935784339905}
Maybe I'm wrong, but wouldn't we need some kind of normalization factor for that? Currently only a simple dot product is computed.
Since sparse embeddings are not normalized, the lexical matching score between identical embeddings is just the sum of the squared token weights, so it will generally not reach 1. It doesn't need normalization; the raw dot product is the intended score.
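To make this concrete, here is a minimal sketch that reproduces the reported score from the weights printed above, assuming the lexical matching score is a plain dot product over shared token IDs (the helper names `lexical_matching_score` and `cosine_score` are hypothetical, not the library's API). It also shows that dividing by the L2 norms would yield 1.0 for identical embeddings, if one wanted a cosine-style score:

```python
import math

# Sparse embedding copied from the terminal output above
# (token_id -> lexical weight).
sparse_1 = {35542: 0.16986805200576782, 443: 0.1528966724872589,
            599: 0.0936431884765625, 8647: 0.30713802576065063,
            9: 0.04344563186168671, 174379: 0.2834935784339905}
sparse_2 = dict(sparse_1)  # identical embedding

def lexical_matching_score(a, b):
    # Plain dot product over token IDs present in both embeddings.
    return sum(w * b[t] for t, w in a.items() if t in b)

def cosine_score(a, b):
    # Normalized variant: divide the dot product by both L2 norms.
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return lexical_matching_score(a, b) / (norm_a * norm_b)

print(lexical_matching_score(sparse_1, sparse_2))  # ~0.2376 (sum of squared weights)
print(cosine_score(sparse_1, sparse_2))            # 1.0 for identical embeddings
```

For identical inputs the dot product collapses to the sum of squared weights, which for these values is exactly the 0.23759... seen in the terminal output; the score only reaches 1 when the vectors happen to have unit norm.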