Discrepancy in embedding similarity between Infinity and SentenceTransformer / HF TEI

Open fabriziofortino opened this issue 4 months ago • 7 comments

I’ve observed significant discrepancies in the embeddings produced by Infinity compared to SentenceTransformer for the same model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2.

Example

When computing the cosine similarity between the embeddings of the two inputs "mountains" and "joyeux noel" with the helper below:

import numpy as np

def cosine_similarity(vector1, vector2):
    """
    Calculate cosine similarity between two vectors
    """
    dot_product = np.dot(vector1, vector2)
    magnitude1 = np.linalg.norm(vector1)
    magnitude2 = np.linalg.norm(vector2)
    
    if magnitude1 == 0 or magnitude2 == 0:
        return 0
    
    return dot_product / (magnitude1 * magnitude2)

  • Infinity result: 0.497474
  • SentenceTransformer result: 0.354079

The similarity score from SentenceTransformer matches what is reported in both:

  • Hugging Face UI
  • Hugging Face Text Embeddings Inference (TEI)

This suggests Infinity is producing different embeddings than the expected reference implementations.

Reproduction

Infinity (CPU):

docker run --rm -it \
  -p 8080:8080 \
  michaelf34/infinity:latest-cpu \
  v2 \
    --engine optimum \
    --port 8080 \
    --model-id sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

Hugging Face TEI (CPU):

docker run -p 8081:80 -v $volume:/data --pull always \
  ghcr.io/huggingface/text-embeddings-inference-cpu:1.8 \
  --model-id sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
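
With both containers running, the embeddings can be fetched over HTTP and compared. A minimal sketch, assuming Infinity's OpenAI-compatible /embeddings route on port 8080 and TEI's /embed route on port 8081, as started above:

import numpy as np
import requests

sentences = ["mountains", "joyeux noel"]

# Assumption: Infinity serves an OpenAI-compatible embeddings endpoint.
resp = requests.post("http://localhost:8080/embeddings", json={
    "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "input": sentences,
})
infinity_emb = [d["embedding"] for d in resp.json()["data"]]

# Assumption: TEI's /embed returns a plain list of embedding vectors.
tei_emb = requests.post("http://localhost:8081/embed", json={"inputs": sentences}).json()

for name, (a, b) in [("infinity", infinity_emb), ("tei", tei_emb)]:
    a, b = np.asarray(a), np.asarray(b)
    print(name, np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))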

SentenceTransformer code:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
embeddings = model.encode(["mountains", "joyeux noel"])

fabriziofortino avatar Aug 21 '25 21:08 fabriziofortino

Hi @fabriziofortino ,

Thanks for the detailed issue. Can you run Infinity with --engine torch and see if you get the expected output?
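
For reference, that would be the same command as above with only the engine flag swapped:

docker run --rm -it \
  -p 8080:8080 \
  michaelf34/infinity:latest-cpu \
  v2 \
    --engine torch \
    --port 8080 \
    --model-id sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2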

wirthual avatar Aug 22 '25 07:08 wirthual

@wirthual results with --engine torch look good. Is this a bug? The documentation for the CPU Docker container says: "Optimum/ONNX is often the preferred engine."

fabriziofortino avatar Aug 22 '25 07:08 fabriziofortino

With the optimum engine, Infinity uses the quantized model by default on CPU if one is provided by the HF repo. To compare the outputs, we need to make sure we run the same model with the same settings.
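
To see which ONNX variants the repo actually ships (and therefore what the quantized-by-default behavior can pick up), you can list the repo files; a minimal sketch using huggingface_hub:

from huggingface_hub import list_repo_files

# List all files in the model repo and keep only the ONNX variants;
# the quantized ones carry a suffix such as _quint8_avx2.
files = list_repo_files("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
print([f for f in files if f.endswith(".onnx")])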

Setting the same pooling method by adding --pooling-method mean for Infinity and using the same model file in SentenceTransformer will give the same result. (Infinity prints which model it loads during startup.)

So to compare to the same model:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2',
    backend="onnx",
    model_kwargs={"file_name": "onnx/model_quint8_avx2.onnx"},
)
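
# Encode the two inputs and reuse the cosine helper from the first comment
# (a sketch; assumes cosine_similarity as defined above):
embeddings = model.encode(["mountains", "joyeux noel"])
print(cosine_similarity(embeddings[0], embeddings[1]))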

This results in similarities of 0.35728836 and 0.3572884, i.e. the two now match.

wirthual avatar Aug 22 '25 08:08 wirthual

@wirthual thanks for the explanation. I re-ran the same test with the above inputs and the following command:

docker run --rm -it \
  -p 8080:8080 \
  michaelf34/infinity:latest-cpu \
  v2 \
    --engine optimum --pooling-method mean \
    --port 8080 \
    --model-id sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

The resulting cosine similarity is 0.380676, which is closer than the previous value but still significantly different from the reference.

fabriziofortino avatar Aug 22 '25 15:08 fabriziofortino

Hi @wirthual , a couple of questions:

  • How do we disable quantization with optimum? Do we use --dtype float32 for the above? And will it also lead to an increase in latency?
  • Also, for --pooling-method the default is auto, so does it select cls? BTW, should it not use the model config, which specifies mean pooling for this model (see the sketch below)? https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2/blob/main/1_Pooling/config.json
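
For reference, that pooling config can be inspected directly; a minimal sketch using huggingface_hub:

import json
from huggingface_hub import hf_hub_download

# Fetch 1_Pooling/config.json from the model repo; for this model it should
# show pooling_mode_mean_tokens set to true.
path = hf_hub_download(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "1_Pooling/config.json",
)
with open(path) as f:
    print(json.load(f))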

amit-jain avatar Aug 25 '25 10:08 amit-jain

@fabriziofortino

I assume the model used is still quantized (on CPU, that's the default). I'm working on PR #635 for easy selection of the unquantized version. I did a quick test on that branch, and the similarity value matches HF and ST:

import asyncio

import numpy as np
from infinity_emb import AsyncEmbeddingEngine, AsyncEngineArray, EngineArgs

def cosine_similarity(vector1, vector2):
    """
    Calculate cosine similarity between two vectors
    """
    dot_product = np.dot(vector1, vector2)
    magnitude1 = np.linalg.norm(vector1)
    magnitude2 = np.linalg.norm(vector2)

    if magnitude1 == 0 or magnitude2 == 0:
        return 0

    return dot_product / (magnitude1 * magnitude2)


sentences = ["mountains", "joyeux noel"]
array = AsyncEngineArray.from_args([
    EngineArgs(
        model_name_or_path="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        engine="optimum",
        pooling_method="mean",
        onnx_disable_optimize=True,
        onnx_do_not_prefer_quantized=True,
    )
])

async def embed_text(engine: AsyncEmbeddingEngine):
    async with engine:
        embeddings, usage = await engine.embed(sentences=sentences)
        print(cosine_similarity(embeddings[0], embeddings[1]))

asyncio.run(embed_text(array[0]))

This results in 0.35407886, matching the SentenceTransformer value from the original report.

wirthual avatar Aug 25 '25 13:08 wirthual

Hi @amit-jain

  • How do we disable quantization with optimum? Do we use --dtype float32 for the above? And will it also lead to an increase in latency?

Going forward, the unquantized version can be selected by providing the onnx_do_not_prefer_quantized flag.

Yes, that's the typical tradeoff.

  • Also, for --pooling-method the default is auto, so does it select cls? BTW, should it not use the model config, which specifies mean pooling for this model? https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2/blob/main/1_Pooling/config.json

Yes, that's the current implementation. Good point, I will look into selecting the pooling method based on the model config.

wirthual avatar Aug 25 '25 14:08 wirthual