Discrepancy in embeddings similarity between Infinity and SentenceTransformer / HF TEI
I’ve observed significant discrepancies in the embeddings produced by Infinity compared to SentenceTransformer for the same model:
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2.
Example
When computing the cosine similarity between the embeddings of the two inputs mountains and joyeux noel:
import numpy as np

def cosine_similarity(vector1, vector2):
    """
    Calculate cosine similarity between two vectors
    """
    dot_product = np.dot(vector1, vector2)
    magnitude1 = np.linalg.norm(vector1)
    magnitude2 = np.linalg.norm(vector2)
    if magnitude1 == 0 or magnitude2 == 0:
        return 0
    return dot_product / (magnitude1 * magnitude2)
- Infinity result: 0.497474
- SentenceTransformer result: 0.354079
The similarity score from SentenceTransformer matches what is reported in both:
- Hugging Face UI
- Hugging Face Text Embeddings Inference (TEI)
This suggests Infinity is producing different embeddings than the expected reference implementations.
Reproduction
Infinity (CPU):
docker run --rm -it \
-p 8080:8080 \
michaelf34/infinity:latest-cpu \
v2 \
--engine optimum \
--port 8080 \
--model-id sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
Hugging Face TEI (CPU):
docker run -p 8081:80 -v $volume:/data --pull always \
ghcr.io/huggingface/text-embeddings-inference-cpu:1.8 \
--model-id sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
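Roughly, the two servers can then be queried and compared like this (a sketch, not verbatim what I ran; it assumes Infinity's OpenAI-compatible /embeddings route on port 8080 and TEI's /embed route on port 8081 from the commands above, plus the requests package):

import numpy as np
import requests

sentences = ["mountains", "joyeux noel"]

def cos(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Infinity: OpenAI-compatible /embeddings endpoint (port 8080 above)
r = requests.post(
    "http://localhost:8080/embeddings",
    json={"model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
          "input": sentences},
)
inf = [d["embedding"] for d in r.json()["data"]]
print("infinity:", cos(inf[0], inf[1]))

# TEI: /embed endpoint (port 8081 above), returns a list of vectors
r = requests.post("http://localhost:8081/embed", json={"inputs": sentences})
tei = r.json()
print("tei:", cos(tei[0], tei[1]))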
SentenceTransformer code
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
text1, text2 = "mountains", "joyeux noel"
embeddings = model.encode([text1, text2])
Hi @fabriziofortino,
Thanks for the detailed issue. Can you run Infinity with --engine torch and see if you get the expected output?
@wirthual results with --engine torch look good. Is this a bug? The documentation for the CPU Docker container says: "Optimum/ONNX is often the preferred engine."
With optimum, Infinity uses the quantized model by default on CPU if one is provided in the HF repo. In order to compare the outputs, we need to make sure we run the same model with the same settings.
Setting the same pooling method by adding --pooling-method mean for Infinity, and loading the same model file in sentence-transformers, will give the same result. (Infinity prints out which model file it loads during startup.)
So to compare to the same model:
model = SentenceTransformer(
    'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2',
    backend="onnx",
    model_kwargs={"file_name": "onnx/model_quint8_avx2.onnx"},
)
Results in similarities of:
0.35728836 and 0.3572884
@wirthual thanks for the explanation. I re-ran the same test with the above inputs and the following command:
docker run --rm -it \
-p 8080:8080 \
michaelf34/infinity:latest-cpu \
v2 \
--engine optimum --pooling-method mean \
--port 8080 \
--model-id sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
The resulting cosine similarity is 0.380676, which is closer to the reference value but still significantly different.
Hi @wirthual, a couple of questions:
- How do we disable quantization with optimum? Do we use --dtype float32 for the above? And it will also lead to an increase in latency, right?
- Also, for --pooling-method the default is auto, so it selects cls? BTW, shouldn't it use the model config, which specifies mean pooling for this model (see the snippet below): https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2/blob/main/1_Pooling/config.json
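For reference, that pooling config can be inspected quickly (a small sketch, assuming huggingface_hub is installed):

import json
from huggingface_hub import hf_hub_download

# Download and print the pooling config linked above.
path = hf_hub_download(
    repo_id="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    filename="1_Pooling/config.json",
)
with open(path) as f:
    print(json.dumps(json.load(f), indent=2))
# For this model it sets "pooling_mode_mean_tokens": true
# and "pooling_mode_cls_token": false.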
@fabriziofortino
I assume the model used is still quantized (on CPU that's the default). I'm working on PR #635 to make selecting the unquantized version easy. I did a quick test on that branch and the similarity value matches HF and ST:
import asyncio

import numpy as np
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine

def cosine_similarity(vector1, vector2):
    """
    Calculate cosine similarity between two vectors
    """
    dot_product = np.dot(vector1, vector2)
    magnitude1 = np.linalg.norm(vector1)
    magnitude2 = np.linalg.norm(vector2)
    if magnitude1 == 0 or magnitude2 == 0:
        return 0
    return dot_product / (magnitude1 * magnitude2)

sentences = ["mountains", "joyeux noel"]

array = AsyncEngineArray.from_args([
    EngineArgs(
        model_name_or_path="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        engine="optimum",
        pooling_method="mean",
        onnx_disable_optimize=True,
        onnx_do_not_prefer_quantized=True,
    )
])

async def embed_text(engine: AsyncEmbeddingEngine):
    async with engine:
        embeddings, usage = await engine.embed(sentences=sentences)
        print(cosine_similarity(embeddings[0], embeddings[1]))

asyncio.run(embed_text(array[0]))
Results in 0.35407886, which matches the SentenceTransformer value above.
Hi @amit-jain
- How do we disable quantization with optimum? Do we use --dtype float32 for the above? And it will also lead to an increase in latency, right?

Going forward, the unquantized version can be selected by providing the onnx_do_not_prefer_quantized flag.
Yes, that's the typical tradeoff: the unquantized model is slower.
- Also, for --pooling-method the default is auto, so it selects cls? BTW, shouldn't it use the model config, which specifies mean pooling for this model: https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2/blob/main/1_Pooling/config.json

Yes, that's the current implementation. Good point, I will look into selecting the pooling method based on the config.
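Not the final implementation, but a rough sketch of how config-based selection could work (hypothetical helper, not current Infinity code; it assumes the standard sentence-transformers 1_Pooling/config.json layout and huggingface_hub):

import json
from huggingface_hub import hf_hub_download

def pooling_method_from_config(model_id: str) -> str:
    # Hypothetical helper: read the sentence-transformers pooling config
    # and map the pooling_mode_* flags to an Infinity pooling method.
    path = hf_hub_download(repo_id=model_id, filename="1_Pooling/config.json")
    with open(path) as f:
        cfg = json.load(f)
    if cfg.get("pooling_mode_mean_tokens"):
        return "mean"
    if cfg.get("pooling_mode_cls_token"):
        return "cls"
    return "mean"  # fallback if the config is missing or ambiguous

print(pooling_method_from_config(
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2"
))  # expected to print "mean" for this model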