Precision of HuggingFaceEmbeddings.embed_query changes
System Info
Langchain version: 0.0.173
numpy version: 1.24.3
Related Components
- [X] Embedding Models
Reproduction
from sentence_transformers import SentenceTransformer
import numpy as np
from langchain.embeddings import HuggingFaceEmbeddings
t = 'langchain embedding'
m = HuggingFaceEmbeddings(encode_kwargs={"normalize_embeddings": True})
# SentenceTransformer embeddings with unit norm
x = SentenceTransformer(m.model_name).encode(t, normalize_embeddings=True)
# Langchain.Huggingface embeddings with unit norm
y = m.embed_query(t)
print(f'L2 norm of SentenceTransformer: {np.linalg.norm(x)}. \nL2 norm of Langchain.Huggingface: {np.linalg.norm(y)}')
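The discrepancy can be reproduced without any model at all. The following sketch (my own minimal example, not LangChain code) simulates what happens when a float32 array that was normalized in float32 arithmetic is converted to a list of Python floats and its norm is recomputed in float64:

```python
import numpy as np

# Simulate an embedding: a float32 vector normalized in float32 arithmetic,
# the same dtype SentenceTransformer returns.
rng = np.random.default_rng(0)
v = rng.standard_normal(384).astype(np.float32)
v /= np.linalg.norm(v)  # unit norm, computed in float32

# embed_query ultimately calls .tolist(), which widens each float32
# value to a 64-bit Python float.
as_list = v.tolist()

norm32 = float(np.linalg.norm(v))                   # norm in float32
norm64 = float(np.linalg.norm(np.asarray(as_list)))  # norm in float64

print(f"float32 norm: {norm32}")
print(f"float64 norm of .tolist() values: {norm64}")
```

Both norms are close to 1, but the float64 recomputation generally differs from 1 in the last digits, which matches the 1.0000000445724682 seen above.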
Expected behavior
Both L2 norms should be 1, but I got the following:
L2 norm of SentenceTransformer: 1.0.
L2 norm of Langchain.Huggingface: 1.0000000445724682
I think the problem comes from this code: when the array is converted to a list, the values change slightly (likely because the float32 values are widened to 64-bit Python floats).
In my case, when I used this embedding in a FAISS vector store, the relevance_score I got could not be kept between 0 and 1.
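One workaround, until the precision issue is resolved upstream, is to re-normalize query embeddings in float64 before handing them to the vector store. This is a hedged sketch of my own (`renormalize` is a hypothetical helper, not a LangChain API):

```python
import numpy as np

def renormalize(vec):
    """Re-normalize an embedding in float64 arithmetic so that
    similarity-based relevance scores derived from it stay within
    the expected bounds. Sketch only, not part of LangChain."""
    arr = np.asarray(vec, dtype=np.float64)
    n = np.linalg.norm(arr)
    return (arr / n).tolist() if n > 0 else arr.tolist()

# Example: a vector whose float64 norm is slightly off from 1.
q = renormalize([0.6, 0.8000001])
print(np.linalg.norm(q))
```

After this step the float64 norm is exactly unit-length up to float64 rounding, so score formulas that assume unit vectors behave as expected.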
Hi, @alfred-liu96! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, the issue you reported is about the precision of the L2 norm calculation in the HuggingFaceEmbeddings.embed_query
function. It seems that when converting an array to a list, the numbers become slightly larger. You mentioned in a comment that when using this embedding in FAISS vector store, the relevance_score obtained cannot be limited between 0 and 1.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.
Thank you for your contribution to the LangChain project!