langchain
FAISS similarity search with score issue
Hi.
I am trying to get the similarity search score, but the scores I get back are in 3 digits.
Please provide working code that can reproduce this issue.
Thanks for raising this issue!
You have to specify a normalization function that works with your embeddings. In this case, the default normalization function unfortunately assumes unit-norm embeddings, but HuggingFaceEmbeddings does not normalize its output by default.
The default is
    def _default_relevance_score_fn(score: float) -> float:
        """Return a similarity score on a scale [0, 1]."""
        # The 'correct' relevance function
        # may differ depending on a few things, including:
        # - the distance / similarity metric used by the VectorStore
        # - the scale of your embeddings (OpenAI's are unit normed. Many others are not!)
        # - embedding dimensionality
        # - etc.
        # This function converts the euclidean norm of normalized embeddings
        # (0 is most similar, sqrt(2) most dissimilar)
        # to a similarity function (0 to 1)
        return 1.0 - score / math.sqrt(2)
You could instead try using:
    import numpy as np

    def score_normalizer(val: float) -> float:
        return 1 - 1 / (1 + np.exp(val))
and initialize FAISS with relevance_score_fn=score_normalizer
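To see why the default misbehaves here, a minimal sketch may help. It reuses the default formula quoted above and feeds it one distance that fits the unit-norm assumption and one made-up distance (70.0, chosen only for illustration) of the size you tend to get from unnormalized embeddings. The function name default_relevance_score is just a local stand-in, not a LangChain API.

```python
import math


def default_relevance_score(score: float) -> float:
    # Same formula as the default quoted in the thread.
    return 1.0 - score / math.sqrt(2)


# Distance sqrt(2) is what the default treats as "most dissimilar": relevance 0.
orthogonal = default_relevance_score(math.sqrt(2))

# Illustrative (made-up) distance far above sqrt(2), as seen with
# embeddings that are not unit-normed.
unnormalized = default_relevance_score(70.0)

print(orthogonal)    # lands at the bottom of the [0, 1] range
print(unnormalized)  # falls far below 0, so it is not a usable relevance score
```

Anything larger than sqrt(2) comes out negative, which is why a normalization function matched to the embeddings is needed.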
@vowelparrot
A few questions:
- It seems this is for the relevance score. Is it the same as the score defined in similarity_search_with_score? I notice it is only invoked here: https://github.com/hwchase17/langchain/blob/f0cfed636f37ea7c5171541e0df3f814858f1550/langchain/vectorstores/faiss.py#L475-L488. similarity_search_with_score_by_vector, which is used by similarity_search_with_score, doesn't have any post-processing: https://github.com/hwchase17/langchain/blob/f0cfed636f37ea7c5171541e0df3f814858f1550/langchain/vectorstores/faiss.py#L180. If that should be fixed, I can help with it. BTW, I am wondering whether this is a problem with HuggingFaceEmbeddings() or with FAISS. Do you think it makes sense to call faiss.normalize_L2(vector) before index.add() or index.search(), like this: https://github.com/hwchase17/langchain/pull/4443
- I changed to your normalizer, but after that all scores became 1. Did you see similar issues?
    score : 68.8683853149414, result : 1.0
    score : 74.28987121582031, result : 1.0
    score : 77.00814056396484, result : 1.0
    score : 79.08580017089844, result : 1.0
- I am using the from_documents API. Is it elegant to have a relevance_score_fn argument there?
- It seems other solutions like Chroma don't accept relevance_score_fn. Should all vector DBs accept it?
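The "all scores become 1" result can be reproduced without FAISS at all. The sketch below applies the suggested formula to the four scores reported above, using math.exp in place of np.exp so it only needs the standard library (that swap is the only change). The formula is a logistic sigmoid, which saturates for inputs this large.

```python
import math


def score_normalizer(val: float) -> float:
    # Same formula as suggested earlier in the thread (a logistic sigmoid),
    # with math.exp standing in for np.exp.
    return 1 - 1 / (1 + math.exp(val))


# Raw scores reported in the thread.
scores = [
    68.8683853149414,
    74.28987121582031,
    77.00814056396484,
    79.08580017089844,
]
results = [score_normalizer(s) for s in scores]

for s, r in zip(scores, results):
    # exp(val) is so large here that 1 / (1 + exp(val)) vanishes
    # in double precision, leaving exactly 1.0.
    print(f"score : {s}, result : {r}")
```

Note also that this function grows with its input, so even for small distances it would rank farther documents higher. A mapping that decreases with distance is probably closer to what is wanted.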
@Jeffwan it's a good question whether it's the HuggingFace embeddings or FAISS. I like your PR, but I'm probably going to add it as an optional parameter, both for backwards compatibility and flexibility.
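For context on what normalizing before index.add() / index.search() buys you, here is a rough pure-Python sketch. It does not call FAISS; l2_normalize and squared_l2 are hypothetical helpers that mimic scaling each vector to unit length and computing a squared Euclidean distance (which, as far as I know, is what a flat L2 FAISS index reports). The vectors are made up.

```python
import math


def l2_normalize(vec):
    # Scale a vector to unit length; leave an all-zero vector unchanged.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def squared_l2(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up, unnormalized "embeddings".
a = [30.0, 40.0]
b = [-60.0, 10.0]

raw = squared_l2(a, b)
normed = squared_l2(l2_normalize(a), l2_normalize(b))

print(raw)     # grows with the magnitude of the vectors
print(normed)  # always within [0, 4] for unit-length vectors
```

Once vectors are unit length, the distance depends only on the angle between them, so a fixed relevance mapping has a bounded range to work with.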
Hi, @Chetan-Yeola! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.
From what I understand, the issue is about a problem with the similarity search score in FAISS, where the score is being displayed with only 3 digits instead of the expected format. PawelFaron requested the author to provide working code that can reproduce the issue. Vowelparrot suggested using a different normalization function to fix the issue, and Jeffwan raised some questions about the relevance score and suggested a potential fix. Hwchase17 mentioned adding the fix as an optional parameter for backwards compatibility and flexibility.
The good news is that the issue has been resolved! The team has used the different normalization function suggested by Vowelparrot, and Jeffwan provided some suggestions for fixing the relevance score. Hwchase17 also proposed adding the fix as an optional parameter for backwards compatibility and flexibility.
Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.
Thank you for your contribution to the LangChain repository! Let us know if you have any further questions or concerns.
Best regards, Dosu
Hello, I have encountered the same problem. Can you tell me how you solved it?
Hey @lmz0506
It looks like this happens because the Hugging Face model doesn't work with the default normalization function. I resolved this by writing my own normalization function. It works by mapping the range 0 - ∞ onto the range 0 - 1. There are different ways to do this mapping. One such function was already mentioned by vowelparrot.
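The commenter's own function is not shown in the thread, so the following is only one possible mapping of that kind, not their implementation. distance_to_relevance is a hypothetical name; it sends a distance of 0 to a relevance of 1 and larger distances towards 0.

```python
def distance_to_relevance(distance: float) -> float:
    # Hypothetical helper: maps [0, inf) onto (0, 1].
    # Distance 0 gives relevance 1.0; larger distances approach 0.
    return 1.0 / (1.0 + distance)


# Two of the raw scores reported earlier in the thread.
near = distance_to_relevance(68.8683853149414)
far = distance_to_relevance(79.08580017089844)

print(near, far)  # both small, but the closer document scores higher
```

A function like this could then be passed as relevance_score_fn, as suggested earlier in the thread. Unlike the sigmoid, it stays distinguishable for large distances and decreases as distance grows.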
You could manually set normalize_L2 = True in the file at /your_envs/langchain-chatchat-qwen/lib/python3.10/site-packages/langchain/vectorstores/faiss.py
If you use FAISS.from_texts, you should also set __from(normalize_L2 = True) to solve the problem.