
FAISS similarity search with score issue

Open Chetan-Yeola opened this issue 1 year ago • 4 comments

Hi. I am trying to get the similarity search score, but the score I get is a 3-digit number. [image]

Chetan-Yeola avatar May 04 '23 05:05 Chetan-Yeola

Please provide working code that can reproduce this issue.

PawelFaron avatar May 04 '23 13:05 PawelFaron

Thanks for raising this issue!

You have to specify a normalization function that works with your embeddings. Unfortunately, the default normalization function assumes unit-norm embeddings, and HuggingFaceEmbeddings does not normalize its output by default.

The default is

import math


def _default_relevance_score_fn(score: float) -> float:
    """Return a similarity score on a scale [0, 1]."""
    # The 'correct' relevance function
    # may differ depending on a few things, including:
    # - the distance / similarity metric used by the VectorStore
    # - the scale of your embeddings (OpenAI's are unit normed. Many others are not!)
    # - embedding dimensionality
    # - etc.
    # This function converts the euclidean norm of normalized embeddings
    # (0 is most similar, sqrt(2) most dissimilar)
    # to a similarity function (0 to 1)
    return 1.0 - score / math.sqrt(2)

You could try instead to use

import numpy as np


def score_normalizer(val: float) -> float:
    return 1 - 1 / (1 + np.exp(val))

And initialize Faiss with relevance_score_fn=score_normalizer
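
To illustrate why the default function breaks down here, a minimal sketch in plain Python. It reuses the default formula quoted above under a hypothetical name; the out-of-range sample values are rounded from the raw scores reported further down in this thread.

```python
import math


def default_relevance_score(score: float) -> float:
    """Same formula as the default quoted above (hypothetical standalone copy)."""
    return 1.0 - score / math.sqrt(2)


# Inside the range the default assumes (0 to sqrt(2)), results stay in [0, 1].
in_range = [default_relevance_score(d) for d in (0.0, 0.5, math.sqrt(2))]

# Raw scores of the size reported in this thread fall far outside that range,
# so the "relevance" comes out strongly negative.
out_of_range = [default_relevance_score(d) for d in (68.87, 79.09)]

print(in_range)
print(out_of_range)
```

Any score larger than sqrt(2) yields a negative relevance, which is why non-normalized embeddings need either a different relevance_score_fn or normalized vectors.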

vowelparrot avatar May 04 '23 19:05 vowelparrot

@vowelparrot

A few questions

  1. It seems this is for the relevance score. Is it the same as the score returned by similarity_search_with_score? I notice it is only invoked here: https://github.com/hwchase17/langchain/blob/f0cfed636f37ea7c5171541e0df3f814858f1550/langchain/vectorstores/faiss.py#L475-L488. similarity_search_with_score_by_vector, which similarity_search_with_score uses, doesn't do any post-processing: https://github.com/hwchase17/langchain/blob/f0cfed636f37ea7c5171541e0df3f814858f1550/langchain/vectorstores/faiss.py#L180. If that should be fixed, I can help with it. BTW, I am wondering whether this is a problem with HuggingFaceEmbeddings() or with FAISS. Do you think it makes sense to call faiss.normalize_L2(vector) before index.add() or index.search(), as in https://github.com/hwchase17/langchain/pull/4443?

  2. I changed to your normalizer, but after that all scores become 1. Did you see similar issues?

score : 68.8683853149414, result : 1.0
score : 74.28987121582031, result : 1.0
score : 77.00814056396484, result : 1.0
score : 79.08580017089844, result : 1.0
  3. I am using the from_documents API. Would it be elegant to have a relevance_score_fn argument there?

  4. It seems other solutions like Chroma don't accept relevance_score_fn. Should all vector DBs accept it?
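
A likely explanation for the results of 1.0 in question 2, sketched in plain Python with math.exp standing in for np.exp (an assumption made only to keep the example dependency-free): exp(68) is so large that 1 / (1 + exp(val)) falls below double-precision resolution, so the subtraction rounds to exactly 1.0.

```python
import math


def score_normalizer(val: float) -> float:
    # Same formula as suggested earlier in the thread, with math.exp
    # substituted for np.exp.
    return 1 - 1 / (1 + math.exp(val))


# Raw scores copied from the output listed above.
raw_scores = [
    68.8683853149414,
    74.28987121582031,
    77.00814056396484,
    79.08580017089844,
]
results = [score_normalizer(s) for s in raw_scores]
print(results)

# For small inputs the function behaves like an ordinary sigmoid.
small = score_normalizer(0.5)
print(small)
```

Note also that this function increases with val, so if val is a distance, larger distances map to higher results rather than lower ones.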

Jeffwan avatar May 10 '23 06:05 Jeffwan

@Jeffwan it's a good question whether it's the HuggingFace embeddings or FAISS. I like your PR, but I'm probably going to add it as an optional parameter, both for backwards compatibility and flexibility.

hwchase17 avatar May 18 '23 04:05 hwchase17

Hi, @Chetan-Yeola! I'm Dosu, and I'm helping the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, the issue is about a problem with the similarity search score in FAISS, where the score is being displayed with only 3 digits instead of the expected format. PawelFaron requested the author to provide working code that can reproduce the issue. Vowelparrot suggested using a different normalization function to fix the issue, and Jeffwan raised some questions about the relevance score and suggested a potential fix. Hwchase17 mentioned adding the fix as an optional parameter for backwards compatibility and flexibility.

The good news is that the issue has been resolved! The team has used the different normalization function suggested by Vowelparrot, and Jeffwan provided some suggestions for fixing the relevance score. Hwchase17 also proposed adding the fix as an optional parameter for backwards compatibility and flexibility.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your contribution to the LangChain repository! Let us know if you have any further questions or concerns.

Best regards, Dosu

dosubot[bot] avatar Sep 15 '23 16:09 dosubot[bot]

Hello, I have encountered the same problem. Can you tell me how you solved it?

lmz0506 avatar Oct 11 '23 03:10 lmz0506

Hey @lmz0506

It looks like this happens because the Hugging Face model doesn't work with the default normalize function. I resolved this by writing my own normalize function, which maps the range 0 - ∞ onto the range 0 - 1. There are different ways to do this mapping; one such function was already mentioned by vowelparrot.
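
The exact function used is not shown in this comment. As a rough sketch of the idea, here are two common ways to map a distance in [0, ∞) onto (0, 1] so that smaller distances give higher scores. The function names and the scale parameter are hypothetical, chosen for illustration only.

```python
import math


def inverse_distance_score(distance: float) -> float:
    """Map a distance in [0, inf) to (0, 1]; a distance of 0 gives 1.0."""
    return 1.0 / (1.0 + distance)


def exponential_decay_score(distance: float, scale: float = 1.0) -> float:
    """Map a distance in [0, inf) to (0, 1]; `scale` controls how fast it decays.

    `scale` is a hypothetical tuning knob: with distances around 70, a scale
    near 1.0 yields values very close to 0, so a larger scale may be needed.
    """
    return math.exp(-distance / scale)


for d in (0.0, 68.87, 79.09):
    print(d, inverse_distance_score(d), exponential_decay_score(d, scale=50.0))
```

Either function could be passed as relevance_score_fn in the way vowelparrot describes. Both keep the ordering of results intact, since they decrease monotonically with distance.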

Yanni8 avatar Feb 05 '24 10:02 Yanni8

You could manually set normalize_L2 = True in /your_envs/langchain-chatchat-qwen/lib/python3.10/site-packages/langchain/vectorstores/faiss.py, as shown below: [image]

AlvinAi96 avatar Mar 14 '24 02:03 AlvinAi96


If you use FAISS.from_texts, you should also set normalize_L2 = True in the __from call to solve the problem.
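
For context on what L2 normalization changes, a minimal pure-Python sketch with made-up vectors and hypothetical helper names (it does not call FAISS or LangChain). Once both vectors are scaled to unit length, the squared Euclidean distance equals 2 - 2·cos(θ), so it is bounded by 4 no matter how large the original embeddings were.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Made-up, non-normalized "embeddings" for illustration.
a = [3.0, 40.0, -12.0]
b = [25.0, -7.0, 60.0]

raw = squared_l2(a, b)
unit = squared_l2(l2_normalize(a), l2_normalize(b))

# Cosine similarity of the original vectors, to check the identity.
dot = sum(x * y for x, y in zip(a, b))
cosine = dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

print(raw, unit, 2 - 2 * cosine)
```

Whether passing normalize_L2=True directly to FAISS.from_texts reaches __from depends on the installed LangChain version, so treat that as an assumption to verify against your copy of faiss.py.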

LXlearning avatar May 28 '24 03:05 LXlearning