The scores returned by 'similarity_search_with_score' are NOT in descending order

Open arielsho opened this issue 1 year ago • 1 comments

Hello, I came across a problem when using "similarity_search_with_score". According to the docs, it should return "not only the documents but also the similarity score of the query to them".

docs_and_scores = db.similarity_search_with_score(query)

However, I noticed the scores for the top-5 docs are [0.40305698, 0.43590686, 0.4464777, 0.46140206, 0.46226424], which are not sorted in descending order. Has anyone had the same problem?

Apr 13 '23 17:04 arielsho

I have the exact same problem, and the chain uses the first document as context for the model, so the response is way off.

Apr 14 '23 10:04 sebastianmacarescu

The score means the distance. The first one has the min value.
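
To illustrate the point about distances, here is a minimal sketch in plain Python (no vector store involved). The toy vectors and document names are made up for illustration, and Euclidean (L2) distance is assumed; the actual metric depends on how the store was configured.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy 2-D "embeddings"; names and values are hypothetical.
query = [1.0, 0.0]
docs = {
    "near": [0.9, 0.1],
    "middle": [0.5, 0.5],
    "far": [-1.0, 0.0],
}

# Sort ascending by distance: the closest (most relevant) document comes first,
# so the scores increase down the list, as in the original post.
ranked = sorted(
    ((name, l2_distance(query, vec)) for name, vec in docs.items()),
    key=lambda pair: pair[1],
)

for name, dist in ranked:
    print(f"{name}: {dist:.3f}")
```

With a distance metric, an ascending list of scores is therefore the expected "best first" ordering.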

Apr 17 '23 18:04 HMS97

@HMS97 That makes sense, thank you!

Apr 17 '23 18:04 arielsho

It seems this is not standardized across databases? If I use FAISS the score is "higher if closer to 0" but if I use Pinecone the score is "higher if closer to 1"... this doesn't make sense.
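
One way to work around the inconsistency is to convert every store's raw score into a single "higher is better" scale before comparing. The sketch below assumes unit-length embeddings and treats the scores from the original post as L2 distances purely for illustration; the helper names are hypothetical, not part of any library. LangChain's base VectorStore also exposes similarity_search_with_relevance_scores, which is intended to return scores normalized to [0, 1], though how well a given store supports it is worth checking.

```python
import math


def l2_to_relevance(distance: float) -> float:
    """Map an L2 distance to a 'higher is better' score.

    Assumes unit-length vectors, for which L2 distance is treated
    here as ranging from 0 to sqrt(2).
    """
    return 1.0 - distance / math.sqrt(2)


def cosine_distance_to_relevance(distance: float) -> float:
    """Map a cosine distance (1 - cosine similarity) back to a similarity."""
    return 1.0 - distance


# Scores from the original post, assumed (not confirmed) to be L2 distances.
raw = [0.40305698, 0.43590686, 0.4464777, 0.46140206, 0.46226424]
relevance = [l2_to_relevance(d) for d in raw]

for d, r in zip(raw, relevance):
    print(f"distance={d:.4f} -> relevance={r:.4f}")
```

After conversion the list is in descending order, which matches the convention of stores that report similarity directly.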

May 11 '23 09:05 acalatrava

Does anyone have a solution for this problem?

In pgvector.py, the result order is hard-coded as ascending, so if we use pgvector as the retriever, it returns the least relevant documents rather than the most relevant ones we would expect.

https://github.com/langchain-ai/langchain/blob/bed06a4f4ab802bedb3533021da920c05a736810/libs/langchain/langchain/vectorstores/pgvector.py#L458C17-L458C17

Nov 15 '23 11:11 huantt

@HMS97: "The score means the distance. The first one has the min value."

It is quite confusing when the score represents distance rather than relevance. If you're correct, how does score_threshold work?

In vectorstores/base.py, they say:

score_threshold: Minimum relevance threshold
                        for similarity_score_threshold

This means that if you specify score_threshold in the as_retriever function, only the documents with the score greater than this value will be returned.
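
A minimal sketch of why this matters, in plain Python: the same "keep scores above the threshold" rule gives opposite results depending on whether it is fed relevance scores or raw distances. The helper name, the sample values, the `>=` comparison, and the `1.0 - d` conversion (which assumes cosine distance) are all assumptions for illustration.

```python
def filter_by_threshold(scored, score_threshold):
    """Keep (doc, score) pairs whose score is at least the threshold."""
    return [(doc, s) for doc, s in scored if s >= score_threshold]


# Hypothetical results: (document id, distance), lower is better.
distances = [("a", 0.10), ("b", 0.45), ("c", 0.90)]

# Same results as relevance scores, higher is better (assumes cosine distance).
relevances = [(doc, 1.0 - d) for doc, d in distances]

kept_ok = filter_by_threshold(relevances, 0.5)     # keeps the closest docs
kept_wrong = filter_by_threshold(distances, 0.5)   # keeps only the farthest doc

print([doc for doc, _ in kept_ok])
print([doc for doc, _ in kept_wrong])
```

If a store hands distances to a threshold that expects relevance, the retriever ends up discarding the best matches.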

Nov 15 '23 14:11 huantt

I fixed it by extending the PGVector class and overriding the function so that it returns a similarity score instead of a distance.

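from typing import List, Optional, Tuple

from langchain.docstore.document import Document
from langchain.vectorstores.pgvector import PGVector


class MyPGVector(PGVector):
    def similarity_search_with_score(
            self,
            query: str,
            k: int = 4,
            filter: Optional[dict] = None,
    ) -> List[Tuple[Document, float]]:
        # The parent returns (document, distance) pairs, smallest distance first.
        docs = super().similarity_search_with_score(query, k, filter)
        # Convert distance to similarity; 1.0 - d assumes cosine distance.
        return [(doc, 1.0 - score) for doc, score in docs]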

Nov 16 '23 02:11 huantt

How can I use a filter and a threshold value with this?

retrieved_docs = Knowledge_vector_database.similarity_search_with_score(query=query, k=5)
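
One store-independent option is to post-filter the returned (document, score) pairs. The sketch below assumes the scores are distances (lower is better); `Doc` is a minimal stand-in for a LangChain Document, and `post_filter`, `max_distance`, and `required_metadata` are hypothetical names. Some stores also accept a `filter` argument directly on similarity_search_with_score, as the PGVector signature above does, which would filter on the database side instead.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (hypothetical)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def post_filter(results, max_distance, required_metadata=None):
    """Keep pairs within the distance cutoff whose metadata matches every required key."""
    required = required_metadata or {}
    return [
        (doc, score)
        for doc, score in results
        if score <= max_distance
        and all(doc.metadata.get(key) == value for key, value in required.items())
    ]


# Hypothetical output of similarity_search_with_score(query=query, k=5).
retrieved_docs = [
    (Doc("alpha", {"source": "manual"}), 0.21),
    (Doc("beta", {"source": "faq"}), 0.35),
    (Doc("gamma", {"source": "manual"}), 0.80),
]

kept = post_filter(retrieved_docs, max_distance=0.5,
                   required_metadata={"source": "manual"})
print([doc.page_content for doc, _ in kept])
```

Because the cutoff is applied after retrieval, fewer than k documents may remain; requesting a larger k first is one way to compensate.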

Feb 21 '24 17:02 naveenfaclon
