
The scores returned by 'similarity_search_with_score' are NOT in descending order

Open arielsho opened this issue 1 year ago • 1 comments

Hello, I came across a problem when using "similarity_search_with_score". According to the docs, it should return "not only the documents but also the similarity score of the query to them".

docs_and_scores = db.similarity_search_with_score(query)

However, I noticed that the scores for the top-5 docs are [0.40305698, 0.43590686, 0.4464777, 0.46140206, 0.46226424], which are not sorted in descending order. Has anyone else had the same problem?

arielsho avatar Apr 13 '23 17:04 arielsho

I have the exact same problem, and the chain uses the first document as context for the model, so the response is way off.

sebastianmacarescu avatar Apr 14 '23 10:04 sebastianmacarescu

The score means the distance. The first one has the min value.
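
A minimal sketch of that reading, using the five scores from the original post and assuming they are distances (smaller means closer to the query). The variable names are illustrative only.

```python
# Scores quoted in the original post, assumed here to be distances
# (smaller = closer to the query), as this comment suggests.
scores = [0.40305698, 0.43590686, 0.4464777, 0.46140206, 0.46226424]

# Under that reading the list is already ordered best-first:
# ascending distance is the same as descending relevance.
is_best_first = scores == sorted(scores)

# The closest match is the entry with the smallest distance.
best_index = min(range(len(scores)), key=scores.__getitem__)

print(is_best_first, best_index)
```

So the first document returned is the closest match, even though its number is the smallest.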

HMS97 avatar Apr 17 '23 18:04 HMS97

@HMS97 That makes sense, thank you!

arielsho avatar Apr 17 '23 18:04 arielsho

It seems this is not standardized across databases? If I use FAISS the score is "higher if closer to 0" but if I use Pinecone the score is "higher if closer to 1"... this doesn't make sense.
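
One way to cope with the mismatch is to record, per store, whether a larger score is better and to sort accordingly. This is only a sketch in plain Python: `HIGHER_IS_BETTER` and `sort_best_first` are hypothetical names, not LangChain APIs, and the two table entries simply encode the behaviour described in this comment, which may depend on the metric configured for each store.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup table based on the behaviour reported above:
# FAISS returns a distance (closer to 0 is better), Pinecone a similarity
# (closer to 1 is better). Verify this for your own store and metric.
HIGHER_IS_BETTER: Dict[str, bool] = {"faiss": False, "pinecone": True}


def sort_best_first(
    results: List[Tuple[str, float]], store: str
) -> List[Tuple[str, float]]:
    """Return (document, score) pairs ordered from most to least relevant."""
    return sorted(results, key=lambda pair: pair[1], reverse=HIGHER_IS_BETTER[store])


# Made-up scores, for illustration only.
faiss_like = [("b", 0.44), ("a", 0.40), ("c", 0.46)]
pinecone_like = [("b", 0.81), ("a", 0.93), ("c", 0.77)]

faiss_sorted = sort_best_first(faiss_like, "faiss")
pinecone_sorted = sort_best_first(pinecone_like, "pinecone")
```

Both lists end up with document "a" first, even though "a" has the lowest score in one and the highest in the other.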

acalatrava avatar May 11 '23 09:05 acalatrava

Does anyone have a solution for this problem?

In pgvector.py, the sort order is hard-coded as ascending, so if we use pgvector as a retriever and treat the score as relevance, it returns the least relevant documents instead of the most relevant ones.


https://github.com/langchain-ai/langchain/blob/bed06a4f4ab802bedb3533021da920c05a736810/libs/langchain/langchain/vectorstores/pgvector.py#L458C17-L458C17

huantt avatar Nov 15 '23 11:11 huantt

> The score means the distance. The first one has the min value.

It is quite confusing when the score represents distance rather than relevance. If you're correct, how is score_threshold supposed to work?

In vectorstores/base.py, they say:

score_threshold: Minimum relevance threshold
                        for similarity_score_threshold

This means that if you specify score_threshold in the as_retriever function, only the documents with a score greater than this value will be returned.

huantt avatar Nov 15 '23 14:11 huantt

I fixed it by extending the PGVector class and overriding the search function so that it returns a score that reflects similarity rather than distance.

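# Import paths match the langchain version linked above; newer releases may differ.
from typing import List, Optional, Tuple

from langchain.schema import Document
from langchain.vectorstores.pgvector import PGVector


class MyPGVector(PGVector):
    def similarity_search_with_score(
            self,
            query: str,
            k: int = 4,
            filter: Optional[dict] = None,
    ) -> List[Tuple[Document, float]]:
        docs = super().similarity_search_with_score(query, k, filter)
        # Turn each distance into a similarity; this assumes cosine distance.
        return [(doc, 1.0 - score) for doc, score in docs]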

huantt avatar Nov 16 '23 02:11 huantt

How can I use a filter and a threshold value with this call?

retrieved_docs = Knowledge_vector_database.similarity_search_with_score(query=query, k=5)

naveenfaclon avatar Feb 21 '24 17:02 naveenfaclon
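
One possible approach, sketched under assumptions: some stores accept a `filter` argument on `similarity_search_with_score` (the PGVector override earlier in this thread passes one through), but its semantics are store-specific, and the score direction depends on the store as discussed above. A store-agnostic fallback is to post-filter the returned `(document, score)` pairs. `FakeDoc` and `post_filter` below are hypothetical stand-ins so the example runs without a database, the data is made up, and the scores are assumed to be distances.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def post_filter(
    results: List[Tuple[FakeDoc, float]],
    max_distance: float,
    required_metadata: Dict[str, str],
) -> List[Tuple[FakeDoc, float]]:
    """Keep pairs whose distance is small enough and whose metadata matches.

    Assumes the score is a distance (smaller is better). For a store that
    returns similarities, compare against a minimum similarity instead.
    """
    return [
        (doc, score)
        for doc, score in results
        if score <= max_distance
        and all(doc.metadata.get(k) == v for k, v in required_metadata.items())
    ]


# Made-up results shaped like the output of similarity_search_with_score.
retrieved_docs = [
    (FakeDoc("pump manual", {"source": "manual"}), 0.31),
    (FakeDoc("pump forum post", {"source": "forum"}), 0.35),
    (FakeDoc("unrelated manual", {"source": "manual"}), 0.72),
]

kept = post_filter(
    retrieved_docs, max_distance=0.5, required_metadata={"source": "manual"}
)
```

Filtering after retrieval can return fewer than `k` documents, so it may be worth requesting a larger `k` than you ultimately need.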