langchain
The scores returned by 'similarity_search_with_score' are NOT in descending order
Hello, I came across a problem when using "similarity_search_with_score".
According to the doc, it should return "not only the documents but also the similarity score of the query to them".
docs_and_scores = db.similarity_search_with_score(query)
However, I noticed the scores for the top-5 docs are: [0.40305698, 0.43590686, 0.4464777, 0.46140206, 0.46226424], which are not sorted in descending order.
Has anyone had the same problem?
I have the exact same problem, and the chain uses the first document as context for the model, so the response is way off.
The score means the distance. The first one has the min value.
@HMS97 That makes sense, thank you!
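To make the explanation above concrete, here is a small check on the scores quoted in the question. It is only a sketch that treats the numbers as distances, as suggested above; the variable names are illustrative.

```python
# Scores reported in the question, read as distances (smaller = closer).
scores = [0.40305698, 0.43590686, 0.4464777, 0.46140206, 0.46226424]

# If they are distances, a "best first" list must be in ascending order.
is_ascending = all(a <= b for a, b in zip(scores, scores[1:]))

# The closest document is the one with the smallest distance.
best_index = min(range(len(scores)), key=scores.__getitem__)

print(is_ascending, best_index)
```

So the ordering is consistent with "best match first" once the score is read as a distance rather than a similarity.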
It seems this is not standardized across databases? If I use FAISS the score is "higher if closer to 0" but if I use Pinecone the score is "higher if closer to 1"... this doesn't make sense.
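One way to cope with the inconsistency described above is to state the score direction explicitly and sort accordingly. The helper below is a hypothetical sketch, not part of LangChain; `rank_best_first` and the `higher_is_better` flag are names made up for illustration, and you would have to know the convention of your store yourself.

```python
from typing import List, Tuple


def rank_best_first(
    results: List[Tuple[str, float]], higher_is_better: bool
) -> List[Tuple[str, float]]:
    """Sort (document, score) pairs so the best match comes first.

    higher_is_better=False -> the score is a distance (smaller is better).
    higher_is_better=True  -> the score is a similarity (larger is better).
    """
    return sorted(results, key=lambda pair: pair[1], reverse=higher_is_better)


# Distance-style scores: "b" is closer, so it should come first.
distance_style = [("a", 0.46), ("b", 0.40)]
# Similarity-style scores: "b" is more similar, so it should come first.
similarity_style = [("a", 0.54), ("b", 0.60)]

by_distance = rank_best_first(distance_style, higher_is_better=False)
by_similarity = rank_best_first(similarity_style, higher_is_better=True)
```

Both calls put "b" first, which is the point: the same ranking code works once the direction of the score is made explicit.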
Does anyone have a solution for this problem?
In pgvector.py, the order has been fixed as asc, so if we use pgvector as a retriever, it will return the least relevant instead of the most relevant as expected.
https://github.com/langchain-ai/langchain/blob/bed06a4f4ab802bedb3533021da920c05a736810/libs/langchain/langchain/vectorstores/pgvector.py#L458C17-L458C17
The score means the distance. The first one has the min value.
It is quite confusing when the score represents the distance instead of the relevance.
If you're correct, how will score_threshold work?
In vectorstores/base.py, they say:
score_threshold: Minimum relevance threshold for similarity_score_threshold
This means that if you specify score_threshold in the as_retriever function, only the documents with a score greater than this value will be returned.
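The concern above can be shown with a few lines of plain Python. This is a sketch under two assumptions: that the threshold keeps documents whose relevance is at or above `score_threshold`, as the quoted docstring suggests, and that relevance can be derived as `1 - distance`, which is only reasonable for cosine distance. The function names are hypothetical, not LangChain's.

```python
from typing import List, Tuple


def distance_to_relevance(distance: float) -> float:
    # Assumed conversion; only meaningful when the distance is a cosine distance.
    return 1.0 - distance


def apply_relevance_threshold(
    scored: List[Tuple[str, float]], score_threshold: float
) -> List[Tuple[str, float]]:
    # Keep documents whose score is at or above the minimum relevance.
    return [(doc, score) for doc, score in scored if score >= score_threshold]


distances = [("doc_a", 0.40305698), ("doc_b", 0.46226424)]

# Correct use: convert to relevance first, then apply the threshold.
relevances = [(doc, distance_to_relevance(d)) for doc, d in distances]
kept = apply_relevance_threshold(relevances, 0.55)

# Pitfall: applying a "minimum" threshold to raw distances keeps the farther document.
kept_wrong = apply_relevance_threshold(distances, 0.45)
```

Here `kept` contains only `doc_a` (the closer document), while `kept_wrong` contains only `doc_b`, which is why a minimum threshold only makes sense on relevance scores, not on distances.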
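I fixed it by extending the PGVector class and overriding the function so that it returns a score that reflects similarity.
from typing import List, Optional, Tuple

from langchain.docstore.document import Document
from langchain.vectorstores.pgvector import PGVector

class MyPGVector(PGVector):
    def similarity_search_with_score(
        self,
        query: str,
        k: int = 4,
        filter: Optional[dict] = None,
    ) -> List[Tuple[Document, float]]:
        # The parent returns (document, distance) pairs, smallest distance first.
        docs = super().similarity_search_with_score(query, k, filter)
        # Turn distance into similarity; this assumes a cosine distance.
        return [(doc, 1.0 - score) for doc, score in docs]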
How can I use a filter and a threshold value with this?
retrieved_docs = Knowledge_vector_database.similarity_search_with_score(query=query, k=5)
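One possible approach is to post-filter the returned (document, score) pairs yourself. The sketch below assumes the scores are distances (smaller is closer) and uses a stand-in `Doc` class instead of the real LangChain `Document` so that it runs on its own; `keep_close_matches`, `max_distance` and `required_metadata` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Doc:
    # Minimal stand-in for a LangChain Document (illustration only).
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def keep_close_matches(
    results: List[Tuple[Doc, float]],
    max_distance: float,
    required_metadata: Optional[Dict[str, str]] = None,
) -> List[Tuple[Doc, float]]:
    """Keep pairs within max_distance whose metadata matches every required key."""
    required = required_metadata or {}
    return [
        (doc, dist)
        for doc, dist in results
        if dist <= max_distance
        and all(doc.metadata.get(key) == value for key, value in required.items())
    ]


# Example data shaped like the output of similarity_search_with_score.
retrieved_docs = [
    (Doc("intro", {"source": "guide"}), 0.40),
    (Doc("faq", {"source": "faq"}), 0.43),
    (Doc("appendix", {"source": "guide"}), 0.47),
]

selected = keep_close_matches(
    retrieved_docs, max_distance=0.45, required_metadata={"source": "guide"}
)
```

With this data only "intro" survives: "faq" fails the metadata check and "appendix" is too far away. The method signature shown earlier in the thread also takes a `filter` dict, so metadata filtering can probably be pushed down to the store instead of being done afterwards; if your store returns similarities rather than distances, flip the comparison to `>=`.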