No score and good results on PgVector when using cosine_similarity because of the DESC and the 1 - cosine_distance. Wrong usage of order by
Describe the bug When using the PgvectorDocumentStore as stated in the documentation of pgvector :
The query needs to have an ORDER BY and LIMIT, and the ORDER BY must be the result of a distance operator (not an expression) in ascending order.
Thus the documents and score are not being well retrieved as it is not using the index because there is currently the DESC order being used and that it is an expression.
To Reproduce
# Creating pipeline with retriever
from haystack.core.pipeline import AsyncPipeline
from haystack_integrations.components.retrievers.pgvector import (
PgvectorEmbeddingRetriever,
)
from haystack_integrations.document_stores.pgvector.document_store import (
PgvectorDocumentStore,
)
from urllib import parse
conn_user_info = f"username:password"
conn_db_info = f"hostname:port/database"
document_store = PgvectorDocumentStore(
connection_string=Secret.from_token(f"postgres://{conn_user_info}@{conn_db_info}"),
schema_name="schema_name",
table_name="table",
embedding_dimension=1000,
vector_function="cosine_similarity",
recreate_table=False,
search_strategy="hnsw",
create_extension=False,
)
# running
sql_query, params = document_store._check_and_build_embedding_retrieval_query(
query_embedding=query_embedding, vector_function='cosine_similarity', top_k=5
)
print(sql_query)
#Composed([SQL('SELECT *, '), SQL("1 - (embedding <=> '[-0.0022329981438815594,-0.05632176622748375,-0.012213421985507011,0.021385865285992622,0.1035347431898117,...]') AS score"), SQL(' FROM '), Identifier('ms_conversation_system'), SQL('.'), Identifier('tipsheet_chunk'), SQL(''), SQL(' ORDER BY score '), SQL('DESC'), SQL(' LIMIT '), Literal(5)])
The query contains the DESC and we are no retrieving any scores whereas if you remove the DESC and you order by embedding <=> ([-0.0022329981438815594,-0.05632176622748375,-0.012213421985507011,0.021385865285992622,0.1035347431898117,...] You will get the score using the index and good results. Describe your environment (please complete the following information):
- OS: iOS, linux
- Haystack version:
│ haystack-ai │ 2.16.1 │ │
│ haystack-experimental │ 0.12.0 │ │
│ pgvector-haystack │ 5.2.1 │
- Integration version: pgvector version 0.8.0
Hey @plopezamaya, thanks for your report.
- No score This looks strange... Having a complete reproducible example would help to debug the issue. In the following example, I actually get scores.
import os
from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever
from haystack import Document
from numpy.random import rand
os.environ["PG_CONN_STR"] = "postgresql://postgres:postgres@localhost:5432/postgres"
document_store = PgvectorDocumentStore(search_strategy="hnsw", vector_function="cosine_similarity", recreate_table=True)
query_embedding = [0.1] * 768
most_similar_embedding = [0.8] * 768
second_best_embedding = [0.8] * 700 + [0.1] * 3 + [0.2] * 65
another_embedding = rand(768).tolist()
docs = [
Document(content="Most similar document (cosine sim)", embedding=most_similar_embedding),
Document(content="2nd best document (cosine sim)", embedding=second_best_embedding),
Document(content="Not very similar document (cosine sim)", embedding=another_embedding),
]
document_store.write_documents(docs)
retriever = PgvectorEmbeddingRetriever(document_store=document_store)
print(retriever.run(query_embedding=query_embedding))
# {'documents': [Document(id=..., content: 'Most similar document (cosine sim)', score: 1.0, embedding: vector of size 768), Document(id=..., content: '2nd best document (cosine sim)', score: 0.9745216149948613, embedding: vector of size 768), Document(id=..., content: 'Not very similar document (cosine sim)', score: 0.8717324925233269, embedding: vector of size 768)]}
- Not using index This might be possible, as you underlined. To compute cosine similarity, we are following Pgvector docs: https://github.com/pgvector/pgvector?tab=readme-ov-file#distances. If you have a better way to achieve the same result, feel free to share it.
Thanks for you reply @anakin87 . That is really weird, it will be hard to give you a real example as it is internal enterprise data. Do we have the same version of haystack packages and pgvector extension ? Yep indeed the compute of the similarity is like that but the issues comes from :
The query needs to have an ORDER BY and LIMIT, and the ORDER BY must be the result of a distance operator (not an expression) in ascending order.
The workaround that we did while waiting for the answer and as we knew we were only going to use cosine_similairty was to override PgvectorDocumentStore._check_and_build_embedding_retrieval_query :
# we always want to return the most similar documents first
# so when using l2_distance and cosine_similarity the sort order must be ASC
sort_order = "ASC" if vector_function in ["l2_distance", "cosine_similarity"] else "DESC"
order_by = "score" if vector_function != "cosine_similarity" else f"(embedding <=> {query_embedding_for_postgres})"
sql_sort = SQL(" ORDER BY {order_by} {sort_order} LIMIT {top_k}").format(
top_k=SQLLiteral(top_k),
sort_order=SQL(sort_order),
order_by=SQL(order_by),
)
sql_query = sql_select + sql_where_clause + sql_sort
return sql_query, params
With this we obtain the scores and use the indexes. We also respect the troubleshooting from pgvector to not hae an expression.
We tried the given sql query directly on pgvector and have the same results (no score without the modifications of the query and score with them).
OK, thanks for the clarification. Still strange that scores are not returned, but I think that we should improve our code to make sure we use the index.