haystack-core-integrations icon indicating copy to clipboard operation
haystack-core-integrations copied to clipboard

No score and good results on PgVector when using cosine_similarity because of the DESC and the 1 - cosine_distance. Wrong usage of order by

Open plopezamaya opened this issue 3 months ago • 3 comments

Describe the bug When using the PgvectorDocumentStore as stated in the documentation of pgvector :

The query needs to have an ORDER BY and LIMIT, and the ORDER BY must be the result of a distance operator (not an expression) in ascending order.

Thus the documents and score are not being well retrieved as it is not using the index because there is currently the DESC order being used and that it is an expression.

To Reproduce


# Creating pipeline with retriever
from haystack.core.pipeline import AsyncPipeline
from haystack_integrations.components.retrievers.pgvector import (
    PgvectorEmbeddingRetriever,
)
from haystack_integrations.document_stores.pgvector.document_store import (
    PgvectorDocumentStore,
)
from urllib import parse

conn_user_info = f"username:password"
conn_db_info = f"hostname:port/database"
document_store = PgvectorDocumentStore(
                connection_string=Secret.from_token(f"postgres://{conn_user_info}@{conn_db_info}"),
                schema_name="schema_name",
                table_name="table",
                embedding_dimension=1000,
                vector_function="cosine_similarity",
                recreate_table=False,
                search_strategy="hnsw",
                create_extension=False,
            )


# running
sql_query, params = document_store._check_and_build_embedding_retrieval_query(
            query_embedding=query_embedding, vector_function='cosine_similarity', top_k=5
        )
print(sql_query)
#Composed([SQL('SELECT *, '), SQL("1 - (embedding <=> '[-0.0022329981438815594,-0.05632176622748375,-0.012213421985507011,0.021385865285992622,0.1035347431898117,...]') AS score"), SQL(' FROM '), Identifier('ms_conversation_system'), SQL('.'), Identifier('tipsheet_chunk'), SQL(''), SQL(' ORDER BY score '), SQL('DESC'), SQL(' LIMIT '), Literal(5)])

The query contains the DESC and we are no retrieving any scores whereas if you remove the DESC and you order by embedding <=> ([-0.0022329981438815594,-0.05632176622748375,-0.012213421985507011,0.021385865285992622,0.1035347431898117,...] You will get the score using the index and good results. Describe your environment (please complete the following information):

  • OS: iOS, linux
  • Haystack version:
│ haystack-ai                              │ 2.16.1                         │                                                            │
│ haystack-experimental                    │ 0.12.0                         │                                                            │
│ pgvector-haystack                        │ 5.2.1                          │
  • Integration version: pgvector version 0.8.0

plopezamaya avatar Aug 19 '25 12:08 plopezamaya

Hey @plopezamaya, thanks for your report.

  1. No score This looks strange... Having a complete reproducible example would help to debug the issue. In the following example, I actually get scores.
import os
from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack_integrations.components.retrievers.pgvector import PgvectorEmbeddingRetriever

from haystack import Document
from numpy.random import rand

os.environ["PG_CONN_STR"] = "postgresql://postgres:postgres@localhost:5432/postgres"

document_store = PgvectorDocumentStore(search_strategy="hnsw", vector_function="cosine_similarity", recreate_table=True)

query_embedding = [0.1] * 768
most_similar_embedding = [0.8] * 768
second_best_embedding = [0.8] * 700 + [0.1] * 3 + [0.2] * 65
another_embedding = rand(768).tolist()

docs = [
    Document(content="Most similar document (cosine sim)", embedding=most_similar_embedding),
    Document(content="2nd best document (cosine sim)", embedding=second_best_embedding),
    Document(content="Not very similar document (cosine sim)", embedding=another_embedding),
]

document_store.write_documents(docs)

retriever = PgvectorEmbeddingRetriever(document_store=document_store)

print(retriever.run(query_embedding=query_embedding))

# {'documents': [Document(id=..., content: 'Most similar document (cosine sim)', score: 1.0, embedding: vector of size 768), Document(id=..., content: '2nd best document (cosine sim)', score: 0.9745216149948613, embedding: vector of size 768), Document(id=..., content: 'Not very similar document (cosine sim)', score: 0.8717324925233269, embedding: vector of size 768)]}
  1. Not using index This might be possible, as you underlined. To compute cosine similarity, we are following Pgvector docs: https://github.com/pgvector/pgvector?tab=readme-ov-file#distances. If you have a better way to achieve the same result, feel free to share it.

anakin87 avatar Aug 20 '25 15:08 anakin87

Thanks for you reply @anakin87 . That is really weird, it will be hard to give you a real example as it is internal enterprise data. Do we have the same version of haystack packages and pgvector extension ? Yep indeed the compute of the similarity is like that but the issues comes from :

The query needs to have an ORDER BY and LIMIT, and the ORDER BY must be the result of a distance operator (not an expression) in ascending order.

The workaround that we did while waiting for the answer and as we knew we were only going to use cosine_similairty was to override PgvectorDocumentStore._check_and_build_embedding_retrieval_query :

        # we always want to return the most similar documents first
        # so when using l2_distance and cosine_similarity the sort order must be ASC
        sort_order = "ASC" if vector_function in ["l2_distance", "cosine_similarity"] else "DESC"
        order_by = "score" if vector_function != "cosine_similarity" else f"(embedding <=> {query_embedding_for_postgres})"

        sql_sort = SQL(" ORDER BY {order_by} {sort_order} LIMIT {top_k}").format(
            top_k=SQLLiteral(top_k),
            sort_order=SQL(sort_order),
            order_by=SQL(order_by),
        )

        sql_query = sql_select + sql_where_clause + sql_sort

        return sql_query, params

With this we obtain the scores and use the indexes. We also respect the troubleshooting from pgvector to not hae an expression.

We tried the given sql query directly on pgvector and have the same results (no score without the modifications of the query and score with them).

plopezamaya avatar Aug 20 '25 16:08 plopezamaya

OK, thanks for the clarification. Still strange that scores are not returned, but I think that we should improve our code to make sure we use the index.

anakin87 avatar Aug 20 '25 16:08 anakin87