
[BUG] Postgres Vector Store inverted similarity calculation

Open vinibs opened this issue 1 year ago • 5 comments

Describe the bug When using Postgres as the Vector Store for a flow and querying it with a specified minimum score, nothing is returned. Looking through the code, I noticed that the Postgres node calculates the distance between the vectors and returns them sorted in ascending order, but the VectorStore to Document node expects a similarity value, not a distance, so it discards the most relevant results for my query. I'm testing a local change that makes the value returned by Postgres "1 - distance", which seems to be enough to fix this. If it has no other side effects, I can also open a PR for this small change.
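A minimal sketch of the mismatch described above, assuming the store returns a cosine distance where 0 means identical. The type and function names are hypothetical and are not taken from the Flowise code base; the sample distances are made up for illustration.

```typescript
// Hypothetical shapes for illustration only; not Flowise's actual types.
type DistanceRow = { pageContent: string; distance: number };
type SimilarityRow = { pageContent: string; similarity: number };

// Convert a cosine distance (0 = identical) into a similarity (1 = identical).
function toSimilarity(rows: DistanceRow[]): SimilarityRow[] {
  return rows.map((r) => ({
    pageContent: r.pageContent,
    similarity: 1 - r.distance,
  }));
}

// Keep only rows at or above the minimum similarity, most similar first.
function filterByMinScore(rows: SimilarityRow[], minScore: number): SimilarityRow[] {
  return rows
    .filter((r) => r.similarity >= minScore)
    .sort((a, b) => b.similarity - a.similarity);
}

// Made-up sample data: the first row is the exact same question.
const rows: DistanceRow[] = [
  { pageContent: "exact same question", distance: 0 },
  { pageContent: "unrelated question", distance: 0.35 },
];

// Treating the raw distance as a similarity drops every row at minScore = 0.8.
const treatedAsSimilarity = rows.filter((r) => r.distance >= 0.8);

// Converting first keeps the exact match and drops the unrelated row.
const converted = filterByMinScore(toSimilarity(rows), 0.8);

console.log(treatedAsSimilarity.length, converted.map((r) => r.pageContent));
```

The conversion `1 - distance` only makes sense for cosine distance; other metrics would need a different mapping.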

To Reproduce Steps to reproduce the behavior:

  1. Create a flow
  2. Configure a Postgres database server
  3. Add the (Vector Store) Postgres node, with output set as "Postgres Vector Store" and set the connection up
  4. Add the VectorStore to Document node and set its input as being the Postgres node's output
  5. Pass "{{question}}" as the Query attribute of the VectorStore to Document node
  6. Add a simple custom function as an ending node and set the VectorStore to Document node's output as an input variable for the function
  7. Make the custom function only return the provided variable (for debugging purposes)
  8. Use the Upsert API to insert data into the vector store for at least two different messages (questions)
  9. Ask the chat one of the questions that were inserted into the Vector Store

Expected behavior The output should return the stored vectors ordered from most to least similar to the question, but instead it returns them ordered from least to most similar. If a minimum score is passed to the VectorStore to Document node, no result is returned, even for the exact same question.

Screenshots The results when querying without setting the minimum score (it brings all entries): Screenshot 2024-04-16 at 11 30 55

The results when querying with a minimum score of 80% (it doesn't bring any data): Screenshot 2024-04-16 at 11 31 28

The VectorStore documents' log with the calculated similarity for this case (the exact same question gets a score of 0, while unrelated questions get greater scores, which are actually greater distances): Screenshot 2024-04-16 at 11 33 30

Flow sql-test Chatflow.json

Setup

  • Installation [e.g. docker, npx flowise start, pnpm start]: Docker
  • Flowise Version [e.g. 1.2.11]: 1.6.2
  • OS: [e.g. macOS, Windows, Linux]: macOS
  • Browser [e.g. chrome, safari]: Chrome

Additional context As mentioned before, it seems the Postgres node is calculating the distance instead of the similarity when querying the vector store. I'm currently testing a change that returns "1 - distance" as the similarity score (or that changes it directly in the query calculation), but I'm not aware of the possible side effects this could cause, since I've been working with Flowise for just 4 days and am not very familiar with its resources. I'd like to confirm this issue before opening a pull request to fix it.

vinibs avatar Apr 16 '24 14:04 vinibs

The Postgres vector store in Flowise uses the implementation from here, and yes, it's using distance.

There's another way to do this via here, which allows you to use cosine, innerProduct or euclidean.

We can change the implementation to use the latter one if that solves the issue.
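For reference, a rough sketch of how those three strategies map onto pgvector's SQL operators (`<=>` for cosine distance, `<->` for Euclidean distance, `<#>` for negative inner product). The table and column names, the helper names, and the euclidean-to-similarity mapping are assumptions made for illustration; this is not the implementation linked above.

```typescript
// Strategy names follow the comment above; everything else here is hypothetical.
type DistanceStrategy = "cosine" | "innerProduct" | "euclidean";

// pgvector operators: <=> cosine distance, <#> negative inner product, <-> L2 distance.
const PGVECTOR_OPERATOR: Record<DistanceStrategy, string> = {
  cosine: "<=>",
  innerProduct: "<#>",
  euclidean: "<->",
};

// Build a parameterized search query. "documents" and "embedding" are
// placeholder identifiers; never interpolate untrusted input as identifiers.
function buildSearchQuery(
  strategy: DistanceStrategy,
  table: string = "documents",
  column: string = "embedding",
): string {
  const op = PGVECTOR_OPERATOR[strategy];
  // All three operators return "smaller is better", so ascending order is correct.
  return `SELECT *, ${column} ${op} $1 AS raw_score FROM ${table} ORDER BY raw_score ASC LIMIT $2`;
}

// Turn the raw operator result into a "bigger is better" score.
function toSimilarityScore(strategy: DistanceStrategy, raw: number): number {
  if (strategy === "cosine") return 1 - raw; // cosine distance -> cosine similarity
  if (strategy === "innerProduct") return -raw; // <#> returns the negated inner product
  return 1 / (1 + raw); // one possible bounded mapping for L2; an assumption, not a standard
}

console.log(buildSearchQuery("cosine"));
console.log(toSimilarityScore("cosine", 0));
```

Whichever strategy is selected, the minimum score in the flow would still need to be compared against the converted value rather than the raw operator result.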

HenryHengZJ avatar Apr 20 '24 11:04 HenryHengZJ

Hi @HenryHengZJ, thanks for your answer. I'm not sure I get your point. Do you mean the distanceStrategy attribute? If so, two questions come to mind:

  • Am I able to specify the strategy through the flow builder?
  • And does changing the strategy solve the situation where we calculate the distance but in the flow we specify the minimum similarity instead?

The issue is actually that we pass a minimum similarity score to the block in the flow, but we compare it to the distance instead (which is "the smaller, the better" rather than "the bigger, the better"), so the "minimum score" input doesn't behave as expected. So, as I'm not very familiar with these distance strategies yet, do you think changing the strategy would solve this, or would it be better to change how the query is built so that it considers the similarity instead of the distance?

vinibs avatar Apr 30 '24 12:04 vinibs

To add to this, calculating cosine similarity on the exact same vector does not give a score of 1, but a score close to 0.5.
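As a sanity check, a plain cosine similarity computed outside the database should come out very close to 1 for identical vectors, so a score near 0.5 would point to some extra transformation along the way. The sketch below is a standalone calculation with a made-up vector; it does not reproduce the Flowise or pgvector code path.

```typescript
// Plain cosine similarity: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Note: a zero vector would make this NaN; not handled in this sketch.
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Made-up embedding, compared with itself.
const v = [0.12, -0.48, 0.87];
const selfSimilarity = cosineSimilarity(v, v);
const selfDistance = 1 - selfSimilarity; // the cosine distance for the same pair

console.log(selfSimilarity, selfDistance);
```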

Weilin37 avatar Jun 26 '24 19:06 Weilin37

Is there a follow up on this?

PolygonHealth avatar Aug 21 '24 20:08 PolygonHealth

I would also be interested in being able to select the metric used to fetch vectors from the pgvector database.

Astriel avatar Sep 03 '24 10:09 Astriel