[BUG] Postgres Vector Store inverted similarity calculation
Describe the bug
When using Postgres as the Vector Store for a flow and querying it with a specified minimum score, nothing is returned. Looking at the code, I noticed that the Postgres node calculates the distance between the vectors and returns them sorted ascending, but the VectorStore to Document node expects the value to be a similarity, not a distance, so it discards the most relevant results for my query. I'm making a local change so that the value returned by Postgres is "1 - distance", which seems to be enough to fix this. If it doesn't cause other side effects, I can also open a PR for this small change.
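The mismatch can be illustrated with a minimal sketch. It assumes the downstream node keeps documents whose score is greater than or equal to the minimum score, and that the store returns a cosine distance; the `ScoredDoc` type, both filter functions, and the sample distances are hypothetical and are not taken from the Flowise code.

```typescript
// Hypothetical shape of a row returned by the vector store query.
interface ScoredDoc {
  text: string;
  distance: number; // cosine distance: 0 = identical, larger = less similar
}

// Treats the raw distance as if it were a similarity score (the suspected bug).
function filterTreatingDistanceAsScore(docs: ScoredDoc[], minScore: number): ScoredDoc[] {
  return docs.filter((d) => d.distance >= minScore);
}

// Converts the distance to a similarity (1 - distance) before comparing.
function filterBySimilarity(docs: ScoredDoc[], minScore: number): ScoredDoc[] {
  return docs.filter((d) => 1 - d.distance >= minScore);
}

// Illustrative values only: the exact match has distance 0.
const sampleDocs: ScoredDoc[] = [
  { text: "exact same question", distance: 0 },
  { text: "unrelated question", distance: 0.45 },
];

const buggy = filterTreatingDistanceAsScore(sampleDocs, 0.8);
const fixed = filterBySimilarity(sampleDocs, 0.8);

console.log(buggy.map((d) => d.text)); // empty: even the exact match is dropped
console.log(fixed.map((d) => d.text)); // only the exact match survives
```

With a minimum score of 0.8, the first filter returns nothing because the best match has the smallest value, which is consistent with the behavior described above.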
To Reproduce
Steps to reproduce the behavior:
- Create a flow
- Configure a Postgres database server
- Add the (Vector Store) Postgres node, with output set as "Postgres Vector Store" and set the connection up
- Add the VectorStore to Document node and set its input as being the Postgres node's output
- Pass "{{question}}" as the Query attribute of the VectorStore to Document node
- Add a simple custom function as an ending node and set the VectorStore to Document node's output as an input variable for the function
- Make the custom function only return the provided variable (for debugging purposes)
- Use the Upsert API to insert data into the vector store for at least two different messages (questions)
- Ask the chat one of the questions that were inserted into the Vector Store
Expected behavior
The output should list the stored vectors ordered from most to least similar to the question, but instead they are ordered from least to most similar. If a minimum score is passed to the VectorStore to Document node, no result is returned, even for the exact same question.
Screenshots
The results when querying without setting the minimum score (it brings all entries):
The results when querying with a minimum score of 80% (it doesn't bring any data):
The VectorStore documents' log with the calculated similarity for this case (bringing the exact same question with a score of 0 while bringing non-related questions with greater scores - actually, greater distances):
Setup
- Installation [e.g. docker, npx flowise start, pnpm start]: Docker
- Flowise Version [e.g. 1.2.11]: 1.6.2
- OS: [e.g. macOS, Windows, Linux]: macOS
- Browser [e.g. chrome, safari]: Chrome
Additional context
As mentioned before, it seems the Postgres node is calculating the distance instead of the similarity when querying the vector store. I'm currently testing a change that returns "1 - distance" as the similarity score (or changes it directly in the query calculation), but I'm not aware of the possible side effects this could cause, since I've been working with Flowise for just 4 days and am not very familiar with its internals.
I'd like to confirm this issue before opening a pull request to fix it.
The Postgres vector store in Flowise uses the implementation from here, and yes, it's using distance.
There's another way to do this via here, which allows you to choose cosine, innerProduct or euclidean.
We can change the implementation to use the latter one if that solves the issue.
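For context, here is a minimal sketch of the three metrics named above, written against plain number arrays. The comments map each function to the pgvector operator that computes the same quantity (`<=>`, `<->`, `<#>`); the function names themselves are hypothetical and are not part of any library API. Note that all three are "smaller is better", and that "1 - distance" yields a similarity only for the cosine case, since Euclidean distance and inner product are unbounded.

```typescript
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

function norm(a: number[]): number {
  return Math.sqrt(dot(a, a));
}

// Same quantity as pgvector's `<=>`: cosine distance = 1 - cosine similarity.
function cosineDistance(a: number[], b: number[]): number {
  return 1 - dot(a, b) / (norm(a) * norm(b));
}

// Same quantity as pgvector's `<->`: Euclidean (L2) distance.
function euclideanDistance(a: number[], b: number[]): number {
  return Math.sqrt(a.reduce((sum, x, i) => sum + (x - b[i]) * (x - b[i]), 0));
}

// Same quantity as pgvector's `<#>`: the NEGATIVE inner product,
// so that ascending order still means "most similar first".
function negativeInnerProduct(a: number[], b: number[]): number {
  return -dot(a, b);
}

const u = [3, 4];
const w = [4, 3];

console.log(cosineDistance(u, w)); // approximately 0.04
console.log(euclideanDistance(u, w)); // approximately 1.414
console.log(negativeInnerProduct(u, w)); // -24
```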
Hi @HenryHengZJ, thanks for your answer.
I'm not sure I get your point. Do you mean the distanceStrategy attribute? If so, two questions come up:
- Am I able to specify the strategy through the flow builder?
- And does changing the strategy solve the situation where we calculate the distance but in the flow we specify the minimum similarity instead?
The issue is that we pass a minimum similarity score to the block in the flow, but we compare it to the distance instead (which is "the smaller, the better" rather than "the bigger, the better"), so the "minimum score" input does not behave as expected.
So, as I'm not very familiar with these distance strategies yet, do you think changing the strategy would solve this, or would it be better to change how the query is built so that it considers the similarity instead of the distance?
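One way the second option could look is sketched below: the query itself returns "1 - cosine distance" as the score and applies the minimum similarity in the WHERE clause. This is only an assumption about a possible fix, not the query Flowise or LangChain actually builds; the `documents` table, the `embedding` column, and the `buildSimilarityQuery` helper are hypothetical, and the SQL has not been run against a real database here.

```typescript
// Parameterized query in the { text, values } shape accepted by common Postgres clients.
interface SimilarityQuery {
  text: string;
  values: (string | number)[];
}

// Hypothetical helper: table and column names are placeholders for the real schema.
function buildSimilarityQuery(
  queryEmbedding: number[],
  minSimilarity: number,
  k: number
): SimilarityQuery {
  // pgvector accepts vectors as text literals such as '[0.1,0.2,0.3]'.
  const vectorLiteral = `[${queryEmbedding.join(",")}]`;
  const text = [
    "SELECT *, 1 - (embedding <=> $1::vector) AS similarity",
    "FROM documents",
    "WHERE 1 - (embedding <=> $1::vector) >= $2",
    // Ascending distance means the most similar rows come first.
    "ORDER BY embedding <=> $1::vector ASC",
    "LIMIT $3",
  ].join("\n");
  return { text, values: [vectorLiteral, minSimilarity, k] };
}

const exampleQuery = buildSimilarityQuery([0.1, 0.2], 0.8, 4);
console.log(exampleQuery.text);
console.log(exampleQuery.values);
```

With this shape, the score handed to the next node is already a similarity, so a minimum score of 0.8 would keep rows whose cosine distance is at most 0.2.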
To add to this, calculating cosine similarity on the exact same vector does not give a score of 1, but a score close to 0.5.
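As a point of reference, plain cosine similarity of a vector with itself is 1, so a value near 0.5 suggests that the reported score is something other than the raw cosine similarity (which transformation is applied is not established in this thread). A minimal check, using a hypothetical `cosineSimilarity` helper:

```typescript
// Hypothetical helper, not taken from Flowise or LangChain.
function cosineSimilarity(a: number[], b: number[]): number {
  const dotProduct = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const normB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  return dotProduct / (normA * normB);
}

const sameVector = [3, 4];
const selfSimilarity = cosineSimilarity(sameVector, sameVector);

// Identical vectors: similarity 1 (up to floating-point error), distance 0.
console.log(selfSimilarity);
console.log(1 - selfSimilarity);
```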
Is there a follow-up on this?
I would be interested to also be able to select the metric to fetch vectors from the pgvector database.