pgvector
HNSW iterator
See #259 for the motivation of this PR.
In a few words: if we want to perform filtering after an HNSW search, then K=hnsw_ef_search results may not be enough to provide enough rows to the upper plan nodes.
Assume hnsw_ef_search=10.
If we perform query:
select * from documents order by embedding<->array[...]::vector limit 10;
then it returns 10 results and that is fine.
If with the same hnsw_ef_search=10 we execute the query:
select * from documents order by embedding<->array[...]::vector limit 100;
and it still returns only 10 results, that is not so good. But at least here we can easily change hnsw_ef_search
to 100 and get the proper number of results.
But if we do
select * from documents where category in (1,2,3) order by embedding<->array[...]::vector limit 10;
and it returns 0 results, then we do not know whether there are actually no documents belonging to these categories or they are just too far from the specified point. And it is not clear how much to increase hnsw_ef_search
to return the proper number of results.
There are three possible solutions to the problem:
- Push down the filter condition to the SearchKNN method and perform filtering before cutting off K results. Unfortunately the Postgres index AM API doesn't allow this.
- Repeat the search with a larger ef_search. This is what I implemented in my previous PR: #266. This solution has two drawbacks:
  - double the amount of work caused by the query retry
  - no strict order of the returned results
- Implement an iterator able to return all elements in increasing distance order.
Approach 3 is what I tried to implement in this PR. It is similar to how KNN search is implemented in Postgres for GiST indexes. The main difference between GiST and HNSW is that the former uses bounding rectangles, which make it possible to prune the traversal without sorting all elements. If we have bounding rectangles A=[(ax1,ay1),(ax2,ay2)] and B=[(bx1,by1),(bx2,by2)] and search for points closest to (x,y), then if x<ax1 and ax2<bx1 and y<ay1 and ay2<by1 (the search point lies below and to the left of A, and A lies entirely between the point and B), every point inside A is closer than any point inside B, so we can traverse all children of A before any child of B.
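For illustration, that GiST-style pruning idea can be sketched like this (a toy Python sketch, not pgvector or PostgreSQL code; `mindist` is the standard point-to-rectangle lower bound on distance):

```python
import heapq

def mindist(point, rect):
    """Minimum possible distance from a point to any point inside an
    axis-aligned rectangle ((x1, y1), (x2, y2))."""
    (x, y), ((x1, y1), (x2, y2)) = point, rect
    dx = max(x1 - x, 0, x - x2)
    dy = max(y1 - y, 0, y - y2)
    return (dx * dx + dy * dy) ** 0.5

def knn_order(point, rects):
    """Visit rectangles in increasing order of their best-case distance.
    A subtree whose mindist already exceeds the k-th best true distance
    found so far can be skipped without inspecting its contents."""
    heap = [(mindist(point, r), i) for i, r in enumerate(rects)]
    heapq.heapify(heap)
    while heap:
        yield heapq.heappop(heap)
```

Because `mindist` is a guaranteed lower bound, GiST can emit results in strictly increasing distance order; the point of the paragraph below is that HNSW has no such bound.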
Unfortunately, for HNSW this doesn't work: if we have two vectors V1, V2 and a search vector V with V1<->V < V2<->V
(the distance between V1 and V is smaller than the distance between V2 and V), it doesn't follow that the same holds for all neighbours of V1 and V2. So unlike GiST we cannot provide a strict order; in other words, we may return a vector with a larger distance before a vector with a smaller distance. But since our search is approximate in any case, this seems acceptable.
How was HNSW search implemented before? We recursively traverse vertexes at each layer, marking them as visited and rejecting candidates with a distance larger than the already chosen ones. The search is repeated for each layer. For intermediate layers ef=1
(which means that we choose only one candidate) and for the leaf layer ef=hnsw_ef_search
. At each layer we use a hash table for marking visited vertexes and two pairing heaps maintaining the sorted sets of candidates and results.
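The pre-patch, one-shot search at a single layer can be sketched roughly as follows (Python pseudocode with hypothetical `dist` and `neighbors` callbacks standing in for the real distance function and on-disk adjacency lists; pgvector's actual implementation is in C and uses pairing heaps rather than binary heaps):

```python
import heapq

def search_layer(query, entry_points, ef, dist, neighbors):
    """One-shot greedy search at one HNSW layer: keep at most ef results,
    rejecting any candidate farther than the ef-th best found so far."""
    visited = set(entry_points)
    # min-heap of candidates to expand; max-heap (negated) of best results
    candidates = [(dist(query, p), p) for p in entry_points]
    heapq.heapify(candidates)
    results = [(-d, p) for d, p in candidates]
    heapq.heapify(results)
    while len(results) > ef:
        heapq.heappop(results)
    while candidates:
        d, p = heapq.heappop(candidates)
        if len(results) >= ef and d > -results[0][0]:
            break  # nearest remaining candidate is worse than the ef-th result
        for q in neighbors(p):
            if q in visited:
                continue
            visited.add(q)
            dq = dist(query, q)
            if len(results) < ef or dq < -results[0][0]:
                heapq.heappush(candidates, (dq, q))
                heapq.heappush(results, (-dq, q))
                if len(results) > ef:
                    heapq.heappop(results)  # reject: worse than ef chosen ones
    return sorted((-d, p) for d, p in results)
```

Note the rejection step: anything beyond the ef best is dropped for good, which is exactly why the search cannot later be resumed to produce more results.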
The iterator needs to keep the state of the recursive traversal across all layers. So we create a stack (array) with as many elements as there are layers, where each element contains a hash table and two pairing heaps. We cannot reject elements with a larger distance, because they have to be traversed in any case; we just put all of them in the pairing heap. To prevent memory overflow we stop the traversal once we have found enough results (1 for an internal layer and hnsw_ef_search
for the leaf layer) and there are no more candidates at a distance less than or equal to the distance to the n-th already chosen result.
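The resumable scan of a single layer could look roughly like this (a simplified Python sketch under the same hypothetical `dist`/`neighbors` callbacks; it omits the ef-bounded stopping rule described above, and on a real HNSW graph the yielded order is only approximately increasing, since expanding a farther candidate can still discover a closer vertex):

```python
import heapq

def scan_layer(query, entry_points, dist, neighbors):
    """Resumable single-layer scan. Unlike the one-shot search, no candidate
    is ever rejected: everything goes into the heaps, so the traversal can
    continue whenever the consumer asks for more results."""
    visited = set(entry_points)
    candidates = [(dist(query, p), p) for p in entry_points]
    heapq.heapify(candidates)
    results = []  # min-heap of discovered but not yet yielded vertices
    while candidates or results:
        # expand while the nearest unexpanded candidate could still beat
        # the nearest pending result
        while candidates and (not results or candidates[0][0] <= results[0][0]):
            d, p = heapq.heappop(candidates)
            heapq.heappush(results, (d, p))
            for q in neighbors(p):
                if q not in visited:
                    visited.add(q)
                    heapq.heappush(candidates, (dist(query, q), q))
        yield heapq.heappop(results)
```

A post-filtering consumer simply keeps pulling from this generator until it has accumulated enough rows that satisfy the qual, instead of being capped at hnsw_ef_search results.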
Something similar was already implemented in the TimescaleDB HNSW index implementation: https://news.ycombinator.com/item?id=37648913
Also related links: https://medium.com/pinterest-engineering/manas-hnsw-streaming-filters-351adf9ac1c4 https://towardsdatascience.com/effects-of-filtered-hnsw-searches-on-recall-and-latency-434becf8041c
@knizhnik Can you please provide a summary/description of what this patchset does? I didn't see anything in the commit messages, and I want to evaluate it holistically. Thanks!
Are there any tests/benchmarks comparing this updated method to the older method? How does this impact performance/recall?
@knizhnik Can you please provide a summary/description of what this patchset does? I didn't see anything in the commit messages, and I want to evaluate it holistically. Thanks!
Done
Are there any tests/benchmarks comparing this updated method to the older method? How does this impact performance/recall?
Results on the OpenAI dataset (1M embeddings with 1536 dimensions).
The query is SELECT _id, openai_vector <-> ARRAY[...]::vector as dist from documents order by dist LIMIT 10;
postgres=# set hnsw.enable_iterator = on;
SET
Time: 0.136 ms
postgres=# \i ~/ann-benchmarks/openai-pgvector.sql
_id | dist
------------------------------------------------+--------------------
<dbpedia:Paris> | 0.5149979680215692
<dbpedia:Economy_of_Paris> | 0.5322160352313228
<dbpedia:List_of_tourist_attractions_in_Paris> | 0.5556925144520639
<dbpedia:Capitale-Nationale> | 0.560634247353793
<dbpedia:Besançon> | 0.5639481182838457
<dbpedia:Captal> | 0.5744128256080486
<dbpedia:Palais_de_Justice,_Paris> | 0.5744524110550726
<dbpedia:Laon> | 0.5754102943098868
<dbpedia:Hôtel_de_Ville,_Paris> | 0.5779674934070351
<dbpedia:Timeline_of_Paris> | 0.5780753294768416
(10 rows)
Time: 4.177 ms
postgres=# set hnsw.enable_iterator = off;
SET
Time: 0.110 ms
postgres=# \i ~/ann-benchmarks/openai-pgvector.sql
_id | dist
------------------------------------------------+--------------------
<dbpedia:Paris> | 0.5149979680215692
<dbpedia:Economy_of_Paris> | 0.5322160352313228
<dbpedia:List_of_tourist_attractions_in_Paris> | 0.5556925144520639
<dbpedia:Capitale-Nationale> | 0.560634247353793
<dbpedia:Besançon> | 0.5639481182838457
<dbpedia:Captal> | 0.5744128256080486
<dbpedia:Palais_de_Justice,_Paris> | 0.5744524110550726
<dbpedia:Laon> | 0.5754102943098868
<dbpedia:Hôtel_de_Ville,_Paris> | 0.5779674934070351
<dbpedia:Timeline_of_Paris> | 0.5780753294768416
(10 rows)
Time: 3.545 ms
Does this produce different results than current 'master', if you stop the iteration so that you get the same number of results?
Does this produce different results than current 'master', if you stop the iteration so that you get the same number of results?
Right now results are identical.
I want to bring up this topic that I also wrote in a comment on my branch:
Currently on 'master', 'ef' parameter controls how many candidates to return. In the new continuous or iterator mode, 'ef' controls how many candidates to maintain in the pool at all times. For the same index and parameters, the first candidate to return is the same in both modes, but after that, the continuous mode can yield slightly different results. The results in continuous mode are always at least as good as in the one-shot mode, but because we "top up" the pool of candidates as we go, we might return elements that are closer to the query vector, after we have already returned some farther away elements.
In other words, in one-shot mode, we always return 'ef' elements in the right order, but because this is aNN, they are not necessarily the true nearest elements. In continuous mode, we return all elements if you keep iterating, but they might not be in the right order.
That can be surprising, but I think it's fine if you're just doing a simple aNN search. However, the planner doesn't know that the ordering is approximate. That can affect the rest of the plan, for example:
prepare surprise(vector) as
select a.id, b.id, a.embedding <=> $1
from items_small a, items_small b
where (a.embedding <=> $1) = (b.embedding <=> $1)
ORDER BY (a.embedding <=> $1)
limit 1000;
postgres=# execute surprise('[-0.31533,0.25417,0.55528,-0.29229,-0.0027537,-1.1689,0.17151,0.060029,0.62897,0.17066,-0.14748,-0.31574,1.0416,-0.017593,-0.053196,0.62312,0.6691,-0.77713,1.4983,0.28086,-0.62815,0.10885,0.74758,-0.24324,0.3955,0.55461,-0.052652,-0.18485,0.51246,0.19892,0.00013484,-0.25888,-0.11083,-0.063943,-0.71447,0.054546,0.96663,0.31343,-0.3033,1.0454,0.82853,0.34762,-0.17433,0.46479,4.1116e-05,0.11956,0.17235,-0.48861,-0.29915,0.7081,0.10253,-0.27391,-1.1292,-0.69647,-0.58459,0.13411,0.27188,-0.47946,0.093771,-0.69395,0.074337,0.20074,0.12972,0.51127,-0.013941,-0.22493,0.80425,0.87931,0.33713,0.22308,-0.62297,-0.4134,0.60944,0.0014924,0.3207,0.26836,0.3099,0.076108,-0.031559,-0.26954,-0.59395,0.098534,0.29471,-0.39031,0.12679,-0.42897,0.20498,-0.60429,-0.27911,-0.44967,0.34115,-0.24527,0.02525,0.54558,0.63618,0.38963,0.10122,0.81177,-0.15689,0.034974]');
ERROR: mergejoin input data is out of order
The plan is:
postgres=# explain execute surprise('[-0.31533,0.25417,0.55528,-0.29229,-0.0027537,-1.1689,0.17151,0.060029,0.62897,0.17066,-0.14748,-0.31574,1.0416,-0.017593,-0.053196,0.62312,0.6691,-0.77713,1.4983,0.28086,-0.62815,0.10885,0.74758,-0.24324,0.3955,0.55461,-0.052652,-0.18485,0.51246,0.19892,0.00013484,-0.25888,-0.11083,-0.063943,-0.71447,0.054546,0.96663,0.31343,-0.3033,1.0454,0.82853,0.34762,-0.17433,0.46479,4.1116e-05,0.11956,0.17235,-0.48861,-0.29915,0.7081,0.10253,-0.27391,-1.1292,-0.69647,-0.58459,0.13411,0.27188,-0.47946,0.093771,-0.69395,0.074337,0.20074,0.12972,0.51127,-0.013941,-0.22493,0.80425,0.87931,0.33713,0.22308,-0.62297,-0.4134,0.60944,0.0014924,0.3207,0.26836,0.3099,0.076108,-0.031559,-0.26954,-0.59395,0.098534,0.29471,-0.39031,0.12679,-0.42897,0.20498,-0.60429,-0.27911,-0.44967,0.34115,-0.24527,0.02525,0.54558,0.63618,0.38963,0.10122,0.81177,-0.15689,0.034974]');
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=65.20..89.29 rows=1000 width=16)
-> Merge Join (cost=65.20..424656.30 rows=17625172 width=16)
Merge Cond: ((a.embedding <=> '[-0.31533,0.25417,0.55528,-0.29229,-0.0027537,-1.1689,0.17151,0.060029,0.62897,0.17066,-0.14748,-0.31574,1.0416,-0.017593,-0.053196,0.62312,0.6691,-0.77713,1.4983,0.28086,-0.62815,0.10885,0.74758,-0.24324,0.3955,0.55461,-0.052652,-0.18485,0.51246,0.19892,0.00013484,-0.25888,-0.11083,-0.063943,-0.71447,0.054546,0.96663,0.31343,-0.3033,1.0454,0.82853,0.34762,-0.17433,0.46479,4.1116e-05,0.11956,0.17235,-0.48861,-0.29915,0.7081,0.10253,-0.27391,-1.1292,-0.69647,-0.58459,0.13411,0.27188,-0.47946,0.093771,-0.69395,0.074337,0.20074,0.12972,0.51127,-0.013941,-0.22493,0.80425,0.87931,0.33713,0.22308,-0.62297,-0.4134,0.60944,0.0014924,0.3207,0.26836,0.3099,0.076108,-0.031559,-0.26954,-0.59395,0.098534,0.29471,-0.39031,0.12679,-0.42897,0.20498,-0.60429,-0.27911,-0.44967,0.34115,-0.24527,0.02525,0.54558,0.63618,0.38963,0.10122,0.81177,-0.15689,0.034974]'::vector) = (b.embedding <=> '[-0.31533,0.25417,0.55528,-0.29229,-0.0027537,-1.1689,0.17151,0.060029,0.62897,0.17066,-0.14748,-0.31574,1.0416,-0.017593,-0.053196,0.62312,0.6691,-0.77713,1.4983,0.28086,-0.62815,0.10885,0.74758,-0.24324,0.3955,0.55461,-0.052652,-0.18485,0.51246,0.19892,0.00013484,-0.25888,-0.11083,-0.063943,-0.71447,0.054546,0.96663,0.31343,-0.3033,1.0454,0.82853,0.34762,-0.17433,0.46479,4.1116e-05,0.11956,0.17235,-0.48861,-0.29915,0.7081,0.10253,-0.27391,-1.1292,-0.69647,-0.58459,0.13411,0.27188,-0.47946,0.093771,-0.69395,0.074337,0.20074,0.12972,0.51127,-0.013941,-0.22493,0.80425,0.87931,0.33713,0.22308,-0.62297,-0.4134,0.60944,0.0014924,0.3207,0.26836,0.3099,0.076108,-0.031559,-0.26954,-0.59395,0.098534,0.29471,-0.39031,0.12679,-0.42897,0.20498,-0.60429,-0.27911,-0.44967,0.34115,-0.24527,0.02525,0.54558,0.63618,0.38963,0.10122,0.81177,-0.15689,0.034974]'::vector))
-> Index Scan using items_small_embedding_idx on items_small a (cost=32.60..13822.32 rows=59372 width=412)
Order By: (embedding <=> '[-0.31533,0.25417,0.55528,-0.29229,-0.0027537,-1.1689,0.17151,0.060029,0.62897,0.17066,-0.14748,-0.31574,1.0416,-0.017593,-0.053196,0.62312,0.6691,-0.77713,1.4983,0.28086,-0.62815,0.10885,0.74758,-0.24324,0.3955,0.55461,-0.052652,-0.18485,0.51246,0.19892,0.00013484,-0.25888,-0.11083,-0.063943,-0.71447,0.054546,0.96663,0.31343,-0.3033,1.0454,0.82853,0.34762,-0.17433,0.46479,4.1116e-05,0.11956,0.17235,-0.48861,-0.29915,0.7081,0.10253,-0.27391,-1.1292,-0.69647,-0.58459,0.13411,0.27188,-0.47946,0.093771,-0.69395,0.074337,0.20074,0.12972,0.51127,-0.013941,-0.22493,0.80425,0.87931,0.33713,0.22308,-0.62297,-0.4134,0.60944,0.0014924,0.3207,0.26836,0.3099,0.076108,-0.031559,-0.26954,-0.59395,0.098534,0.29471,-0.39031,0.12679,-0.42897,0.20498,-0.60429,-0.27911,-0.44967,0.34115,-0.24527,0.02525,0.54558,0.63618,0.38963,0.10122,0.81177,-0.15689,0.034974]'::vector)
-> Materialize (cost=32.60..13970.75 rows=59372 width=412)
-> Index Scan using items_small_embedding_idx on items_small b (cost=32.60..13822.32 rows=59372 width=412)
Order By: (embedding <=> '[-0.31533,0.25417,0.55528,-0.29229,-0.0027537,-1.1689,0.17151,0.060029,0.62897,0.17066,-0.14748,-0.31574,1.0416,-0.017593,-0.053196,0.62312,0.6691,-0.77713,1.4983,0.28086,-0.62815,0.10885,0.74758,-0.24324,0.3955,0.55461,-0.052652,-0.18485,0.51246,0.19892,0.00013484,-0.25888,-0.11083,-0.063943,-0.71447,0.054546,0.96663,0.31343,-0.3033,1.0454,0.82853,0.34762,-0.17433,0.46479,4.1116e-05,0.11956,0.17235,-0.48861,-0.29915,0.7081,0.10253,-0.27391,-1.1292,-0.69647,-0.58459,0.13411,0.27188,-0.47946,0.093771,-0.69395,0.074337,0.20074,0.12972,0.51127,-0.013941,-0.22493,0.80425,0.87931,0.33713,0.22308,-0.62297,-0.4134,0.60944,0.0014924,0.3207,0.26836,0.3099,0.076108,-0.031559,-0.26954,-0.59395,0.098534,0.29471,-0.39031,0.12679,-0.42897,0.20498,-0.60429,-0.27911,-0.44967,0.34115,-0.24527,0.02525,0.54558,0.63618,0.38963,0.10122,0.81177,-0.15689,0.034974]'::vector)
(8 rows)
I'm not sure what to think of this. The planner doesn't have a concept of an approximate sort order. Perhaps we could drop the information that a path is sorted after filtering has been done, but I'm afraid that would prevent the index scan from being chosen in a lot of scenarios that would be fine. The current situation, without this PR, where the index scan doesn't return all the rows in the table also violates the planner's assumptions, just differently, and it's perhaps less likely to throw errors :shrug:.
I want to bring up this topic that I also wrote in a comment on my branch:
Another type of query which can suffer from order violation is group by:
explain select string_agg(_id,',') from (select _id,openai_vector <-> array[1,2,3]::vector as dist from documents order by dist limit 1000) group by dist;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=645.20..2711.52 rows=200 width=40)
Group Key: ((documents.openai_vector <-> '[1,2,3]'::vector))
-> Limit (cost=645.20..2704.02 rows=1000 width=37)
-> Index Scan using documents_openai_vector_idx on documents (cost=645.20..2059465.20 rows=1000000 width=37)
Order By: (openai_vector <-> '[1,2,3]'::vector)
(5 rows)
Frankly speaking, I do not have any good proposal for solving this problem. Both scenarios seem very artificial, and it is unlikely that any user will ever face this problem. The concept of approximate search contradicts the SQL standard from the very beginning: the result of a query plan using the index may differ from that of a plan not using it. If we are not able to provide precise results, then the order of those results seems to be a less critical problem, doesn't it?
Yeah, equality on distance is a bit dubious. It's a float that gets rounded. That makes both my merge join and your group by example unrealistic.
A more realistic case would be MergeAppend, something like:
prepare foo(vector) as
(select id, embedding <=> $1 from items_small
union all
select id, embedding <=> $1 from items
)
order by 2;
postgres=# explain execute foo('[-0.31533,0.25417,0.55528,-0.29229,-0.0027537,-1.1689,0.17151,0.060029,0.62897,0.17066,-0.14748,-0.31574,1.0416,-0.017593,-0.053196,0.62312,0.6691,-0.77713,1.4983,0.28086,-0.62815,0.10885,0.74758,-0.24324,0.3955,0.55461,-0.052652,-0.18485,0.51246,0.19892,0.00013484,-0.25888,-0.11083,-0.063943,-0.71447,0.054546,0.96663,0.31343,-0.3033,1.0454,0.82853,0.34762,-0.17433,0.46479,4.1116e-05,0.11956,0.17235,-0.48861,-0.29915,0.7081,0.10253,-0.27391,-1.1292,-0.69647,-0.58459,0.13411,0.27188,-0.47946,0.093771,-0.69395,0.074337,0.20074,0.12972,0.51127,-0.013941,-0.22493,0.80425,0.87931,0.33713,0.22308,-0.62297,-0.4134,0.60944,0.0014924,0.3207,0.26836,0.3099,0.076108,-0.031559,-0.26954,-0.59395,0.098534,0.29471,-0.39031,0.12679,-0.42897,0.20498,-0.60429,-0.27911,-0.44967,0.34115,-0.24527,0.02525,0.54558,0.63618,0.38963,0.10122,0.81177,-0.15689,0.034974]');
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Append (cost=10000220245.31..10000249571.04 rows=1242881 width=12)
Sort Key: ((items_small.embedding <=> '[-0.31533,0.25417,0.55528,-0.29229,-0.0027537,-1.1689,0.17151,0.060029,0.62897,0.17066,-0.14748,-0.31574,1.0416,-0.017593,-0.053196,0.62312,0.6691,-0.77713,1.4983,0.28086,-0.62815,0.10885,0.74758,-0.24324,0.3955,0.55461,-0.052652,-0.18485,0.51246,0.19892,0.00013484,-0.25888,-0.11083,-0.063943,-0.71447,0.054546,0.96663,0.31343,-0.3033,1.0454,0.82853,0.34762,-0.17433,0.46479,4.1116e-05,0.11956,0.17235,-0.48861,-0.29915,0.7081,0.10253,-0.27391,-1.1292,-0.69647,-0.58459,0.13411,0.27188,-0.47946,0.093771,-0.69395,0.074337,0.20074,0.12972,0.51127,-0.013941,-0.22493,0.80425,0.87931,0.33713,0.22308,-0.62297,-0.4134,0.60944,0.0014924,0.3207,0.26836,0.3099,0.076108,-0.031559,-0.26954,-0.59395,0.098534,0.29471,-0.39031,0.12679,-0.42897,0.20498,-0.60429,-0.27911,-0.44967,0.34115,-0.24527,0.02525,0.54558,0.63618,0.38963,0.10122,0.81177,-0.15689,0.034974]'::vector))
-> Index Scan using items_small_embedding_idx on items_small (cost=32.60..13970.75 rows=59372 width=12)
Order By: (embedding <=> '[-0.31533,0.25417,0.55528,-0.29229,-0.0027537,-1.1689,0.17151,0.060029,0.62897,0.17066,-0.14748,-0.31574,1.0416,-0.017593,-0.053196,0.62312,0.6691,-0.77713,1.4983,0.28086,-0.62815,0.10885,0.74758,-0.24324,0.3955,0.55461,-0.052652,-0.18485,0.51246,0.19892,0.00013484,-0.25888,-0.11083,-0.063943,-0.71447,0.054546,0.96663,0.31343,-0.3033,1.0454,0.82853,0.34762,-0.17433,0.46479,4.1116e-05,0.11956,0.17235,-0.48861,-0.29915,0.7081,0.10253,-0.27391,-1.1292,-0.69647,-0.58459,0.13411,0.27188,-0.47946,0.093771,-0.69395,0.074337,0.20074,0.12972,0.51127,-0.013941,-0.22493,0.80425,0.87931,0.33713,0.22308,-0.62297,-0.4134,0.60944,0.0014924,0.3207,0.26836,0.3099,0.076108,-0.031559,-0.26954,-0.59395,0.098534,0.29471,-0.39031,0.12679,-0.42897,0.20498,-0.60429,-0.27911,-0.44967,0.34115,-0.24527,0.02525,0.54558,0.63618,0.38963,0.10122,0.81177,-0.15689,0.034974]'::vector)
-> Sort (cost=10000220212.70..10000223171.47 rows=1183509 width=12)
Sort Key: ((items.embedding <=> '[-0.31533,0.25417,0.55528,-0.29229,-0.0027537,-1.1689,0.17151,0.060029,0.62897,0.17066,-0.14748,-0.31574,1.0416,-0.017593,-0.053196,0.62312,0.6691,-0.77713,1.4983,0.28086,-0.62815,0.10885,0.74758,-0.24324,0.3955,0.55461,-0.052652,-0.18485,0.51246,0.19892,0.00013484,-0.25888,-0.11083,-0.063943,-0.71447,0.054546,0.96663,0.31343,-0.3033,1.0454,0.82853,0.34762,-0.17433,0.46479,4.1116e-05,0.11956,0.17235,-0.48861,-0.29915,0.7081,0.10253,-0.27391,-1.1292,-0.69647,-0.58459,0.13411,0.27188,-0.47946,0.093771,-0.69395,0.074337,0.20074,0.12972,0.51127,-0.013941,-0.22493,0.80425,0.87931,0.33713,0.22308,-0.62297,-0.4134,0.60944,0.0014924,0.3207,0.26836,0.3099,0.076108,-0.031559,-0.26954,-0.59395,0.098534,0.29471,-0.39031,0.12679,-0.42897,0.20498,-0.60429,-0.27911,-0.44967,0.34115,-0.24527,0.02525,0.54558,0.63618,0.38963,0.10122,0.81177,-0.15689,0.034974]'::vector))
-> Seq Scan on items (cost=10000000000.00..10000080601.86 rows=1183509 width=12)
JIT:
Functions: 6
Options: Inlining true, Optimization true, Expressions true, Deforming true
(10 rows)
That happens to work, though. MergeAppend doesn't do the same sanity check that MergeJoin does. So maybe we're all good; the rule would be that equality on distance is dubious, so if you try to do that, you might get errors. Other operations that depend on ordering work as much as you would expect, given that the ordering is approximate to begin with.
Yeah, equality on distance is a bit dubious.
Sure. Originally I tried (dist * 100)::integer
, which may make some sense, but Postgres is not smart enough to understand that the order is preserved after such a transformation and uses a hash join in this case.
That happens to work, though.
Sorry, but why UNION ALL and not UNION? UNION ALL doesn't remove duplicates, so order is not important for it.
To address the problem with order violation I have added an hnsw.strict_order
GUC.
When set, it simply skips results that violate the ascending distance order.
So we will not get all results, but the order is never violated and no query execution can cause an error.
By default it is switched off, because the cases where strict query order matters seem quite exotic.
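The skipping rule can be sketched as a one-pass filter over the iterator's output (a Python sketch of the described behavior, not pgvector's actual C code):

```python
def strict_order(results):
    """Drop any result whose distance is smaller than the largest distance
    already returned, so the surviving stream is non-decreasing. `results`
    yields (distance, item) pairs in the iterator's approximate order."""
    last = float("-inf")
    for dist, item in results:
        if dist < last:
            continue  # out of order: skip rather than break the sort contract
        last = dist
        yield dist, item
```

The trade-off is visible in the sketch: an out-of-order (i.e. closer) result is silently lost, so recall drops slightly in exchange for an order that plan nodes like merge join can rely on.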
Sorry, but why UNION ALL and not UNION? UNION ALL doesn't remove duplicates so for it order is not important.
The query has an ORDER BY. The planner chooses to implement that with MergeAppend over already-sorted outputs from two branches of the UNION ALL.
@ankane do you have any thoughts on this? In particular, what do you think of returning rows in "wrong" order vs omitting rows that would be in wrong order?
Hi @knizhnik, thanks for the PR. It's an interesting approach, but I want to explore potentially more efficient methods like HQANN for filtered search.
For returning results out-of-order: It's definitely not ideal, but will likely happen with scalar/product quantization as well (and will likely cause confusion since it's not intuitive). Ideally Postgres would have an option to re-rank results from the heap (that access methods / index scans could set), but as far as I know, there's not a way to currently do this.
Hi @knizhnik, thanks for the PR. It's an interesting approach, but I want to explore potentially more efficient methods like HQANN for filtered search.
HQANN is interesting. Filtered-DiskANN is a similar paper but with more thorough experiments IMHO. It's based on Vamana but I think the results would be similar with HNSW.
A problem with that approach is that it's only applicable to equality quals (e.g. col = 'foo'). Those are common, but there's a long tail of other queries where it doesn't help. So if we can improve the post-filtering, that would still be useful even if we had HQANN.
For returning results out-of-order: It's definitely not ideal, but will likely happen with scalar/product quantization as well (and will likely cause confusion since it's not intuitive). Ideally Postgres would have an option to re-rank results from the heap (that access methods / index scans could set), but as far as I know, there's not a way to currently do this.
Ok. Re-ranking could indeed reduce the confusion to applications. It wouldn't help within the executor though; even if you sort the final result set, the intermediate results would still be in wrong order, which can confuse e.g. merge joins or group by. You usually couldn't force re-sorting the intermediate results because you don't know beforehand how many tuples you need from the intermediate node. I think we need to either:
a) Accept weird results if the planner chooses a plan that depends on the ordering, like a merge join,
b) Omit the out-of-order results from the result set altogether, or
c) Introduce the concept of "approximate order" to the planner. If a result set is in approximate order, it's good enough to satisfy an ORDER BY requested by the user, but not for merge joins and such. (And optionally, we could re-sort the final result if it's in approximate order.)
I don't think this is a huge problem in practice. When someone is using a pgvector index, they've already accepted that the results are approximate. Nevertheless it would be nice to have a coherent story on what kind of errors can happen.
FWIW, I'd like to voice support for this or any other addition that allows combining ANN with (mostly) arbitrary queries; it would be useful to my company's use cases.