Explore bypassing HNSW graph building for tiny segments
Description
For both quantized and non-quantized vectors, if there are many flushes with tiny segments, it may not be worth building the HNSW graph.
For example, we already do a brute-force query against the segment if the requested k is larger than the number of vectors in the segment. So having the graph adds nothing at query time, and building it is unnecessary work.
In my mind, this would behave as follows (given some empirically determined threshold N):
- For tiny segments, we would only scalar quantize, or just store the vectors in a flat index (if it's a non-quantizing format)
- When merging tiny segments, we determine if the total number of vectors is > N, and if so, we start building the graph
- During indexing, we don't construct the HNSW graph builder until we have N vectors; then we construct the builder, replay the vectors seen so far, and handle new docs as usual.
N needs to be determined empirically. The HNSW codecs already have so many configuration options that it would be very nice to simply pick a value where the trade-off makes the most sense. I would suspect it is somewhere around 10k.
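For illustration, a rough sketch of the buffering/replay idea during indexing (the class, the `GRAPH_BUILD_THRESHOLD` constant, and `GraphBuilderStub` are placeholders, not actual Lucene code):

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: illustrates the "buffer then replay" idea, not the real codec wiring. */
class LazyGraphIndexer {
  /** Hypothetical threshold N; to be determined empirically. */
  static final int GRAPH_BUILD_THRESHOLD = 10_000;

  /** Stand-in for the flat (and optionally quantized) vector storage. */
  private final List<float[]> flatStorage = new ArrayList<>();
  /** Vectors seen before the graph exists; replayed once we cross the threshold. */
  private final List<float[]> pending = new ArrayList<>();
  /** Stand-in for an HNSW graph builder; null until we decide it is worth building. */
  private GraphBuilderStub graphBuilder;

  void addVector(float[] vector) {
    flatStorage.add(vector); // the vector is always stored flat, graph or not
    if (graphBuilder == null) {
      pending.add(vector);
      if (pending.size() >= GRAPH_BUILD_THRESHOLD) {
        graphBuilder = new GraphBuilderStub();
        for (float[] seen : pending) { // replay everything seen so far
          graphBuilder.addNode(seen);
        }
        pending.clear();
      }
    } else {
      graphBuilder.addNode(vector); // normal path once the graph exists
    }
  }

  boolean hasGraph() {
    return graphBuilder != null;
  }

  /** Placeholder for the real HNSW builder. */
  static class GraphBuilderStub {
    void addNode(float[] vector) {
      // real code would insert the vector into the HNSW graph here
    }
  }
}
```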
Also, would N take into account if Panama Vector is enabled and if things are quantized or not?
Possibly; we may want to consider the scenario where Panama Vector is disabled, though I am not sure I would suggest anyone use vector search with it off :/.
> Also, would N take into account if Panama Vector is enabled and if things are quantized or not?
FWIW I'd optimize for simplicity over picking the perfect heuristic. As an example, IndexOrDocValuesQuery applies an arbitrary 8x penalty to doc values and only uses doc values when doing so would result in 8x fewer comparisons than running the query against the points index. This 8x number is made up; it only tries to account for the fact that comparisons are a bit cheaper on the points index than on doc values. But I like that it's simple and still makes good decisions more often than wrong ones.
Maybe we could do something similar here and only build an HNSW graph if doing a top-1000 search would visit less than 1/8th of the documents that have a vector.
> Maybe we could do something similar here and only build an HNSW graph if doing a top-1000 search would visit less than 1/8th of the documents that have a vector.
It looks like the new expectedVisitedNodes heuristic that you introduced in #14836 could be used for that purpose?
> I would suspect it is somewhere around 10k.
Interestingly, it looks like your intuition roughly aligns with my suggestion if using topK=100: expectedVisitedNodes(100, 10_000) = 921, which is in the same ballpark as 10_000 / 8 = 1250.
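To make the decision rule concrete, here is a rough sketch. It takes the expectedVisitedNodes estimate as an input rather than reproducing its formula, and the 8x penalty factor is the arbitrary IndexOrDocValuesQuery-style constant discussed above; nothing here is actual Lucene code.

```java
/**
 * Sketch only. The real heuristic would rely on the expectedVisitedNodes estimate
 * referenced above (from #14836); its formula is not reproduced here, so this sketch
 * takes the estimate as a plain argument.
 */
final class GraphWorthIt {
  /** Arbitrary penalty factor, borrowed from the IndexOrDocValuesQuery analogy above. */
  static final int PENALTY = 8;

  /**
   * @param expectedVisited estimated nodes visited by a graph search for the chosen topK,
   *     e.g. expectedVisitedNodes(100, 10_000) = 921 from the comment above
   * @param numVectors total number of documents with a vector in the segment
   */
  static boolean shouldBuildGraph(int expectedVisited, int numVectors) {
    // Only build the graph if the graph search is expected to visit fewer than
    // 1/8th of the vectors; otherwise brute force is likely competitive.
    return (long) expectedVisited * PENALTY < numVectors;
  }

  public static void main(String[] args) {
    // Sanity check with the numbers from the thread: 921 * 8 = 7368 < 10_000,
    // so at 10k vectors and topK=100 the graph is (barely) worth building.
    System.out.println(shouldBuildGraph(921, 10_000)); // true
  }
}
```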
> Interestingly, it looks like your intuition roughly aligns with my suggestion if using topK=100
LOL, your intuition is better than mine! It took me many experiments to arrive at that estimation!
Yeah, I think expectedVisitedNodes can be used for this purpose. We pick a "standard" k (or allow it to be configurable? maybe that is too many knobs), apply it, and determine if HNSW is worth it.
I would expect this to improve indexing throughput without too much of a hit at query time (especially for highly quantized vectors where vector comparisons are very cheap).
> or allow it to be configurable? maybe that is too many knobs
I worry about too many knobs too, I'd hardcode it.
> I would expect this to improve indexing throughput without too much of a hit at query time
++, especially in the NRT case (e.g. 1s refresh interval) when tiny segments get flushed very frequently
I liked the idea and was playing around with this change. I'll put up a PR for this soon. Also, do we know how we want to benchmark this change? Maybe set a lower buffer size to trigger frequent flushes and run luceneutil to see if there is some impact on the overall indexing time, and possibly some minor improvement to recall?
We have StoredFieldsBenchmark to test the impact of NRT (frequent small flushes, frequent small merges) on stored fields; we could write something similar for vectors.
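For what it's worth, a bare-bones sketch of what a vectors equivalent might measure (document count, dimension, RAM buffer size, and the reopen cadence are all made-up parameters; this is not the StoredFieldsBenchmark code):

```java
import java.nio.file.Files;
import java.util.Random;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/** Rough sketch of an NRT-style vector indexing benchmark; all sizes are made up. */
public class TinySegmentVectorsBench {
  public static void main(String[] args) throws Exception {
    int numDocs = 100_000, dim = 128;
    Random random = new Random(42);
    IndexWriterConfig config = new IndexWriterConfig();
    config.setRAMBufferSizeMB(4); // small buffer to encourage frequent tiny flushes
    try (Directory dir = FSDirectory.open(Files.createTempDirectory("vectors-bench"));
        IndexWriter writer = new IndexWriter(dir, config)) {
      long start = System.nanoTime();
      for (int i = 0; i < numDocs; i++) {
        float[] vector = new float[dim];
        for (int j = 0; j < dim; j++) {
          vector[j] = random.nextFloat();
        }
        Document doc = new Document();
        doc.add(new KnnFloatVectorField("vector", vector, VectorSimilarityFunction.EUCLIDEAN));
        writer.addDocument(doc);
        if (i % 1_000 == 0) {
          // Simulate a short refresh interval: frequent NRT reopens force frequent small flushes.
          DirectoryReader.open(writer).close();
        }
      }
      writer.forceMerge(1);
      System.out.printf("Indexed %d docs in %d ms%n", numDocs, (System.nanoTime() - start) / 1_000_000);
    }
  }
}
```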
@jpountz Sure, makes sense
I opened #14963 for this. Looking for reviews and suggestions. Thanks!
I'm concerned that we may now have lost test coverage of the complicated vector format merging code, since in most tests we never build an HNSW graph. Is there an easy way to opt out of this when testing? We should probably randomize and/or disable this completely for the tests that are explicitly focused on HNSW graph creation.
@msokolov We disabled this on those tests (related to HNSW graph creation and failing) in the PR, for this same reason.
> Is there an easy way to opt out of this when testing?
We disabled it for those tests by disabling it on the codec they use. For example:
```java
new IndexWriterConfig()
    .setCodec(
        TestUtil.alwaysKnnVectorsFormat(
            new Lucene99HnswVectorsFormat(DEFAULT_MAX_CONN, DEFAULT_BEAM_WIDTH, 0)));
```
There is one place I see in HnswGraphTestCase where we might lose coverage, since the number of nodes would not let the assert run. Maybe we should disable it there too, or just increase the number of nodes.
Got it - I was thinking of TestLucene99HnswVectorsFormat.testRandomWithUpdatesAndGraph -- I was debugging and it didn't ever seem to be merging any HNSW graphs.
> I was thinking of TestLucene99HnswVectorsFormat.testRandomWithUpdatesAndGraph
The issue is it's switched on by default for the codec. So for tests that pass their own codecs or use methods like getDefaultCodec, this optimization automatically kicks in unless we explicitly disable it for those. I opened https://github.com/apache/lucene/pull/15419 to disable it for one test I saw, but I can include testRandomWithUpdatesAndGraph (or any more, if needed) as well.
One thing we can do is have RandomCodec sometimes set the threshold very high. That way, tests run as part of TestPerFieldKnnVectorsFormat will sometimes test with and sometimes without this optimization.
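Something along these lines, perhaps. This sketch assumes the three-argument Lucene99HnswVectorsFormat constructor from the snippet above, with my reading that the last argument is the graph-building threshold; the actual RandomCodec wiring is omitted.

```java
import static org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat.DEFAULT_BEAM_WIDTH;
import static org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat.DEFAULT_MAX_CONN;

import java.util.Random;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat;

/** Sketch of how a randomized test codec could flip the optimization on and off. */
class RandomThresholdSketch {
  static KnnVectorsFormat randomHnswFormat(Random random) {
    // 0 disables the optimization (the graph is always built, as in the snippet above);
    // a very large threshold means tiny test indexes exercise the graph-less path.
    int threshold = random.nextBoolean() ? 0 : Integer.MAX_VALUE;
    return new Lucene99HnswVectorsFormat(DEFAULT_MAX_CONN, DEFAULT_BEAM_WIDTH, threshold);
  }
}
```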