Explore bypassing HNSW graph building for tiny segments
Description
For both quantized and non-quantized vectors, if there are many flushes with tiny segments, it may not be worth building the HNSW graph.
For example, we already do a brute-force query against the segment if the requested k is larger than the number of vectors in the segment. So having the graph adds nothing at query time, and building it is unnecessary work.
In my mind, this would behave as follows (given some empirically determined threshold N):
- For tiny segments, we would only scalar quantize, or just store the vectors in a flat index (if it's a non-quantizing format)
- When merging tiny segments, we determine if the total number of vectors is > N, and if so, we start building the graph
- During indexing, we don't construct the HNSW graph builder until we have N vectors; then we construct the builder, replay the vectors seen so far, and handle new docs as usual.
N needs to be determined empirically. The HNSW codecs already have so many configuration options that it would be very nice to simply pick a value where the trade-off makes the most sense. I would suspect it is somewhere around 10k.
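For illustration, a rough sketch of the buffering/replay idea during indexing (the class, the `GRAPH_BUILD_THRESHOLD` constant, and `GraphBuilderStub` are placeholders, not actual Lucene code):

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: illustrates the "buffer then replay" idea, not the real codec wiring. */
class LazyGraphIndexer {
  /** Hypothetical threshold N; to be determined empirically. */
  static final int GRAPH_BUILD_THRESHOLD = 10_000;

  /** Stand-in for the flat (and optionally quantized) vector storage. */
  private final List<float[]> flatStorage = new ArrayList<>();
  /** Vectors seen before the graph exists; replayed once we cross the threshold. */
  private final List<float[]> pending = new ArrayList<>();
  /** Stand-in for an HNSW graph builder; null until we decide it is worth building. */
  private GraphBuilderStub graphBuilder;

  void addVector(float[] vector) {
    flatStorage.add(vector); // the vector is always stored flat, graph or not
    if (graphBuilder == null) {
      pending.add(vector);
      if (pending.size() >= GRAPH_BUILD_THRESHOLD) {
        graphBuilder = new GraphBuilderStub();
        for (float[] seen : pending) { // replay everything seen so far
          graphBuilder.addNode(seen);
        }
        pending.clear();
      }
    } else {
      graphBuilder.addNode(vector); // normal path once the graph exists
    }
  }

  boolean hasGraph() {
    return graphBuilder != null;
  }

  /** Placeholder for the real HNSW builder. */
  static class GraphBuilderStub {
    void addNode(float[] vector) {
      // real code would insert the vector into the HNSW graph here
    }
  }
}
```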
Also, would N take into account if Panama Vector is enabled and if things are quantized or not?
Possibly; we may want to consider the scenario where Panama Vector is disabled, though I am not sure I would suggest anyone use vector search with it off :/.
> Also, would N take into account if Panama Vector is enabled and if things are quantized or not?
FWIW I'd optimize for simplicity over picking the perfect heuristic. As an example, IndexOrDocValuesQuery applies an arbitrary 8x penalty to doc values and only uses doc values when doing so would result in 8x fewer comparisons than running the query against the points index. This 8x number is made up; it only tries to account for the fact that comparisons are a bit cheaper on the points index than on doc values. But I like that it's simple and still makes good decisions more often than wrong ones.
Maybe we could do something similar here and only build an HNSW graph if doing a top-1000 search would visit less than 1/8th of the documents that have a vector.
> Maybe we could do something similar here and only build an HNSW graph if doing a top-1000 search would visit less than 1/8th of the documents that have a vector.
It looks like the new expectedVisitedNodes heuristic that you introduced in #14836 could be used for that purpose?
> I would suspect it is somewhere around 10k.
Interestingly, it looks like your intuition roughly aligns with my suggestion if using topK=100: expectedVisitedNodes(100, 10_000) = 921, which is in the same ballpark as 10_000 / 8 = 1250.
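To make the decision rule concrete, here is a rough sketch. It takes the expectedVisitedNodes estimate as an input rather than reproducing its formula, and the 8x penalty factor is the arbitrary IndexOrDocValuesQuery-style constant discussed above; nothing here is actual Lucene code.

```java
/**
 * Sketch only. The real heuristic would rely on the expectedVisitedNodes estimate
 * referenced above (from #14836); its formula is not reproduced here, so this sketch
 * takes the estimate as a plain argument.
 */
final class GraphWorthIt {
  /** Arbitrary penalty factor, borrowed from the IndexOrDocValuesQuery analogy above. */
  static final int PENALTY = 8;

  /**
   * @param expectedVisited estimated nodes visited by a graph search for the chosen topK,
   *     e.g. expectedVisitedNodes(100, 10_000) = 921 from the comment above
   * @param numVectors total number of documents with a vector in the segment
   */
  static boolean shouldBuildGraph(int expectedVisited, int numVectors) {
    // Only build the graph if the graph search is expected to visit fewer than
    // 1/8th of the vectors; otherwise brute force is likely competitive.
    return (long) expectedVisited * PENALTY < numVectors;
  }

  public static void main(String[] args) {
    // Sanity check with the numbers from the thread: 921 * 8 = 7368 < 10_000,
    // so at 10k vectors and topK=100 the graph is (barely) worth building.
    System.out.println(shouldBuildGraph(921, 10_000)); // true
  }
}
```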
> Interestingly, it looks like your intuition roughly aligns with my suggestion if using topK=100
LOL, your intuition is better than mine! It took me many experiments to arrive at that estimation!
Yeah, I think expectedVisitedNodes can be used for this purpose. We pick a "standard" k (or allow it to be configurable? maybe that is too many knobs), apply it, and determine if HNSW is worth it.
I would expect this to improve indexing throughput without too much of a hit at query time (especially for highly quantized vectors where vector comparisons are very cheap).
> or allow it to be configurable? maybe that is too many knobs
I worry about too many knobs too, I'd hardcode it.
> I would expect this to improve indexing throughput without too much of a hit at query time
++, especially in the NRT case (e.g. 1s refresh interval) when tiny segments get flushed very frequently
I liked the idea and was playing around with this change. I'll put up a PR for this soon. Also, do we know how we want to benchmark this change? Maybe set a lower buffer size to trigger frequent flushes and run luceneutil to see if there is some impact on the overall indexing time, and possibly some minor improvement to recall?
We have StoredFieldsBenchmark to test the impact of NRT (frequent small flushes, frequent small merges) on stored fields; we could write something similar for vectors.
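For what it's worth, a bare-bones sketch of what a vectors equivalent might measure (document count, dimension, RAM buffer size, and the reopen cadence are all made-up parameters; this is not the StoredFieldsBenchmark code):

```java
import java.nio.file.Files;
import java.util.Random;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.KnnFloatVectorField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/** Rough sketch of an NRT-style vector indexing benchmark; all sizes are made up. */
public class TinySegmentVectorsBench {
  public static void main(String[] args) throws Exception {
    int numDocs = 100_000, dim = 128;
    Random random = new Random(42);
    IndexWriterConfig config = new IndexWriterConfig();
    config.setRAMBufferSizeMB(4); // small buffer to encourage frequent tiny flushes
    try (Directory dir = FSDirectory.open(Files.createTempDirectory("vectors-bench"));
        IndexWriter writer = new IndexWriter(dir, config)) {
      long start = System.nanoTime();
      for (int i = 0; i < numDocs; i++) {
        float[] vector = new float[dim];
        for (int j = 0; j < dim; j++) {
          vector[j] = random.nextFloat();
        }
        Document doc = new Document();
        doc.add(new KnnFloatVectorField("vector", vector, VectorSimilarityFunction.EUCLIDEAN));
        writer.addDocument(doc);
        if (i % 1_000 == 0) {
          // Simulate a short refresh interval: frequent NRT reopens force frequent small flushes.
          DirectoryReader.open(writer).close();
        }
      }
      writer.forceMerge(1);
      System.out.printf("Indexed %d docs in %d ms%n", numDocs, (System.nanoTime() - start) / 1_000_000);
    }
  }
}
```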
@jpountz Sure, makes sense
I opened #14963 for this. Looking for reviews and suggestions. Thanks!
I'm concerned that we may now have lost test coverage of the complicated vector format merging code, since in most tests we never build an HNSW graph. Is there an easy way to opt out of this when testing? We should probably randomize and/or disable this completely for the tests that are explicitly focused on HNSW graph creation.
@msokolov We disabled this on those tests (related to HNSW graph creation and failing) in the PR, for this same reason.
> Is there an easy way to opt out of this when testing?
We disabled it for those tests by disabling it on the codec they use. For example:
```java
new IndexWriterConfig()
    .setCodec(
        TestUtil.alwaysKnnVectorsFormat(
            new Lucene99HnswVectorsFormat(DEFAULT_MAX_CONN, DEFAULT_BEAM_WIDTH, 0)));
```
There is one place I see in HnswGraphTestCase where we might lose coverage, since the number of nodes would not let the assert run. Maybe we should disable it there too, or just increase the number of nodes.
Got it - I was thinking of TestLucene99HnswVectorsFormat.testRandomWithUpdatesAndGraph -- I was debugging and it didn't ever seem to be merging any HNSW graphs.
> I was thinking of TestLucene99HnswVectorsFormat.testRandomWithUpdatesAndGraph
The issue is it's switched on by default for the codec. So for tests that pass their own codecs or use methods like getDefaultCodec, this optimization automatically kicks in unless we explicitly disable it for those. I opened https://github.com/apache/lucene/pull/15419 to disable it for one test I saw, but I can include testRandomWithUpdatesAndGraph (or any more, if needed) as well.
One thing we can do is have RandomCodec sometimes set the threshold very high. That way, tests run as part of TestPerFieldKnnVectorsFormat will sometimes test with and sometimes without this optimization.
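Something along these lines, perhaps. This sketch assumes the three-argument Lucene99HnswVectorsFormat constructor from the snippet above, with my reading that the last argument is the graph-building threshold; the actual RandomCodec wiring is omitted.

```java
import static org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat.DEFAULT_BEAM_WIDTH;
import static org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat.DEFAULT_MAX_CONN;

import java.util.Random;
import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat;

/** Sketch of how a randomized test codec could flip the optimization on and off. */
class RandomThresholdSketch {
  static KnnVectorsFormat randomHnswFormat(Random random) {
    // 0 disables the optimization (the graph is always built, as in the snippet above);
    // a very large threshold means tiny test indexes exercise the graph-less path.
    int threshold = random.nextBoolean() ? 0 : Integer.MAX_VALUE;
    return new Lucene99HnswVectorsFormat(DEFAULT_MAX_CONN, DEFAULT_BEAM_WIDTH, threshold);
  }
}
```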