Should we rewrite/optimize the HNSW graph in 2nd pass?
I was talking to @msokolov just now and this random idea came up... this is just a brainstormy/speculative idea at this point:
The HNSW graph is built incrementally: for each new vector to insert, we run a topK search that sees only the previously added vectors, then add a new node connecting to those top K (we might also prune existing nodes' connections to keep diversity). We do this one by one (well, concurrently) ... but this means (maybe?) vectors added early can have poor-ish neighbors since they didn't see all the future vectors yet. But when future vectors are added and they are close to an already-added vector, they also add the return link (from old node to new node), so this sort of makes up for that old node not having seen the future vectors yet?
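To make that concrete, here is a toy, single-layer sketch of that incremental build. All the names are made up for illustration, this is NOT Lucene's actual HnswGraphBuilder, and a brute-force scan stands in for the real greedy/beam search; the point is just the "new node only sees old nodes, old nodes get return links" shape:

```java
// Toy, single-layer sketch of the incremental build described above; not Lucene's
// HnswGraphBuilder. Brute force stands in for the real greedy/beam search.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ToyHnswBuild {

  static float dist(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum; // squared Euclidean distance (smaller is closer)
  }

  // insert vectors one by one; each new node only "sees" nodes added before it
  static List<List<Integer>> build(List<float[]> vectors, int maxConn) {
    List<List<Integer>> neighbors = new ArrayList<>();
    for (int node = 0; node < vectors.size(); node++) {
      // topK over the previously added nodes 0..node-1 only
      List<Integer> topK = nearest(vectors, node, vectors.get(node), maxConn);
      neighbors.add(new ArrayList<>(topK));
      // return links: each selected old node also links back to the new node,
      // which is what partially makes up for it not having seen future vectors
      for (int old : topK) {
        neighbors.get(old).add(node);
        if (neighbors.get(old).size() > maxConn) {
          prune(vectors, old, neighbors.get(old), maxConn);
        }
      }
    }
    return neighbors;
  }

  // brute-force topK among nodes [0, upTo)
  static List<Integer> nearest(List<float[]> vectors, int upTo, float[] query, int k) {
    List<Integer> ids = new ArrayList<>();
    for (int i = 0; i < upTo; i++) ids.add(i);
    ids.sort(Comparator.comparingDouble(i -> dist(vectors.get(i), query)));
    return ids.subList(0, Math.min(k, ids.size()));
  }

  // simplest possible pruning: keep the maxConn closest (the real builder uses a
  // diversity heuristic, not pure closeness)
  static void prune(List<float[]> vectors, int node, List<Integer> cand, int maxConn) {
    cand.sort(Comparator.comparingDouble(i -> dist(vectors.get(i), vectors.get(node))));
    while (cand.size() > maxConn) {
      cand.remove(cand.size() - 1);
    }
  }
}
```

(The real builder also has multiple levels and a proper beam search, but none of that changes the early-vs-late neighbor-quality question.)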
Anyway, it made me wonder whether a 2nd phase through all the vectors, with the "benefit of hindsight", might somehow yield improvements (smaller graph, faster/better search-time recall curve)?
I think @msokolov mentioned that the original HNSW paper had talked about this behavior but it was somehow OK / a benefit / not a problem?
If we were to plot "average node neighbor distance" as a function of when the node was added, I wonder if we'd see a correlation of lower quality neighbors for earlier added vectors? Or, is the bidirectional linking completely offsetting this bias?
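If someone wants to try that plot, here's a sketch of the measurement, continuing the ToyHnswBuild toy structures above. I think insertion order == node ordinal for a single flushed segment; with a real on-disk Lucene graph you'd walk its neighbor iterator instead of these toy lists:

```java
// Sketch: average neighbor distance per node, indexed by insertion order (toy
// adjacency lists from ToyHnswBuild above, not the real graph API).
static double[] avgNeighborDistanceByInsertionOrder(
    List<float[]> vectors, List<List<Integer>> neighbors) {
  double[] avg = new double[vectors.size()];
  for (int node = 0; node < vectors.size(); node++) {
    List<Integer> nbrs = neighbors.get(node);
    double sum = 0;
    for (int nbr : nbrs) {
      sum += ToyHnswBuild.dist(vectors.get(node), vectors.get(nbr));
    }
    // plot avg[node] vs node: if early nodes trend worse (larger), the
    // back-links are not fully offsetting the bias
    avg[node] = nbrs.isEmpty() ? 0 : sum / nbrs.size();
  }
  return avg;
}
```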
I think this would be an interesting experiment. Basically, for each node, use its existing connections as "entry points" and re-run the regular insertion logic. Very similar to the new fix up logic added in this PR: https://github.com/apache/lucene/pull/15003
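Roughly what I mean, continuing the toy ToyHnswBuild sketch above. The real version would re-run HnswGraphBuilder's search seeded from the node's existing connections as entry points, plus its diversity heuristic; here brute force stands in for the search:

```java
// Rough sketch of the proposed 2nd pass (toy structures, brute-force stand-in for
// the real entry-point-seeded search).
static void secondPass(List<float[]> vectors, List<List<Integer>> neighbors, int maxConn) {
  for (int node = 0; node < vectors.size(); node++) {
    final int q = node;
    // "with hindsight": re-pick this node's neighbors from ALL vectors, not just
    // the ones that existed when it was first inserted
    List<Integer> candidates = new ArrayList<>();
    for (int other = 0; other < vectors.size(); other++) {
      if (other != q) candidates.add(other);
    }
    candidates.sort(
        Comparator.comparingDouble(i -> ToyHnswBuild.dist(vectors.get(i), vectors.get(q))));
    List<Integer> topK =
        new ArrayList<>(candidates.subList(0, Math.min(maxConn, candidates.size())));
    neighbors.set(q, topK);
    // keep the return links, pruning any old node that now has too many
    for (int nbr : topK) {
      if (!neighbors.get(nbr).contains(q)) {
        neighbors.get(nbr).add(q);
        if (neighbors.get(nbr).size() > maxConn) {
          ToyHnswBuild.prune(vectors, nbr, neighbors.get(nbr), maxConn);
        }
      }
    }
  }
}
```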
All that said, I am not sure it would be beneficial (as you said). However, HNSW graphs are directed, so it is possible there are weird cases where a node A is near a new node B but never gets the back-link, because the new node B doesn't consider node A one of its nearest neighbors.
This sort of smells like the "connected components" thing, but likely much cheaper.
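A cheap diagnostic before doing anything fancier might be to just measure how asymmetric the links end up, e.g. with the same toy adjacency-list representation as above:

```java
// Count directed edges A -> B where B has no edge back to A, i.e. how asymmetric
// the graph ended up (toy adjacency lists, not the real graph API).
static long countMissingBackLinks(List<List<Integer>> neighbors) {
  long missing = 0;
  for (int a = 0; a < neighbors.size(); a++) {
    for (int b : neighbors.get(a)) {
      if (!neighbors.get(b).contains(a)) {
        missing++;
      }
    }
  }
  return missing;
}
```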
Very similar to the new fix up logic added in this PR: https://github.com/apache/lucene/pull/15003
+1, I wonder if that fixup logic (which currently only kicks in on the "swiss cheese" case where an HNSW graph has holes because vectors were deleted) could also just run on an ordinary HNSW graph without holes? Or does it specifically target the holes as the things needing fix-me-up attention?
Maybe a simple offline one-off experiment/prototype to try: build the HNSW graph like we do today. But then build another HNSW graph where you insert the same vectors in the reverse order. Then merge the two graphs (map each node in one graph onto the same node in the other, taking the union of all edges). Hmm, do we have a merge/union API today on the HNSW graph? Then maybe prune a bit (diversity pruning?) to get the average connection count back to "typical", then see if that helps the recall/performance tradeoff. Sounds like a lot of work :) But it might let us see if "hindsight" would've been helpful.
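Something like this, with the toy builder from above (since as far as I know we have no real merge/union API on the HNSW graph today, this all stays in made-up toy structures):

```java
// One-off sketch: build forward and reversed, union the edges, prune back down.
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

static List<List<Integer>> hindsightUnion(List<float[]> vectors, int maxConn) {
  int n = vectors.size();
  List<List<Integer>> fwd = ToyHnswBuild.build(vectors, maxConn);

  List<float[]> reversed = new ArrayList<>(vectors);
  Collections.reverse(reversed);
  List<List<Integer>> rev = ToyHnswBuild.build(reversed, maxConn);

  List<List<Integer>> merged = new ArrayList<>();
  for (int node = 0; node < n; node++) {
    // node i in the original order is node (n-1-i) in the reversed build
    Set<Integer> union = new LinkedHashSet<>(fwd.get(node));
    for (int nbr : rev.get(n - 1 - node)) {
      union.add(n - 1 - nbr);
    }
    List<Integer> edges = new ArrayList<>(union);
    if (edges.size() > maxConn) {
      ToyHnswBuild.prune(vectors, node, edges, maxConn); // crude stand-in for diversity pruning
    }
    merged.add(edges);
  }
  return merged;
}
```

Then run the same recall/latency measurements on the baseline graph vs the merged+pruned graph and see whether the curves separate.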
Another thing I'm going to hopefully try soon: I worry that the vectors we get for a given corpus (e.g. for Wikipedia Cohere on HuggingFace that we use in Lucene's nightly benchy) might have been pre-sorted in some "interesting" way which might alter results. This sort of sneaky benchmark attack vector is scary to me :) There are so many sources of subtle "fucked up the benchmarking" already! I don't need one more to fret about. The order of documents can make a big difference (e.g. bipartite graph reordering results) and might lead to wrong conclusions in general and steer our development incorrectly, baking echoes of that mistake into Lucene's algorithm choices!
So I'm going to try a (hopefully) A/A test: knnPerfTest.py on the existing Cohere vectors (same order they are in at HuggingFace), then another run after shuffling the vectors. Hopefully those two are nearly the same, under the "standard" HNSW/indexing noise floor (whatever THAT is, I don't know).
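The shuffle itself could be something like this, assuming the vectors sit in a flat binary file of fixed-size float32 records (dim floats per vector), which I believe is how the luceneutil .vec files are laid out (worth double-checking). Endianness doesn't matter here since we only move whole records around:

```java
// Shuffle a flat file of fixed-size float32 vector records (assumed layout).
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Random;

class ShuffleVecFile {
  public static void main(String[] args) throws IOException {
    Path in = Path.of(args[0]);
    Path out = Path.of(args[1]);
    int dim = Integer.parseInt(args[2]); // e.g. 768 for the Cohere wiki vectors
    int recordBytes = dim * Float.BYTES;

    try (FileChannel src = FileChannel.open(in, StandardOpenOption.READ);
        FileChannel dst = FileChannel.open(out, StandardOpenOption.CREATE,
            StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
      if (src.size() % recordBytes != 0) {
        throw new IllegalArgumentException("file size is not a multiple of dim*4 -- wrong dim?");
      }
      int count = (int) (src.size() / recordBytes);

      // Fisher-Yates permutation with a fixed seed, so the run is reproducible
      int[] order = new int[count];
      for (int i = 0; i < count; i++) order[i] = i;
      Random rnd = new Random(42);
      for (int i = count - 1; i > 0; i--) {
        int j = rnd.nextInt(i + 1);
        int tmp = order[i]; order[i] = order[j]; order[j] = tmp;
      }

      // copy records to the output in permuted order
      ByteBuffer buf = ByteBuffer.allocate(recordBytes);
      for (int i = 0; i < count; i++) {
        long pos = (long) order[i] * recordBytes;
        buf.clear();
        while (buf.hasRemaining()) {
          if (src.read(buf, pos + buf.position()) < 0) {
            throw new IOException("unexpected EOF");
          }
        }
        buf.flip();
        while (buf.hasRemaining()) {
          dst.write(buf);
        }
      }
    }
  }
}
```

Then point knnPerfTest.py at the shuffled file and compare against the original-order run.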
@mikemccand @msokolov I was thinking about this, and talking with somebody working in JVector land, and they mentioned some of their merging plans.
During merge, we know ALL the vectors. Incremental updates should really only ever be done during initial flush.
However, during merge, we know all the vectors and have all the previous graphs. So, during initial candidate gathering, why not search the old graphs and translate their ordinals to the new vector ordinals? Then during candidate retrieval, we see everything. There is no longer a need to check for back-links for the majority of (maybe all?) vectors. In this way, this "2nd pass" thing sort of goes away.
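Very rough sketch of what I mean, again with the toy structures above rather than the actual Lucene merge code: for each incoming vector, search the old per-segment graphs and translate the resulting old ordinals to merged-segment ordinals, then use those as entry points / initial candidates when inserting into the merged graph:

```java
// Toy sketch, not the real merge path: brute force stands in for searching each
// old segment's existing HNSW graph.
import java.util.Map;

static List<Integer> gatherMergeCandidates(
    float[] query,
    List<List<float[]>> oldVectors,               // per old segment: vectors by old ordinal
    List<Map<Integer, Integer>> oldToNewOrdinal,  // per old segment: old ordinal -> new ordinal
    int k) {
  List<Integer> candidates = new ArrayList<>();
  for (int seg = 0; seg < oldVectors.size(); seg++) {
    // stand-in for searching that segment's existing HNSW graph
    List<Integer> topOld =
        ToyHnswBuild.nearest(oldVectors.get(seg), oldVectors.get(seg).size(), query, k);
    for (int oldOrd : topOld) {
      Integer newOrd = oldToNewOrdinal.get(seg).get(oldOrd);
      if (newOrd != null) { // null would mean the doc was deleted
        candidates.add(newOrd);
      }
    }
  }
  return candidates;
}
```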
Oooh that sounds great! It would make use of the previous HNSW graphs, whereas today we completely ignore/discard them (except maybe using the largest one as the starting graph for the merged graph)?
And it would mean we indeed automatically get that 2nd pass the first time flushed segments are merged? Very cool! Now I really wonder if this would be needle-moving on the recall vs hnsw graph size / visited nodes / search latency!
We would use Lucene's segment merging architecture to its advantage, I love it.
@mikemccand I created https://github.com/apache/lucene/issues/15504 to track it. I haven't had time to dig in yet. But it does seem like a fun exploration. Intuitively, it SHOULD provide a really nice benefit.
The idea is definitely intriguing. I'm thinking about how we would assign ordinals in the "new" graph. I'm sure we can come up with something, but ... if we want to retain today's encoding, ordinal ordering is congruent with docid ordering, and thus we require fully merging all the docs (at least to the extent of skipping deletions and sorting when there is an index sort) so we can know their ordinals in the merged segment. Another idea could be to merge using the existing ordinals and then rewrite the ordinals when writing the on-heap graph to disk.
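A toy sketch of that second idea, building the merged graph over temporary ordinals and remapping at write time (same toy adjacency lists as above, not the actual on-heap graph or codec code):

```java
// Rewrite both the node positions and their neighbor lists into the final
// docid-congruent ordinals at write time (toy structures).
static List<List<Integer>> remapOrdinals(List<List<Integer>> graph, int[] tempToFinal) {
  int n = graph.size();
  List<List<Integer>> remapped = new ArrayList<>();
  for (int i = 0; i < n; i++) remapped.add(null);
  for (int temp = 0; temp < n; temp++) {
    List<Integer> edges = new ArrayList<>(graph.get(temp).size());
    for (int nbr : graph.get(temp)) {
      edges.add(tempToFinal[nbr]); // translate each neighbor
    }
    remapped.set(tempToFinal[temp], edges); // the node itself moves to its final ordinal
  }
  return remapped;
}
```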