HNSW BP reorder tool
This is a working tool for re-ordering an index using BP (binary partitioning) over a vector field. It will renumber an existing HNSW graph over the field without the need to re-index it. There are a bunch of TODOs to be worked out, but the results have been promising so far so I thought I should share it in its current state.
Highlights:
On a 2M-vector index it is able to compress the vex file more than 30% (from 89M to 61M). Additionally, query latency when searching the index dropped by 20% (from 0.7 ms to 0.56ms) with no loss in precision/recall.
Limitations:
The current tool only works with floating-point vectors. It doesn't provide any mechanism to maintain an index sort that uses BPV, nor is there a merge policy.
I've tested with Euclidean distance only. In theory this will work for other distances as well, but I'm not fully convinced that grouping vectors by averaging (centroid) is the right thing to do for the angular distances. I think it should work OK though, just haven't had time to test.
Other:
To make the reordering efficient I needed to expose a few details of the HNSW index: I added KnnVectorReader.getGraph() and HnswGraph.maxConn(). In the past we've chosen not have the graph exposed by the vector reader since it is seen as an internal detail that is tied to a specific ANN implementation. The compromise I made here is that the method can return null. The alternative is to make Lucene99HnswVectorReader (and maybe some other classes?) be an HnswGraphProvider have class cast checks for HnswGraphProvider - I think a null check is cleaner, and with this change I was able to remove a bunch of casting.
I enhanced SortingCodecReader to be able to sort vector values off-heap in case they are dense.This could be extended to the sparse case as well.
Future ideas:
To make this operationally-useful I think we need at a minimum the merge policy, but I think ultimately it belongs in the codec. I would like to make it compatilbe with a different (index) sort over docids. This would requires changes to the way nodes and docids are associated by the codec, and because of the way our vector codecs have different codecs for every possible combination of options (Byte vs Float, quantized vs not) this will be a lot of code changes, and I'd like to get the core feature working well before tackling that.
Also: try group-varint encoding for the graph neighbors to make better use of the reduced deltas?
addresses https://github.com/apache/lucene/issues/13565
because of the way our vector codecs have different codecs for every possible combination of options (Byte vs Float, quantized vs not) this will be a lot of code changes
I'm not the most familiar one with our vector file formats, but my understanding is that we already maintain a node ID <-> doc ID mapping, e.g. to retain dense node IDs in the case when not all docs have a vector. Could it "simply" be a matter of allowing this mapping to not be monotonic?
I'm not the most familiar one with our vector file formats, but my understanding is that we already maintain a node ID <-> doc ID mapping, e.g. to retain dense node IDs in the case when not all docs have a vector. Could it "simply" be a matter of allowing this mapping to not be monotonic?
Yes I think so. I guess we'd also have to decide whether to maintain both formats given that it would be more efficient to maintain the monotonic mapping in cases where we don't need the full indirection (eg we didn't apply BP).
Updated to fix a few boneheaded mistakes, added support for off-heap sorting of sparse vector values.
I'd like to get this tool merged before moving on to thinking about different index encodings. To that end I plan to ensure that it works with all the vector similarities, and with byte-encoded and quantized vector encodings, and then open separate issues for the other things mentioned above.
Hmm, I realized that the clever casting trick I used in this version (casting OffHeapHnswVectorValues to RandomAccessVectorValues.Floats) won't work with the quantized vector values flavors, and doing this cleanly will require moving to a first-class random access API. So I think it may be better to get this merged in more or less as is (I have some small tweaks for DOT_PRODUCT) and handle any such changes separately.
I'd like to get this tool merged
Some API changes are a bit annoying, e.g. the fact that KnnVectorsReader now has an API that is specific to HNSW. I'm sorry as I know it makes iterating a bit more challenging, but I'd rather like to avoid trading API cleanliness for this feature?
Well, I don't mind just leaving this on a branch/PR, but I didn't see a better alternative; I talked about one other possibility in the comments above, but this seemed the cleanest to me.
I can bring back HnswGraphProvider with all of its casts and instanceof checks, maybe that provides a fig leaf
I honestly thought it was a mistake, but now I realize the hiding of getGraph() was intentional! I restored it now. I think this should be OK to merge?
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
this was superseded by #14097