Store HNSW graph connections more compactly
Description
HNSW search is most efficient when all vector data fits in the page cache, so it's good to keep the vector files as small as possible.
We currently write all HNSW graph connections as fixed-size integers. This is wasteful since most graphs have far fewer nodes than the max integer value: https://github.com/apache/lucene/blob/d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956/lucene/core/src/java/org/apache/lucene/codecs/lucene94/Lucene94HnswVectorsWriter.java#L478
Maybe instead we could store the connection list using PackedInts.Writer. This would decrease the number of bits needed to store each connection. We could still ensure that every connection list takes the same number of bytes, to continue being able to index into the graph data easily.
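To make the fixed-width point concrete, here is a small illustrative sketch (the names graphStart, bytesPerNeighborList, and vectorIndexInput are hypothetical, not actual Lucene fields): if every neighbor list occupies the same number of bytes, a node's data can be located with simple arithmetic rather than a per-node offset table.

// Hypothetical sketch: random access into fixed-width graph data.
long neighborListOffset(int node, long graphStart, int bytesPerNeighborList) {
  return graphStart + (long) node * bytesPerNeighborList;
}

// Usage (illustrative): position the graph-data IndexInput at a node's neighbor list.
// vectorIndexInput.seek(neighborListOffset(node, level0Start, bytesPerNeighborList));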
I quickly tested the idea on the 1 million vector GloVe dataset, and saw the graph data size decrease by ~30%:
Baseline
155M luceneknn-100-16-100.train-16-100.index/_6_Lucene94HnswVectorsFormat_0.vex
Hacky patch
103M luceneknn-100-16-100.train-16-100.index/_5_Lucene94HnswVectorsFormat_0.vex
Nice idea, @jtibshirani! Do you have a sense of the performance of reading values packed this way during search? Will it make searches any slower?
I like the idea, although I confess I don't understand how we can make it fixed width! I guess if we know the max number and it is small, we can quantize more cheaply?
Also, I wonder if we can get some benefit from delta encoding, since we always read the neighbors in order.
My idea: given the total number of vectors in the graph, calculate the minimum number of bits required to represent these and round that up to the nearest byte. For example, the code to write the graph would look something like this:
  private void writeGraph(OnHeapHnswGraph graph) throws IOException {
    if (graph == null) return;
    // write vectors' neighbours on each level into the vectorIndex file
    int countOnLevel0 = graph.size();
+   int bitsRequired = PackedInts.bitsRequired(countOnLevel0);
    for (int level = 0; level < graph.numLevels(); level++) {
      int maxConnOnLevel = level == 0 ? (M * 2) : M;
      NodesIterator nodesOnLevel = graph.getNodesOnLevel(level);
      while (nodesOnLevel.hasNext()) {
        int node = nodesOnLevel.nextInt();
        NeighborArray neighbors = graph.getNeighbors(level, node);
        int size = neighbors.size();
        vectorIndex.writeInt(size);
        // Destructively modify; it's ok we are discarding it after this
        int[] nnodes = neighbors.node();
        Arrays.sort(nnodes, 0, size);
+       PackedInts.Writer packedIntsWriter = PackedInts.getWriterNoHeader(
+           vectorIndex, PackedInts.Format.PACKED, maxConnOnLevel, bitsRequired, 1);
        for (int i = 0; i < size; i++) {
          int nnode = nnodes[i];
-         vectorIndex.writeInt(nnode);
+         packedIntsWriter.add(nnode);
        }
-       // if number of connections < maxConn,
-       // add bogus values up to maxConn to have predictable offsets
-       for (int i = size; i < maxConnOnLevel; i++) {
-         vectorIndex.writeInt(0);
-       }
+       // if number of connections < maxConn, this call automatically adds padding
+       packedIntsWriter.finish();
      }
    }
  }
As an example, if a graph had 1 million nodes, bitsRequired would only be 20. So with the default of 32 neighbors on the bottom layer, this would be 80 bytes (80 = 20 bits * 32 neighbors / 8) per neighbor array. Currently, we require 128 bytes (4 bytes * 32 neighbors).
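For concreteness, here is a tiny self-contained snippet reproducing that arithmetic (just an illustration: it combines PackedInts.bitsRequired with plain math, nothing codec-specific):

import org.apache.lucene.util.packed.PackedInts;

public class NeighborListSizeDemo {
  public static void main(String[] args) {
    int numNodes = 1_000_000;
    int maxConn = 32; // default neighbor count on the bottom layer

    int bitsRequired = PackedInts.bitsRequired(numNodes); // 20 bits for 1M nodes
    int packedBytes = (bitsRequired * maxConn + 7) / 8;   // 80 bytes per neighbor list
    int currentBytes = Integer.BYTES * maxConn;           // 128 bytes per neighbor list

    System.out.println("packed=" + packedBytes + "B current=" + currentBytes + "B");
  }
}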
I don't think this would significantly slow down searching, since we'd still just iterate through the neighbor arrays; it's just that instead of reading signed integers, we'd decode bitsRequired bits at a time. There may be some details to sort through though, like avoiding creating too many PackedInts decoder objects? (As a caveat, I've never worked with PackedInts before, so let me know if this doesn't make sense!)
I was able to replicate @jtibshirani's results. Using glove-100-angular from ann-benchmarks, here is the graph data size:
Baseline:
154M _0_Lucene94HnswVectorsFormat_0.vex
PackedInts:
103M _0_Lucene94HnswVectorsFormat_0.vex
Now, the PackedInts implementation followed @jtibshirani's "hacky patch" closely. There are various changes required on read, and new information needs to be stored regarding the PackedInts version used. In all (disregarding the fact that a new codec must be written), the code change is pretty small.
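For reference, the read side might look roughly like this (a sketch under assumptions, not the actual patch: dataIn is an IndexInput positioned at the start of a neighbor list, and packedIntsVersion, maxConnOnLevel, and bitsRequired are assumed to be recorded in, or derivable from, the field's metadata):

// Sketch: decode one neighbor list written with PackedInts.getWriterNoHeader.
int size = dataIn.readInt();
PackedInts.Reader packed = PackedInts.getReaderNoHeader(
    dataIn, PackedInts.Format.PACKED, packedIntsVersion, maxConnOnLevel, bitsRequired);
for (int i = 0; i < size; i++) {
  int neighbor = (int) packed.get(i); // entries beyond `size` are just padding
  // ... visit/score the neighbor ...
}

Note that getReaderNoHeader decodes the list into a small heap object, which ties into the earlier caveat about creating too many decoder objects; an implementation might want to reuse or avoid these per-list readers.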
Other good news: when running ann-benchmarks on glove-100-angular, there is no discernible difference in search performance. I am not sure how much traffic the off-heap reader gets in this benchmark; it seems to me it would be read once and stored in memory, though I might be misunderstanding something.
@jtibshirani @mayya-sharipova what do y'all think? Any ideas on how to force off-heap search so we get better numbers on the search changes?
Honestly, we may not care about slightly slower off-heap search performance, since smaller files are more likely to fit in the page cache.
@benwtrent thanks for looking into this! To clarify, OffHeapHnswGraph will always be used for searches. Usually it will be paged in from disk before heavy searching, but it's still the same data structure. The on-heap version is only used for indexing.
Could you include a comparison of the search latency and recall numbers you see with this approach? Sometimes with our benchmarks it's easy to miss small differences in performance.
Yes, I can provide some images and numbers. What I saw when running glove-100-angular batched is that there is no difference. Will run more tests for sure.
OK, I ran this patch against mnist, sift, glove, and deep image. Recall is exactly the same in all cases, so no mistakes there.
There are indeed slight differences in latency; images are below. I ran search 5 times for each parameter combination and chose the fastest. All of these were with "M": 16, "efConstruction": 100.
For each, I specifically point out the QPS at the lowest recall.
The deltas in QPS vary from near 0% change to an 11% decrease. 11% does seem large, but I think it's an outlier. The median decrease is 2%, and the average change is 0% (or 0.002%). ann_qps_comparison - ann_scores.csv
For every data set, the memory required was reduced by a similar amount.
This change seems like a good step forward to me.
@jtibshirani
[QPS comparison charts (images): Deep Image, Glove, MNIST, SIFT]
Thanks @benwtrent, the results look promising! It seems like a good time to open a PR. We might even have ideas for optimizations once we see how the change looks.
I changed the PR to move towards delta encoding & vint. Even with storing the memory offsets within vex, the storage improvements are much better than with PackedInts.
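To sketch what "delta encoding & vint" means here (illustrative only, not the exact PR code; it reuses the vectorIndex, nnodes, and size names from the write snippet earlier in the thread):

// Neighbors are sorted, so each one can be stored as the small gap from the
// previous neighbor, and each gap as a variable-length vint.
Arrays.sort(nnodes, 0, size);
vectorIndex.writeVInt(size);
int previous = 0;
for (int i = 0; i < size; i++) {
  vectorIndex.writeVInt(nnodes[i] - previous);
  previous = nnodes[i];
}
// Lists are now variable length, so per-node offsets into the .vex file have to
// be stored explicitly (the "memory offsets within vex" mentioned above).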
Table with some numbers around the size improvements for different data sets & parameters:
| packed_vex_mb_size | vex_mb_size | packed_index_build_time | index_build_time | params | dataset | percent_reduction |
|---|---|---|---|---|---|---|
| 79.9 | 161.6 | 767 | 784 | {'M': 16, 'efConstruction': 100} | glove-100-angular | 50.55693069 |
| 108.4 | 464.1 | 1138 | 1225 | {'M': 48, 'efConstruction': 100} | glove-100-angular | 76.64296488 |
| 2.3 | 8.2 | 36 | 36 | {'M': 16, 'efConstruction': 100} | mnist-784-euclidean | 71.95121951 |
| 2.4 | 23.5 | 36 | 36 | {'M': 48, 'efConstruction': 100} | mnist-784-euclidean | 89.78723404 |
| 66.1 | 392.2 | 501 | 572 | {'M': 48, 'efConstruction': 100} | sift-128-euclidean | 83.1463539 |
| 59.7 | 136.6 | 449 | 516 | {'M': 16, 'efConstruction': 100} | sift-128-euclidean | 56.29575403 |
For the curious, here are the QPS numbers (higher is better) for packed (delta & vint) vs baseline:
[QPS comparison charts (images): Glove, MNIST, SIFT]
Hey, this looks great! Awesome to see the storage gains with no loss in query time.