Store HNSW graph connections more compactly
Description
HNSW search is most efficient when all vector data fits in the page cache, so it's good to keep the vector files as small as possible.
We currently write all HNSW graph connections as fixed-size integers. This is wasteful since most graphs have far fewer nodes than the max integer value: https://github.com/apache/lucene/blob/d2e22e18c6c92b6a6ba0bbc26d78b5e82832f956/lucene/core/src/java/org/apache/lucene/codecs/lucene94/Lucene94HnswVectorsWriter.java#L478
Maybe instead we could store the connection list using PackedInts.Writer. This would decrease the number of bits needed to store each connection. We could still ensure that every connection list takes the same number of bytes, to continue being able to index into the graph data easily.
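To make the fixed-width point concrete, here is a small illustrative sketch (the names graphStart, bytesPerNeighborList, and vectorIndexInput are hypothetical, not actual Lucene fields): if every neighbor list occupies the same number of bytes, a node's data can be located with simple arithmetic rather than a per-node offset table.

// Hypothetical sketch: random access into fixed-width graph data.
long neighborListOffset(int node, long graphStart, int bytesPerNeighborList) {
  return graphStart + (long) node * bytesPerNeighborList;
}

// Usage (illustrative): position the graph-data IndexInput at a node's neighbor list.
// vectorIndexInput.seek(neighborListOffset(node, level0Start, bytesPerNeighborList));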
I quickly tested the idea on the 1 million vector GloVe dataset, and saw the graph data size decrease by ~30%:
Baseline
155M luceneknn-100-16-100.train-16-100.index/_6_Lucene94HnswVectorsFormat_0.vex
Hacky patch
103M luceneknn-100-16-100.train-16-100.index/_5_Lucene94HnswVectorsFormat_0.vex
Nice idea, @jtibshirani! Do you have a sense of the performance of reading values packed this way during search? Will it make searches any slower?
I like the idea, although I confess I don't understand how we can make it fixed width! I guess if we know the max number and it is small, we can quantize more cheaply?
Also, I wonder if we can get some benefit from delta encoding, since we always read the neighbors in order.
My idea: given the total number of vectors in the graph, calculate the minimum number of bits required to represent these and round that up to the nearest byte. For example, the code to write the graph would look something like this:
  private void writeGraph(OnHeapHnswGraph graph) throws IOException {
    if (graph == null) return;
    // write vectors' neighbours on each level into the vectorIndex file
    int countOnLevel0 = graph.size();
+   int bitsRequired = PackedInts.bitsRequired(countOnLevel0);
    for (int level = 0; level < graph.numLevels(); level++) {
      int maxConnOnLevel = level == 0 ? (M * 2) : M;
      NodesIterator nodesOnLevel = graph.getNodesOnLevel(level);
      while (nodesOnLevel.hasNext()) {
        int node = nodesOnLevel.nextInt();
        NeighborArray neighbors = graph.getNeighbors(level, node);
        int size = neighbors.size();
        vectorIndex.writeInt(size);
        // Destructively modify; it's ok we are discarding it after this
        int[] nnodes = neighbors.node();
        Arrays.sort(nnodes, 0, size);
+       PackedInts.Writer packedIntsWriter = PackedInts.getWriterNoHeader(
+           vectorIndex, PackedInts.Format.PACKED, maxConnOnLevel, bitsRequired, 1);
        for (int i = 0; i < size; i++) {
          int nnode = nnodes[i];
-         vectorIndex.writeInt(nnode);
+         packedIntsWriter.add(nnode);
        }
-       // if number of connections < maxConn,
-       // add bogus values up to maxConn to have predictable offsets
-       for (int i = size; i < maxConnOnLevel; i++) {
-         vectorIndex.writeInt(0);
-       }
+       // if number of connections < maxConn, this call automatically adds padding
+       packedIntsWriter.finish();
      }
    }
  }
As an example, if a graph had 1 million nodes, bitsRequired would only be 20. So with the default of 32 neighbors on the bottom layer, this would be 80 bytes (80 = 20 bits * 32 neighbors / 8) per neighbor array. Currently, we require 128 bytes (4 bytes * 32 neighbors).
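For concreteness, here is a tiny self-contained snippet reproducing that arithmetic (just an illustration: it combines PackedInts.bitsRequired with plain math, nothing codec-specific):

import org.apache.lucene.util.packed.PackedInts;

public class NeighborListSizeDemo {
  public static void main(String[] args) {
    int numNodes = 1_000_000;
    int maxConn = 32; // default neighbor count on the bottom layer

    int bitsRequired = PackedInts.bitsRequired(numNodes); // 20 bits for 1M nodes
    int packedBytes = (bitsRequired * maxConn + 7) / 8;   // 80 bytes per neighbor list
    int currentBytes = Integer.BYTES * maxConn;           // 128 bytes per neighbor list

    System.out.println("packed=" + packedBytes + "B current=" + currentBytes + "B");
  }
}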
I don't think this would significantly slow down searching, since we'd still just iterate through the neighbor arrays; it's just that instead of reading signed integers, we'd decode bitsRequired bits at a time. There may be some details to sort through though, like avoiding creating too many PackedInts decoder objects? (As a caveat, I've never worked with PackedInts before, so let me know if this doesn't make sense!)
I was able to replicate @jtibshirani's results. Using glove-100-angular from ann-benchmarks, here is the graph data size:
Baseline:
154M _0_Lucene94HnswVectorsFormat_0.vex
PackedInts:
103M _0_Lucene94HnswVectorsFormat_0.vex
Now, the PackedInts implementation followed @jtibshirani's "hacky patch" closely. There are various changes required on read, and new information needs to be stored regarding the PackedInts version used. In all (disregarding the fact that a new codec must be written), the code change is pretty small.
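For reference, the read side might look roughly like this (a sketch under assumptions, not the actual patch: dataIn is an IndexInput positioned at the start of a neighbor list, and packedIntsVersion, maxConnOnLevel, and bitsRequired are assumed to be recorded in, or derivable from, the field's metadata):

// Sketch: decode one neighbor list written with PackedInts.getWriterNoHeader.
int size = dataIn.readInt();
PackedInts.Reader packed = PackedInts.getReaderNoHeader(
    dataIn, PackedInts.Format.PACKED, packedIntsVersion, maxConnOnLevel, bitsRequired);
for (int i = 0; i < size; i++) {
  int neighbor = (int) packed.get(i); // entries beyond `size` are just padding
  // ... visit/score the neighbor ...
}

Note that getReaderNoHeader decodes the list into a small heap object, which ties into the earlier caveat about creating too many decoder objects; an implementation might want to reuse or avoid these per-list readers.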
Other good news: when running ann-benchmarks on glove-100-angular, there is no discernible difference in search performance. I am not sure how much traffic the off-heap reader gets in this benchmark; it seems to me it would be read once and stored in memory, though I might be misunderstanding something.
@jtibshirani @mayya-sharipova what do y'all think? Any ideas on how to force off-heap search so we get better numbers on the search changes?
Honestly, we may not care about slightly slower off-heap search performance, since smaller files are more likely to fit in the page cache.
@benwtrent thanks for looking into this! To clarify, OffHeapHnswGraph will always be used for searches. Usually it will be paged in from disk before heavy searching, but it's still the same data structure. The on-heap version is only used for indexing.
Could you include a comparison of the search latency and recall numbers you see with this approach? Sometimes with our benchmarks it's easy to miss small differences in performance.
Yes, I can provide some images and numbers. What I saw when running glove-100-angular batched is that there is no difference. Will run more tests for sure.
OK, I ran this patch against mnist, sift, glove, and deep image. Recall is exactly the same in all cases, so no mistakes there.
There are indeed slight differences in latency; images are below. I ran search 5 times for each parameter combination and chose the fastest. All of these were with "M": 16, "efConstruction": 100.
For each, I specifically point out the QPS at the lowest recall.
The deltas in QPS vary from near 0% change to an 11% decrease. 11% does seem large, but I think it's an outlier. The median decrease is 2%, and the average change is 0% (or 0.002%). ann_qps_comparison - ann_scores.csv
For every data set, the memory required was reduced by a similar amount.
This change seems like a good step forward to me.
@jtibshirani
[QPS comparison charts (images): Deep Image, Glove, MNIST, SIFT]
Thanks @benwtrent, the results look promising! It seems like a good time to open a PR. We might even have ideas for optimizations once we see how the change looks.
I changed the PR to move towards delta encoding & vint. Even with storing the memory offsets within vex, the storage improvements are much better than with PackedInts.
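To sketch what "delta encoding & vint" means here (illustrative only, not the exact PR code; it reuses the vectorIndex, nnodes, and size names from the write snippet earlier in the thread):

// Neighbors are sorted, so each one can be stored as the small gap from the
// previous neighbor, and each gap as a variable-length vint.
Arrays.sort(nnodes, 0, size);
vectorIndex.writeVInt(size);
int previous = 0;
for (int i = 0; i < size; i++) {
  vectorIndex.writeVInt(nnodes[i] - previous);
  previous = nnodes[i];
}
// Lists are now variable length, so per-node offsets into the .vex file have to
// be stored explicitly (the "memory offsets within vex" mentioned above).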
Table with some numbers around the size improvements for different data sets & parameters:
| packed_vex_mb_size | vex_mb_size | packed_index_build_time | index_build_time | params | dataset | percent_reduction |
|---|---|---|---|---|---|---|
| 79.9 | 161.6 | 767 | 784 | {'M': 16, 'efConstruction': 100} | glove-100-angular | 50.55693069 |
| 108.4 | 464.1 | 1138 | 1225 | {'M': 48, 'efConstruction': 100} | glove-100-angular | 76.64296488 |
| 2.3 | 8.2 | 36 | 36 | {'M': 16, 'efConstruction': 100} | mnist-784-euclidean | 71.95121951 |
| 2.4 | 23.5 | 36 | 36 | {'M': 48, 'efConstruction': 100} | mnist-784-euclidean | 89.78723404 |
| 66.1 | 392.2 | 501 | 572 | {'M': 48, 'efConstruction': 100} | sift-128-euclidean | 83.1463539 |
| 59.7 | 136.6 | 449 | 516 | {'M': 16, 'efConstruction': 100} | sift-128-euclidean | 56.29575403 |
For the curious, here are the QPS numbers (higher is better) for packed (delta & vint) vs baseline:
[QPS comparison charts (images): Glove, MNIST, SIFT]
Hey, this looks great! Awesome to see the storage gains with no loss in query time.