lucene icon indicating copy to clipboard operation
lucene copied to clipboard

Should we explore DiskANN for aKNN vector search?

Open mikemccand opened this issue 1 year ago • 41 comments

Description

I came across this compelling sounding JVector project which looks to have awesome QPS performance.

It uses DiskANN instead of HNSW (what Lucene uses now).

Maybe we should explore another aKNN Codec implementation using this? It'd also be a good test of the Codec pluggability of our KNN implementation.

mikemccand avatar Oct 03 '23 10:10 mikemccand

I do think Lucene's read-only segment based architecture leads itself to support quantization (required for DiskANN).

It would be an interesting experiment to see how indexing times, segment merges, etc. are all handled with DiskANN.

I wonder how much the speed difference is:

  • Vectors being out of memory (and if they used PQ for diskann, if they did, we should test PQ with HNSW).
  • Not merging to a single segment and searching multiple segments serially.

benwtrent avatar Oct 03 '23 11:10 benwtrent

@benwtrent For merges there is "FreshDiskANN: A Fast and Accurate Graph-Based ANN Index for Streaming Similarity Search" https://arxiv.org/pdf/2105.09613.pdf

DiskANN is known to be slower at indexing than HNSW and the blog post does not compare single threaded index times with Lucene.

robertvanwinkle1138 avatar Oct 04 '23 16:10 robertvanwinkle1138

DiskANN is known to be slower at indexing than HNSW and the blog post does not compare single threaded index times with Lucene.

@robertvanwinkle1138 this is just one of my concerns.

Additionally, I think we would still have the "segment merging problem". I do think we can get HNSW merges down by keeping connections between graphs, I just haven't had cycles myself to dig into that. There is a Lucene issue around making HNSW merges faster somewhere...

https://github.com/apache/lucene/issues/12440

benwtrent avatar Oct 04 '23 16:10 benwtrent

A hybrid disk-memory algorithm would have very strong benefits. I did run a few tests recently that confirmed HNSW does not function very well when memory gets constrained (which I think everyone already knew).

I wonder though, instead of DiskANN, what about a partitioning based approach such as SPANN? I think a partitioning based approach for Lucene would make merging, updating, filtering and indexing a lot easier. Also, it seems it would have better disk-access patterns. In the paper, they do show that in a memory constrained environment, they were able to outperfrom DiskANN.

I guess the tradeoff might be that partitioning based approaches would struggle to achieve really low latency search when in memory compared to graph-based approaches. Additionally, partitioning approaches would require a potentially expensive "training" or "preprocessing" step such as k-Means and performance could degrade if data distribution drifts and the models are not updated. But, if PQ is implemented, the same considerations would need to be taken as well.

jmazanec15 avatar Oct 04 '23 17:10 jmazanec15

@jmazanec15 I agree that SPANN seems more attractive. I would argue though we don't need to do clustering (in the paper they do clustering, but with minimal effectiveness), but could maybe get away with randomly selecting a subset of vectors per segment.

benwtrent avatar Oct 04 '23 17:10 benwtrent

The SPANN paper does not address efficient filtered queries. For example, Lucene's HNSW calculates the similarity score for every record, regardless of the record matching the filter.

Filtered − DiskANN [1] describes a solution for efficient filtered queries.

QDrant has a filter solution however the methodology described in their blog is opaque.

  1. https://dl.acm.org/doi/pdf/10.1145/3543507.3583552

As Approximate Nearest Neighbor Search (ANNS)-based dense retrieval becomes ubiquitous for search and recommendation scenarios, efciently answering fltered ANNS queries has become a critical requirement. Filtered ANNS queries ask for the nearest neighbors of a query’s embedding from the points in the index that match the query’s labels such as date, price range, language. There has been little prior work on algorithms that use label metadata associated with vector data to build efcient indices for fltered ANNS queries. Consequently, current indices have high search latency or low recall which is not practical in interactive web-scenarios. We present two algorithms with native support for faster and more accurate fltered ANNS queries: one with streaming support, and another based on batch construction. Central to our algorithms is the construction of a graph-structured index which forms connections not only based on the geometry of the vector data, but also the associated label set. On real-world data with natural labels, both algorithms are an order of magnitude or more efcient for fltered queries than the current state of the art algorithms. The generated indices also be queried from an SSD and support thousands of queries per second at over 90% recall@10.

robertvanwinkle1138 avatar Oct 05 '23 15:10 robertvanwinkle1138

QDrant has a filter solution however the methodology described in their blog is opaque.

QDrant's HNSW filter solution is the exact same as Lucene's. You can look at the code, they don't filter candidate exploration but filer result collection.

You are correct that filtering with SPANN would be different. Though I am not sure its intractable.

It is possible that the candidate postings (gathered via HNSW) don't contain ANY filtered docs. This would require gathering more candidate postings.

But I think we can do that before scoring. So, as candidate posting lists are gathered, ensure they have some candidates.

The SPANN repository supports filtering, and its OSS, so we could always just read what they did.

benwtrent avatar Oct 05 '23 17:10 benwtrent

QDrant's HNSW filter solution is the exact same as Lucene's

Interesting thanks.

as candidate posting lists are gathered, ensure they have some candidates

Couldn't that be done with the current Lucene HNSW implementation?

The SPANN repository supports filtering, and its OSS, so we could always just read what they did.

This search with filter method seems to throw an error.

https://github.com/microsoft/SPTAG/blob/main/AnnService/src/Core/SPANN/SPANNIndex.cpp#L262

robertvanwinkle1138 avatar Oct 06 '23 14:10 robertvanwinkle1138

This search with filter method seems to throw an error.

LOL, I thought it was supported, I must have read a github issue and made an assumption.

Couldn't that be done with the current Lucene HNSW implementation?

I think so. The tricky thing is making sure enough matching postings are gathered to give an accurate number for recall. Having to go back to the HNSW graph to get more postings list could get expensive if we keep getting to few matches.

benwtrent avatar Oct 06 '23 15:10 benwtrent

SPANN is another option?

https://www.researchgate.net/publication/356282356_SPANN_Highly-efficient_Billion-scale_Approximate_Nearest_Neighbor_Search

mikemccand avatar Oct 07 '23 15:10 mikemccand

(listening to @jbellis talk at Community over Code).

mikemccand avatar Oct 07 '23 15:10 mikemccand

Or perhaps we "just" make a Lucene Codec component (KnnVectorsFormat) that wraps jvector? (https://github.com/jbellis/jvector)

mikemccand avatar Oct 07 '23 15:10 mikemccand

What say you @jbellis :-) I recommended a module of Lucene when we spoke at Community-over-Code. A dependency outside is okay for non-core.

dsmiley avatar Oct 20 '23 05:10 dsmiley

Responding top to bottom,

I wonder how much the speed difference is due to (1) Vectors being out of memory (and if they used PQ for diskann, if they did, we should test PQ with HNSW). (2) Not merging to a single segment and searching multiple segments serially.

(1) 90% of it is the PQ, yes. I assume that storing the vector inline w/ the graph also helps some but I did not test that separately. You could definitely get a big speed up just using PQ on top of HNSW.

(2) Single segment in both cases. (JVector leaves segment management as an exercise for the user.)

jbellis avatar Oct 20 '23 13:10 jbellis

DiskANN is known to be slower at indexing than HNSW

I don't remember the numbers here, maybe 10% slower? It wasn't material enough to make me worry about it. (This is with using an HNSW-style incremental build, not the two-pass build described in the paper.)

and the blog post does not compare single threaded index times with Lucene.

It was about 30% worse several months ago with my concurrent hnsw patch, should be significantly improved now since JVector (1) doesn't need the CompletionTracker to keep the layers consistent, b/c it's single layer, (2) added a DenseIntMap concurrent collection to replace ConcurrentHashMap for the nodes.

jbellis avatar Oct 20 '23 13:10 jbellis

It is possible that the candidate postings (gathered via HNSW) don't contain ANY filtered docs. This would require gathering more candidate postings.

This was a big problem for our initial deployment, we'd filter down to a few valid items and then spend forever searching the graph for them. Had to add a fairly accurate estimate of how many comparisons the index would need, and use that to decide whether to brute-force the comparison instead. (This is in Cassandra, not JVector, specifically VectorMemtableIndex.expectedNodesVisited.) I don't remember seeing code for this in Lucene but I mostly only looked at the HNSW code so I could have missed it.

jbellis avatar Oct 20 '23 13:10 jbellis

Or perhaps we "just" make a Lucene Codec component (KnnVectorsFormat) that wraps jvector? (https://github.com/jbellis/jvector)

I'm happy to support anyone who wants to try this, including modifying JVector to make it a better fit if necessary. I won't have time to tackle this myself in the medium-term, however.

jbellis avatar Oct 20 '23 13:10 jbellis

So, I replicated the jvector benchmark (the lucene part) using the new int8 quantization.

Note, this is with 0 fan out or extra top-k gathered. Since the benchmark on JVector didn't specify any recall, etc. I just did the absolute baseline of top-100.

I reserved 12GB to heap, thus reducing off-heap memory to at most 20GB. (I only have 32GB of ram)

1 thread over 37 segments:
completed 1000 searches in 18411 ms: 54 QPS CPU time=18231ms
checking results
0.777	18.23	100000000	0	16	100	100	0	1.00	post-filter

Since kNN allows the segments to be searched in parallel, I used 8 threads for the query rewrite

8 threads over 37 segments:
completed 1000 searches in 2996 ms: 333 QPS CPU time=218ms
checking results
0.777	 0.22	100000000	0	16	100	100	0	1.00	post-filter

I am currently force-merging to a single segment to see what a single graph gives us.

FYI: the data set would require > 40GB of ram to be held in memory. With int8 quantization, its down to around 10GB.

EDIT:

I force merged down to 8 segments. A single segment here is just huge, I thought it prudent to test something more paletable in segment size. Though, these segments are still fairly large (~8 GB). Still using multiple threads to search so that each segment has its own thread:

completed 1000 searches in 1107 ms: 903 QPS CPU time=71ms
checking results
0.755	 0.07	100000000	0	16	100	100	0	1.00	post-filter

But, searching with just a single thread over 8 segment is still not too shabby (when comparing to the QPS claimed by JVector)

INFO: Java vector incubator API enabled; uses preferredBitSize=128
completed 1000 searches in 4693 ms: 213 QPS CPU time=4614ms
checking results
0.755	 4.61	100000000	0	16	100	100	0	1.00	post-filter

benwtrent avatar Nov 01 '23 11:11 benwtrent

Hey @benwtrent and all, just wanted to let you know that I'm experimenting some with different index structures for larger than memory indexes.

I have a working implementation of Vamana based off the existing HNSW implementation (with vectors colocated with the adjacency lists in storage) in this branch, which I'm currently working on integrating scalar quantization with, and a benchmarking framework here which can run various Lucene and JVector configurations.

I don't have many numbers to share yet besides the fact that the Vamana graph implementation in a single segment seems competitive while in memory with Lucene HNSW (single segment) and JVector on small data sets. For glove-100-angular from ann-benchmarks (k=10):

lucene_hnsw_maxConn:16-beamWidth:100_numCandidates:150
	average recall 0.8227400000000001
	average duration PT0.000968358S
        index size: 529M

jvector_vamana_M:16-beamWidth:100-neighborOverflow:2.0-alpha:1.2_pqFactor:0-numCandidates:100
	average recall 0.8242200000000001
	average duration PT0.00124232S
        index size: 703M

lucene_sandbox-vamana_maxConn:32-beamWidth:100-alpha:1.2-_numCandidates:100
	average recall 0.82553
        average duration PT0.000940756S
        index size: 554M

I plan on testing vamana without PQ, vamana with PQ (a la DiskANN), as well as SPANN. Happy to collaborate with anyone interested.

kevindrosendahl avatar Nov 03 '23 22:11 kevindrosendahl

@kevindrosendahl this looks really interesting! Thank you for digging in and starting the experimentation!

I haven't had a chance to read your branch yet, but hope to soon.

benwtrent avatar Nov 07 '23 11:11 benwtrent

I haven't had a chance to read your branch yet, but hope to soon.

Great, thanks! To save you a bit of time, the tl;dr of going from HNSW to vamana is that it's actually fairly simple in the end. For the most part, I just removed all references to levels and changed the diversity check to incorporate vamana's alpha parameter (essentially just adding a second, outer loop). There were a few small other implementation detail changes as well like pruning down all the way to M when passing the overflow threshold instead of just removing one neighbor.

Then on the storage size, you just write vectors into the graph instead of into the .vec file, and read from the right offsets to get the vector values as appropriate.

kevindrosendahl avatar Nov 07 '23 21:11 kevindrosendahl

@kevindrosendahl if I am reading the code correctly, it does the following:

  • Write int8 quantized vectors along side the vector ordinals in the graph (.vex or whatever has a copy of each vector).
  • Continue to write vectors in .vec. I am guessing this is a stop-gap and you are thinking of removing this? Maybe not?

I have a concern around index sorting. How does building the graph & subsequent getFloatVectors() play with index sorting? Usually when folks sort an index, they expect to be able to iterate values in that given order. Is it horrifically slow to iterate .vex in this scenario?

What do we think about always keeping .vec around? Probably should for re-ranking purposes once more extreme quantization measures are used.

One more question, have you tested your implementation in the situation where .vex cannot all be paged into memory and it faired ok?

benwtrent avatar Nov 08 '23 21:11 benwtrent

Perhaps much of the jvector performance improvement is simply from on heap caching.

https://github.com/jbellis/jvector/blob/main/jvector-base/src/main/java/io/github/jbellis/jvector/disk/GraphCache.java

robertvanwinkle1138 avatar Nov 09 '23 23:11 robertvanwinkle1138

Another notable difference in the Lucene implementation is delta variable byte encoding of node ids. The increase in disk space requires the user to purchase more RAM per server. Also variable byte encoding is significantly slower to decode than raw integers.

In the example listed the jvec index size is a whopping 32% greater.

https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsReader.java#L593

https://github.com/jbellis/jvector/blob/main/jvector-base/src/main/java/io/github/jbellis/jvector/disk/OnDiskGraphIndex.java#L130

robertvanwinkle1138 avatar Nov 10 '23 09:11 robertvanwinkle1138

@benwtrent:

if I am reading the code correctly, it does the following:

  • Write int8 quantized vectors along side the vector ordinals in the graph (.vex or whatever has a copy of each vector).
  • Continue to write vectors in .vec. I am guessing this is a stop-gap and you are thinking of removing this? Maybe not?

The implementation will write the vectors at their requested encoding/quantization level into the graph. So if your vectors are float[]s with no quantization, float[]s go into the graph. If your vectors are byte[]s or float[]s with quantization, byte[]s go into the graph. If you do enable quantization, it keeps the full fidelity vectors around on the side for good measure, otherwise it only keeps a copy in the graph.

I have a concern around index sorting. How does building the graph & subsequent getFloatVectors() play with index sorting? Usually when folks sort an index, they expect to be able to iterate values in that given order. Is it horrifically slow to iterate .vex in this scenario?

What do we think about always keeping .vec around? Probably should for re-ranking purposes once more extreme quantization measures are used.

Interesting re: sorting. I hadn't given that much deep thought yet. To level-set a little on how I'm planning on approaching this, I think it's still a bit of an open question what performance is theoretically possible, and what type of index could possibly get us there.

I'm trying first to get a sense of the possibilities there, so my main focus first is on measuring the single segment performance in order to rule out ideas not worth pursuing before investing too much time in them.

If a single segment index works well, I think the next step would probably be to make sure it would work well with concurrent query execution over multiple segments, and then to start thinking about how it could be productionalized (sorted indexes, merges, etc).

Thoughts on this approach?

One more question, have you tested your implementation in the situation where .vex cannot all be paged into memory and it faired ok?

I've just got some first results for larger than memory performance, will post a separate comment.

kevindrosendahl avatar Nov 11 '23 01:11 kevindrosendahl

I've got my framework set up for testing larger than memory indexes and have some somewhat interesting first results.

TL;DR:

  • the main thing driving jvector's larger-than-memory performance is keeping product-quantized vectors in memory, which avoids the need for I/O to do distance calculations. This is reflected in ~24x less page faults, and a similar improvement in query latency
  • for this dataset, introducing PQ does not have a negative effect on recall until reaching a factor of 16, after which recall begins to drop off. recall actually slightly improves with each factor increase up to 4, and does not materially change at 8
  • storing the vectors inline in the graph does not seem to have a large impact on larger than memory performance
  • other differences like having a cache of neighbors close to the entry point or storing offsets in memory vs having consistent offsets are marginal differences, and don't account for the order of magnitude improvement

Next steps:

  • measure lucene performance with scalar quantization
  • incorporate PQ into the lucene vamana implementation
  • explore SPANN to see how it performs relative to these implementations

Setup

I used the Cohere/wikipedia-22-12-es-embeddings (768 dimensions) with L2 similarity. I split the dataset into 10,000 vectors randomly sampled from the dataset as the test set, with the remaining 10,114,929 vectors as the training set.

I built and ran the query tests on a c7gd.16xlarge (128 GB memory, 64 threads. Thank you for the addition of concurrent merging, it helped speed up the builds drastically!). The queries with restricted system memory were done by purging the host page cache, then running in a docker container with -m 8g a Coretto Java 21 JVM with -Xms4g and -Xmx4g and the Panama vector API enabled, and with the dataset and indexes bind mounted into the container. There is a warmup phase of 2 iterations of the 10,000 test vectors followed by 3 measured iterations of the 10,000 vectors, all iterations running 48 queries at a time with a single thread per query.

Note that the indexes aren't quite apples-to-apples, as the Lucene HNSW index configuration (maxConn: 16, beamWidth: 100) achieves 0.8247 recall (and slightly better latency when in memory) whereas both Vamana configurations (M: 32, beamWidth: 100, alpha: 1.2) achieve ~0.91 recall, but the broad strokes are obvious enough to draw some conclusions.

Index size

configuration size breakdown
lucene hnsw 31.42 GB 31.07 GB vectors + 0.35 GB graph
lucene vamana 31.73 GB 31.73 GB graph with vectors
jvector pq=0 32.45 GB 32.45 GB graph with vectors
jvector pq=2 36.33 GB 32.45 GB graph with vectors + 3.88 GB quantized vectors
jvector pq=4 34.39 GB 32.45 GB graph with vectors + 1.94 GB quantized vectors
jvector pq=8 33.42 GB 32.45 GB graph with vectors + 0.97 GB quantized vectors
jvector pq=16 32.93 GB 32.45 GB graph with vectors + 0.48 GB quantized vectors

Queries fully in memory (124 GB of RAM available)

configuration recall average duration
lucene hnsw 0.8247 0.00187s
lucene vamana 0.9086 0.00257s
jvector pq=0 0.9108 0.00495s
jvector pq=2 0.9109 0.00259s
jvector pq=4 0.9122 0.00291s
jvector pq=8 0.9121 0.00148s
jvector pq=16 0.8467 0.0012s

Queries with 8 GB system memory, 4 GB heap

configuration recall average duration average page faults total page faults
lucene hnsw 0.8247 2.950s 1464.65 4.39395E7
lucene vamana 0.9086 3.122s 1662.26 4.9867932E7
jvector pq=0 0.9108 3.651s 1716.03 5.1481117E7
jvector pq=2 0.9109 0.127s 69.65 2089449
jvector pq=4 0.9122 0.131s 68.94 2068274
jvector pq=8 0.9121 0.119s 70.35 2110418
jvector pq=16 0.8467 0.126s 72.64 2179330

Conclusions

conclusion reasoning
storing the vectors inline in the graph does not seem to have a large impact on larger than memory performance lucene hnsw and lucene vamana performance is in the same order of magnitude with simlar numbers of page faults
jvector's neighbor cache and consistent offsets are not large determinants in improving larger than memory performance lucene vamana (which has no cache and in-memory offsets) and jvector vamana pq=0 performance is in the same order of magnitude
pq is the largest determinant in improving performance jvector pq=2 performance is ~28x better than jvector pq=0 performance, and all pq=(n != 0) performance is similar
introducing pq does not immediately impact recall on this dataset recall actually improves when introducing pq, and only starts to decrease at a factor of 16

kevindrosendahl avatar Nov 11 '23 01:11 kevindrosendahl

I've got my framework set up for testing larger than memory indexes and have some somewhat interesting first results.

Thank you for setting this up @kevindrosendahl -- these are very interesting results. It's great that PQ goes quite a ways before hurting recall.

mikemccand avatar Nov 13 '23 14:11 mikemccand

Thank you @kevindrosendahl this does seem to confirm my suspicion that the improvement isn't necessarily due to the data structure, but due to quantization. But, this does confuse me as Vamana is supposedly good when things don't all fit in memory. It may be due to how we fetch things (MMAP). I wonder if Vamana is any good at all when using MMAP...

For your testing infra, int8 quantization might close the gap at that scale. FYI, as significant (and needed) refactor occurred recently for int8 quantization & HNSW, so your testing branch might be significantly impacted :(.

recall actually improves when introducing pq, and only starts to decrease at a factor of 16

I am surprised it decreases as the number of sub-spaces increases. This makes me thing that JVector's PQ implementation is weird.

Or is pq= not the number of subspaces, but vectorDim/pq == numberOfSubSpaces? If so, then recall should reduce as it increases.

Regardless, is there any oversampling that is occurring when PQ is enabled in JVector?

It's great that PQ goes quite a ways before hurting recall.

PQ is a sharp tool, at scale it can have significant draw backs (eventually you have to start dramatically oversampling as centroids get very noisy). Though, I am not sure there is a significant recall cliff.

Two significant issues with a Lucene implementation I can think of are:

  • Segment merge time: We can maybe do some fancy things with better starter centroids in Lloyd's algorithm, but I think we will have to rerun Lloyd's algorithm on every merge. Additionally the graph building probably cannot be done with the PQ'd vectors.
  • Graph quality: I don't think we can build the graph with PQ'd vectors and retain good recall. Meaning at merge time, we have to page in larger raw (or differently quantized) vectors and only do PQ after graph creation.

There are interesting approaches to PQ w/ graph exploration and a different PQ implementation via Microsoft that is worthwhile OPQ

benwtrent avatar Nov 13 '23 15:11 benwtrent

recall actually improves when introducing pq, and only starts to decrease at a factor of 16

I would guess that either there is a bug or you happen to be testing with a really unusual dataset. PQ is fundamentally a lossy compression and can't magically create similarity that didn't exist in the original.

Regardless, is there any oversampling that is occurring when PQ is enabled in JVector?

Today it's up to the caller (so on the Cassandra side, in Astra) but it's possible that it should move into JVector.

Additionally the graph building probably cannot be done with the PQ'd vectors.

I suppose it's not impossible that you could compress first and then build but I have not seen anyone do it yet. JVector follows DiskANN's lead and builds the graph using uncompressed vectors.

jbellis avatar Nov 13 '23 16:11 jbellis

@benwtrent

Thank you @kevindrosendahl this does seem to confirm my suspicion that the improvement isn't necessarily due to the data structure, but due to quantization. But, this does confuse me as Vamana is supposedly good when things don't all fit in memory. It may be due to how we fetch things (MMAP). I wonder if Vamana is any good at all when using MMAP...

Yeah, upon reflection the result makes sense. DiskANN only uses the in-memory PQ vectors for the distance calculations for deciding which new candidates it should consider visiting in the future. This is the critical piece which reduces I/O. The in-graph vectors mean that when you have decided the next candidate, you get the full fidelity vector for free on the I/O to get the adjacency list, but at that point you've already done the distance calculation against that candidate. If you were to grab the full fidelity vector from the graph for the distance calculation like the non-PQ vamana implementations do, you've moved the I/O complexity back from O(candidates_explored) to O(nodes_considered), and probably need to fetch the page again when you've decided to actually consider a candidate to re-grab it's adjacency list.

What that full fidelity vector is useful for is re-scoring, so you can just keep that full fidelity vector in memory until the end of the search (jvector reference) and resort the final result list using their true distances (jvector reference).

I'm not sure how impactful this re-ranking is, hoping to A/B it when I get PQ working, but assuming it's impactful, the index structure could save up to numCandidates I/Os. In this case jvector was only doing ~70 I/Os, so adding up to 100 on top could be a pretty big deal.

The main thing mmap may prevent us from "easily" doing is parallelizing the I/O, which I believe the DiskANN paper does but jvector does not currently (possibly due to the results looking pretty decent without it, and it being hard with mmap/Java). Out of curiosity I may try to madvise with MADV_WILLNEED, but not holding my breath there.

Or is pq= not the number of subspaces, but vectorDim/pq == numberOfSubSpaces? If so, then recall should reduce as it increases.

This is correct, pq in the tables relates to the pqFactor in JVector, which is as you've described. Agreed with you and @jbellis that the observed result is bizarre.

For your testing infra, int8 quantization might close the gap at that scale.

I ran some tests with int8 quantization, nothing very surprising in the results. If you can get the full index to fit in memory it performs great, but performance degrades significantly once it falls out of memory proportional to I/O. The scalar quant vectors were 7.3 GB with the HNSW graph being 329 MB. For reference, jvector pqFactor=8 was 32.45 GB graph with vectors + 0.97 GB quantized vectors.

index configuration memory configuration average recall average latency average page faults
hnsw scalar quant fully in memory 0.79819 0.001259s 0
jvector pqFactor=8 fully in memory 0.9121 0.00148s 0
hnsw scalar quant 10GB system 2GB heap 0.79819 0.00119s 0
hnsw scalar quant 8GB system 4GB heap 0.79819 0.856s 606.98
jvector pqFactor=8 4GB system 2GB heap 0.9121 0.151s 80.15

So fundamentally with these graph structures, the key is to have the vectors needed for distance calculations of potential neighbor candidates in memory. Scalar quantization helps the cause here by reducing the size of the vectors by 4. PQ can help even more by even more drastically reducing the memory impact. JVector further ensures that the quantized vectors are in memory by storing them on the heap.

FYI, as significant (and needed) refactor occurred recently for int8 quantization & HNSW, so your testing branch might be significantly impacted :(.

Is there any expected performance difference, or is it mainly organizational? The code in my branch is not in a state I would be comfortable suggesting for upstreaming, so if it's "just" a problem for that, I'm ok to keep my branch a bit outdated, and we could make any suitable ideas fit in if/when we thought upstreaming could be worthwhile.

Two significant issues with a Lucene implementation I can think of are:

  • Segment merge time: We can maybe do some fancy things with better starter centroids in Lloyd's algorithm, but I think we will have to rerun Lloyd's algorithm on every merge. Additionally the graph building probably cannot be done with the PQ'd vectors.
  • Graph quality: I don't think we can build the graph with PQ'd vectors and retain good recall. Meaning at merge time, we have to page in larger raw (or differently quantized) vectors and only do PQ after graph creation.

There are interesting approaches to PQ w/ graph exploration and a different PQ implementation via Microsoft that is worthwhile OPQ

Yeah, seems like for larger-than-memory indexes, some form of clustering or quantization is going to be a must have. I do really want to try out SPANN as well. It seems analogous to lucene's existing FST-as-an-index-to-a-postings-list design, and my hunch is that there may be more tricks you can play at merge time to reduce the number of times you need to re-cluster things and potentially the computational complexity. It's also just a lot simpler conceptually.

As an aside, I'm not sure I'll have too much time to devote to this this week, but hoping to continue making forward progress.

If anyone has any other datasets ranging from ~30-50 GB that could be useful for testing that would be helpful too.

kevindrosendahl avatar Nov 13 '23 18:11 kevindrosendahl