lucene
lucene copied to clipboard
Add higher quantization level for kNN vector search
Description
There have been multiple discussions around supporting higher vector quantization levels (e.g. more than scalar quantization can provide).
We (elastic) have been doing some research around what the best next step will be for higher levels of quantization in Lucene.
A couple of months ago we came across an interesting paper: https://arxiv.org/abs/2405.12497 and have been actively experimenting with RaBitQ.
Its approach to quantization fits well with Lucene:
- Higher levels of compression (32x for floating point)
- Very cheap quantization costs
- Fast vector similarities
- Lower levels of reranking required
In our investigations we have found:
- RaBitQ achieves the same or better recall at the same compression ratios as Product Quantization.
- RaBitQ is 20-30x faster at quantizing
- RaBitQ (in Java) is more than 2x faster to query than PQ (can likely be made faster, where PQ cannot really...)
Let me share some PoC numbers. All our testing with PQ assumed a 32x reduction. Higher levels of reduction then require more reranking, which then becomes the main bottleneck. Our goal was to achieve good recall at only 5x reranking and without Optimized PQ (which is insanely expensive, the PQ already is very expensive).
30M cohereV3 (on a GCP intel machine avx512):
- RaBitQ:
- Time to quantize vectors:
998363ms
- Avg Brute-force query time for Recall@10:50
1776ms
with 98% recall - Time to build HNSW graph: 40043229ms
- HNSW recall@10:50: 86.2% at 1641qps
- Time to quantize vectors:
- PQ:
- Time to calculate codebooks:
472260ms
- Time to quantize vectors:
12644293ms
- Avg brute-force query time for Recall@10:50
5790ms
with recall 98% recall. - WE didn't bother building the large graph here as, frankly, its PQ is just too slow.
- Time to calculate codebooks:
522k Quora encoded with e5Small (noted for being horrible with binary quantization & higher compression ratios in general) (On my macbook, Arm M1-MAx)
- RaBitQ:
- 1051ms to quantize vectors
- Avg Brute-force query time for Recall@10:50
11ms
with 99% recall
- PQ:
- 28360ms to build the code books
- 31364ms to encode the docs
- Avg Brute-force query time for Recall@10:50
19ms
with 99% recall
Consequently, given all the various trade-offs and benefits, we are working on bringing a Rabitq format into Lucene.
I will open a draft PR soon to make the work more visible.