
Add higher quantization level for kNN vector search

Open · benwtrent opened this issue 1 year ago

Description

There have been multiple discussions around supporting higher vector quantization levels (e.g. more than scalar quantization can provide).

We (Elastic) have been researching what the best next step would be for higher levels of quantization in Lucene.

A couple of months ago we came across an interesting paper: https://arxiv.org/abs/2405.12497 and have been actively experimenting with RaBitQ.

Its approach to quantization fits well with Lucene:

  • Higher levels of compression (32x relative to float32)
  • Very cheap quantization cost
  • Fast vector similarity computations
  • Less reranking required
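To make the 32x figure concrete, here is a toy sign-based 1-bit quantizer in Java. This is a simplified illustration only, not the actual RaBitQ algorithm, which additionally applies a random rotation and stores per-vector correction factors to sharpen the distance estimates:

```java
// Toy 1-bit quantizer: one sign bit per dimension, so a float32 vector
// (4 bytes/dim) shrinks to 1 bit/dim -- the 32x reduction mentioned above.
// NOTE: simplified sketch; real RaBitQ adds a rotation + correction terms.
public class OneBitQuantizer {
    /** Pack a float vector into 1 bit per dimension (its sign). */
    static long[] quantize(float[] v) {
        long[] bits = new long[(v.length + 63) / 64];
        for (int i = 0; i < v.length; i++) {
            if (v[i] > 0) bits[i >> 6] |= 1L << (i & 63);
        }
        return bits;
    }

    /** Hamming distance between packed codes; a cheap similarity proxy. */
    static int hamming(long[] a, long[] b) {
        int d = 0;
        for (int i = 0; i < a.length; i++) d += Long.bitCount(a[i] ^ b[i]);
        return d;
    }

    public static void main(String[] args) {
        float[] v = {0.5f, -1.2f, 3.0f, -0.1f};
        long[] code = quantize(v);
        // Dimensions 0 and 2 were positive, so bits 0 and 2 are set.
        System.out.println(Long.toBinaryString(code[0])); // prints "101"
    }
}
```

A 1024-dim float32 vector (4 KB) thus packs into 128 bytes, and comparing codes reduces to XOR plus popcount, which is why the quantized similarities are so fast.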

In our investigations we have found:

  • RaBitQ achieves the same or better recall at the same compression ratios as Product Quantization.
  • RaBitQ is 20-30x faster at quantizing
  • RaBitQ (in Java) is more than 2x faster to query than PQ (and can likely be made faster still, whereas PQ realistically cannot)

Let me share some PoC numbers. All our PQ testing assumed a 32x reduction; higher reduction levels require more reranking, which then becomes the main bottleneck. Our goal was to achieve good recall with only 5x reranking and without Optimized PQ (which is insanely expensive; plain PQ is already very expensive).
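The oversample-then-rerank pattern behind the "5x reranking" goal (retrieve k × oversample candidates with the cheap quantized score, then re-score only those with the original float vectors) can be sketched as follows. The names and signatures here are illustrative, not Lucene's actual APIs:

```java
import java.util.Arrays;

// Sketch of oversample-then-rerank: rank everything by a cheap approximate
// (quantized) score, keep the top k*oversample, then re-score just those
// with exact full-precision dot products. Illustrative only, not a Lucene API.
public class Rerank {
    static int[] search(float[] query, float[][] corpus,
                        float[] approxScores, int k, int oversample) {
        int m = Math.min(approxScores.length, k * oversample);
        // Order candidates by approximate score, descending.
        Integer[] ids = new Integer[approxScores.length];
        for (int i = 0; i < ids.length; i++) ids[i] = i;
        Arrays.sort(ids, (a, b) -> Float.compare(approxScores[b], approxScores[a]));
        // Rerank only the top-m candidates with exact dot products.
        Integer[] top = Arrays.copyOf(ids, m);
        Arrays.sort(top, (a, b) -> Float.compare(dot(query, corpus[b]), dot(query, corpus[a])));
        int[] result = new int[Math.min(k, m)];
        for (int i = 0; i < result.length; i++) result[i] = top[i];
        return result;
    }

    static float dot(float[] a, float[] b) {
        float s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    public static void main(String[] args) {
        // Approximate scores rank doc 1 first, but exact rerank recovers doc 0.
        int[] r = search(new float[]{1f, 0f},
            new float[][]{{1f, 0f}, {0f, 1f}, {0.9f, 0.1f}},
            new float[]{0.1f, 0.9f, 0.5f}, 1, 3);
        System.out.println("Best match: doc " + r[0]); // prints "Best match: doc 0"
    }
}
```

The higher the compression, the noisier the approximate scores, and the larger the oversample factor must be to recover recall; that is why very aggressive reduction turns reranking into the bottleneck.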

30M cohereV3 (on a GCP intel machine avx512):

  • RaBitQ:
    • Time to quantize vectors: 998363ms
    • Avg brute-force query time for Recall@10:50: 1776ms with 98% recall
    • Time to build HNSW graph: 40043229ms
      • HNSW recall@10:50: 86.2% at 1641qps
  • PQ:
    • Time to calculate codebooks: 472260ms
    • Time to quantize vectors: 12644293ms
    • Avg brute-force query time for Recall@10:50: 5790ms with 98% recall
    • We didn't bother building the large graph here because, frankly, PQ is just too slow.

522k Quora encoded with e5Small (a model noted for being horrible with binary quantization and higher compression ratios in general) (on my MacBook, ARM M1 Max):

  • RaBitQ:
    • 1051ms to quantize vectors
    • Avg brute-force query time for Recall@10:50: 11ms with 99% recall
  • PQ:
    • 28360ms to build the code books
    • 31364ms to encode the docs
    • Avg brute-force query time for Recall@10:50: 19ms with 99% recall

Consequently, given all the various trade-offs and benefits, we are working on bringing a RaBitQ-based format into Lucene.

I will open a draft PR soon to make the work more visible.

benwtrent avatar Aug 13 '24 17:08 benwtrent

This sounds super promising! There is some discussion here about PQ and PCA as well.

mikemccand avatar Aug 26 '24 14:08 mikemccand

Ben, is this what you referred to in your Elastic article? I read it and got interested, but had trouble ascertaining which Lucene version has this feature. I suppose it's unreleased? It would have been helpful if the article had clarified its status.

dsmiley avatar Dec 05 '24 09:12 dsmiley

@dsmiley I am sorry for dragging my feet here :(

It is not released yet. The PR has been open for some time, but some weird recall behaviors came up that we wanted to fix, and in the process we discovered an even better way to quantize at 32x than the original work mentioned. I expect to update the PR over the holidays and aim to land it in Lucene 10.2.0: https://github.com/apache/lucene/pull/13651

benwtrent avatar Dec 05 '24 13:12 benwtrent