
Add Query for reranking KnnFloatVectorQuery with full-precision vectors

Open dungba88 opened this issue 11 months ago • 32 comments

Description

fixes #13564

Added a new Query which wraps around KnnFloatVectorQuery and does re-ranking for a quantized index using full-precision vectors. The idea is to first run KnnFloatVectorQuery with an over-sampled k (x1.5, x2, x5, etc.), then re-rank the docs using the full-precision (original, non-quantized) vectors, and finally take the top-k.

Questions:

  • Should we expose the target inside KnnFloatVectorQuery so that users don't need to pass the target twice? Currently it only exposes getTargetCopy(), which requires an array copy and is thus inefficient, but I assume the intention is to encapsulate the array so that it can't be modified from outside?
  • Maybe out of scope for this PR, but I'm curious what people think about using mlock to prevent the quantized vectors from being swapped out, as loading fp vectors (although only a small set per query) means there will be more pressure on RAM.

Usage:

// Create the inner KnnFloatVectorQuery with some over-sampled k (e.g. 1.5x-5x the final k)
KnnFloatVectorQuery knnQuery = ...;
// Wrap it so the over-sampled hits are re-ranked against the full-precision vectors, keeping top-k
RerankKnnFloatVectorQuery query = new RerankKnnFloatVectorQuery(knnQuery, targetVector, k);
TopDocs topDocs = searcher.search(query, k);
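
For context, here is a minimal sketch of what the re-ranking amounts to (illustrative only, not the PR's actual code; it assumes the Lucene 10 FloatVectorValues/KnnVectorValues iterator API, and it re-opens the vector values per hit, which a real implementation would avoid):

// Re-score the over-sampled hits with full-precision vectors, then keep top-k.
TopDocs oversampled = searcher.search(knnQuery, oversampledK);
List<LeafReaderContext> leaves = searcher.getIndexReader().leaves();
ScoreDoc[] rescored = new ScoreDoc[oversampled.scoreDocs.length];
int i = 0;
for (ScoreDoc hit : oversampled.scoreDocs) {
  LeafReaderContext ctx = leaves.get(ReaderUtil.subIndex(hit.doc, leaves));
  // getFloatVectorValues exposes the original full-precision vectors,
  // even when the index itself is quantized.
  FloatVectorValues values = ctx.reader().getFloatVectorValues("vec");
  KnnVectorValues.DocIndexIterator it = values.iterator();
  it.advance(hit.doc - ctx.docBase);             // segment-local doc id
  float[] raw = values.vectorValue(it.index());
  rescored[i++] = new ScoreDoc(hit.doc, similarity.compare(raw, targetVector));
}
Arrays.sort(rescored, (a, b) -> Float.compare(b.score, a.score));
// take rescored[0..k) as the final top-k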

dungba88 avatar Nov 22 '24 00:11 dungba88

The build fails with "The import org.apache.lucene.codecs.lucene100 cannot be resolved"; I thought this was already in mainline. Will check.

Edit: It has been moved to backward codecs. Will use something more stable.

dungba88 avatar Nov 22 '24 00:11 dungba88

I have a preliminary benchmark here (top-k=100, fanout=0) using the Cohere 768 dataset.

[benchmark screenshot]

Anyhow, I can see two things that should be addressed:

  • If we access the full-precision vectors, they will compete for the memory allocated (either through preloading or through mmap) to the quantized vectors used in the main search phase when there's not enough memory. Eventually, some percentage of the quantized index will be swapped out, which will slow the search. If we have to load all full-precision vectors into memory, that kind of defeats the purpose of quantization. I'm wondering if there could be a way to access full-precision vectors without interfering with the space of the quantized vectors.
  • The latency could be better. With oversample=1.5 (second dot) for 4_bit, we have around the same latency and recall as the baseline. One can argue that we still save memory compared to the baseline, but with the new two-phase access pattern that saving might be diminished. Otherwise it seems to have little benefit over just using plain HNSW.

dungba88 avatar Nov 26 '24 06:11 dungba88

Edit: My previous benchmark was wrong because the vectors were corrupted.

The first benchmark shows the recall improvement for each oversample with reranking. It now aligns with what was produced in https://github.com/apache/lucene/pull/13651.

[screenshot: recall improvement per oversample factor, with reranking]

The second benchmark compares the latency across all algorithms. We are still adding only a small amount of latency for the reranking phase.

[screenshot: latency comparison across algorithms]

For the last benchmark, I just ran oversample without reranking, but still cut off at the original k (so it acts similarly to fanout). This is just to make sure that the reranking phase actually adds value. As expected, the recall does not improve much compared to reranking.

[screenshot: recall with over-sampling but no reranking]

NOTE: The dots in all benchmarks represent the oversample factor, with values of 1, 1.5, 2, 3, 4, 5. An oversample of 1 means no over-sampling. See https://github.com/mikemccand/luceneutil/blob/main/src/main/knn/KnnGraphTester.java#L833-L834

dungba88 avatar Nov 27 '24 03:11 dungba88

Also, this is the luceneutil branch I used for benchmarking: https://github.com/dungba88/luceneutil/tree/dungba88/two-phase-search, which incorporates the test for the BQ implementation by @benwtrent and the two-phase search.

dungba88 avatar Nov 27 '24 23:11 dungba88

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions[bot] avatar Dec 14 '24 00:12 github-actions[bot]

I think this is a nice overall approach, adding a new RerankKnnFloatVectorQuery that wraps a KNN query that used quantization to get the initial results.

It's reminiscent of Lucene's existing QueryRescorer, to implement multi-phased ranking, except that class doesn't wrap another Query... maybe it should (separately)!

I'm curious about your results here -- why is recall better for 1bit and 4bit than 7bit, when reranking?

mikemccand avatar Apr 30 '25 12:04 mikemccand

I'm curious about https://github.com/apache/lucene/pull/14009#issuecomment-2502665806 -- why is recall better for 1bit and 4bit than 7bit, when reranking?

The graph is a bit confusing, but the dots are the oversample factors (from 1 to 5). If we compare recall at the same oversample, then 7-bit is always better or the same. The difference becomes smaller at higher oversample: e.g., at oversample=1, 7-bit has 20% higher recall than 1-bit, but at oversample=5 they are mostly the same.

dungba88 avatar May 05 '25 03:05 dungba88

The graph is a bit confusing, but the dots are the oversample factors (from 1 to 5).

This is the piece I was missing! Can we edit the benchmark comment and add a line saying each dot represents an oversample factor (from 1 to 6, I count 6 dots)?

So 4x oversample with 1_bit outperforms 2x oversample with 7_bit, in both latency and recall. And 3x oversample with 4_bit has better recall and latency than 7_bit with no oversample?

Sorry I'm late to the party. Will look at the PR and benchmark more closely and try to think about the latency laggards.

vigyasharma avatar May 18 '25 18:05 vigyasharma

Thanks @vigyasharma for the comment! I updated the comment to make it less confusing.

I'll think about generalization, but the idea is that as long as the field can expose its float vectors through getFloatVectorValues(fieldInfo), it can be used with this Query? Or do you think we need to abstract/generalize this part as well?

utilizing the scorer from the inner query

I can't use the inner scorer here, because for a quantized codec it will be the quantized vector scorer, which only uses the quantized vectors.

dungba88 avatar May 21 '25 00:05 dungba88

@dungba88 : I shared an alternate implementation for RerankVectorQuery that I feel can generalize more broadly.

Main Differences:

  1. While the primary use case would be to rerank results from a quantized KnnFloatVectorQuery against full-precision vectors, this approach has no tight coupling. Users could provide any inner query and rerank against vectors stored in any field.
  2. This can, in the future, be extended to support reranking using Late Interaction models stored in a different field. We just need to add some checks to ensure the multi-vectors in the provided field are compatible with the provided target. This idea has been floated here earlier.
  3. Maintains a regular query flow of keeping the heavier computation in Weight and Scorer, instead of heavy matching and scoring logic in rewrite (see the sketch after this list).
  4. This should implicitly use any slicing or per-leaf parallel execution provided by the searcher. We don't need to re-implement logic for per-leaf threads.
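
To make the shape of the alternative concrete, here is a hedged sketch of what the per-leaf scorer could look like (the class and constructor are illustrative, not the actual code from the alternate implementation; it assumes the Lucene 10 Scorer and KnnVectorValues APIs):

final class RerankScorer extends Scorer {
  private final Scorer in;                       // scorer of the wrapped query
  private final FloatVectorValues values;        // rerank vectors for this leaf
  private final KnnVectorValues.DocIndexIterator vectorDocs;
  private final float[] target;
  private final VectorSimilarityFunction similarity;

  RerankScorer(Scorer in, FloatVectorValues values, float[] target,
      VectorSimilarityFunction similarity) {
    this.in = in;
    this.values = values;
    this.vectorDocs = values.iterator();
    this.target = target;
    this.similarity = similarity;
  }

  @Override
  public int docID() {
    return in.docID();
  }

  @Override
  public DocIdSetIterator iterator() {
    return in.iterator(); // the match set is unchanged; only scores differ
  }

  @Override
  public float score() throws IOException {
    // Align the vector iterator with the current match, then score against
    // the full-precision (or other rerank) vector instead of the inner score.
    if (vectorDocs.docID() < in.docID()) {
      vectorDocs.advance(in.docID());
    }
    return similarity.compare(values.vectorValue(vectorDocs.index()), target);
  }

  @Override
  public float getMaxScore(int upTo) throws IOException {
    return Float.POSITIVE_INFINITY; // no tighter bound in this sketch
  }
}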

Please know that I don't mean to override your work; I just needed to write it down to clarify my ideas. I'm happy to collaborate on whatever direction makes the most sense.

vigyasharma avatar May 22 '25 07:05 vigyasharma

No worries, generalization makes sense to me too. And I like the idea of moving this forward.

The use of KnnFloatVectorQuery in this implementation only adds the small benefit that we can short-circuit when there are fewer results than k, which is marginal to be honest. Totally agree that we can make it a generic Query.

I'm not sure about the weight part though. Asking just for my own understanding of Lucene: I thought Lucene typically rewrites the Query first, and then calls createWeight on the simplest form. Here we are doing the reverse (createWeight, and inside that call, rewrite). Can this break any assumptions (e.g. caching) in Lucene?

One (subtle) difference I found with the alternative implementation is that we no longer limit the final results to k, so if the inner query is doing over-sampling, the rerank query will return the full over-sampled results as well. No strong opinion on which is preferable, but usually we would need to reduce the number of results for a KNN query down to some size.

dungba88 avatar May 22 '25 12:05 dungba88

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR.

github-actions[bot] avatar May 22 '25 13:05 github-actions[bot]

I changed the query to be generic but kept everything else the same. (I didn't have JDK 24 yet, so the build would fail.)

dungba88 avatar May 22 '25 13:05 dungba88

I haven't dug in deep yet, but along the lines of further generality, would it make sense to accept a DoubleValuesSource (that can compute dot products against a vector field) rather than limiting this to vector dot products only?

msokolov avatar May 22 '25 19:05 msokolov

Lucene typically rewrites the Query first, and then createWeight on the simplest form.

That's a good callout. I'll override the rewrite method to rewrite the inner query. That will keep the query cacheable. If the inner query's rewrite is only called in createWeight, then each weight can get a different set of kNN hits, which goes against caching.
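
Something along these lines, as a sketch (RerankVectorQuery and its fields are placeholder names for illustration, not the actual classes):

@Override
public Query rewrite(IndexSearcher searcher) throws IOException {
  // Rewriting the inner query up-front runs the kNN search once and keeps
  // the outer query stable, so the query cache sees a consistent hit set.
  Query rewrittenInner = searcher.rewrite(in);
  if (rewrittenInner == in) {
    return this;
  }
  return new RerankVectorQuery(rewrittenInner, field, target, similarity);
}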

we no longer limit the final results by K, so if the inner queries is doing over-sampling, the rerank queries will return the full over-sampled results as well.

This gets handled by the k you pass to searcher.search() and the top docs collector you use. It's the same as any other query.

would it make sense to accept a DoubleValuesSource (that can compute dot products against a vector field) rather than limiting this to vector dot products only

Interesting idea, let me see what that would look like.

vigyasharma avatar May 22 '25 20:05 vigyasharma

Is it this? https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/FunctionScoreQuery.java

msokolov avatar May 22 '25 20:05 msokolov

This gets handled by the k you pass to searcher.search(), and the top docs collector you use. It's same as any other query.

I guess the main difference is when the results get trimmed down. With the alternative implementation, we leave it up to the final hit collection to return the top-k based on some ranking mechanism (as this Query is usually just part of the final Query we send to IndexSearcher), while this implementation of the Query would output the top-k on its own. If the final score is determined independently of the score returned by this Query, then the re-ranking we did would be wasted (and we could just over-sample and let the final scorer do the work)?

With that being said, I have no strong opinion. I'm wondering what would make sense for the actual use cases it will serve (quantization and late-interaction models). We could also use a special value (like a negative number) to indicate that the results should not be trimmed, if we need both behaviors.

dungba88 avatar May 23 '25 01:05 dungba88

For the alternative implementation we leave that up to the final hit collection to return the top-k based on a ranking mechanism

Those final TopDocs collectors will use the score generated by the Rerank query, which is the rerank mechanism we are implementing.

I'm not really sure I follow the concern here. In both cases, you need to score all the docs to find the top-k. In this one, you do it during rewrite and maintain your own HitQueue. In the other, you do it in the Scorer and leverage the TopDocs collector (or whatever collector is needed). The alternate approach is more flexible; you can vary how many hits you want to work with (up to the oversample limit). But computationally, it's not more expensive.

vigyasharma avatar May 23 '25 04:05 vigyasharma

Let me rephrase it better. I'm not worried about the computation cost; as you mentioned, it is the same in both cases.

At Amazon (where both Vigya and I work), we usually run KnnFloatVectorQuery along with keyword matching (such as TermQuery) disjunctively, together with other constraint filtering (see the sketch at the end of this comment). We don't use the scoring of the KnnFloatVectorQuery or the TermQuery (we wrap them with ConstantScoreQuery); after we get all the hits from both KNN and keyword matching, we use our own scorer to compute the final top-k score. One of the things I'm hoping to do here is, for KnnFloatVectorQuery with low-bit quantization, get better recall by over-sampling, re-ranking with the full-precision vectors, and cutting the hits off at the original top-k. The end result is that the KnnFloatVectorQuery still outputs top-k results, but with better recall due to the over-sampling and reranking. Moreover, we don't have to rely on the final scorer to remove/downrank defects that may be introduced by the over-sampling. If we keep the full over-sampled set during the Query match phase and only take the final top-k when collecting hits at the end, as in the alternative approach, then:

  • Wouldn't Scorer.score() be skipped/not called at all due to the ConstantScoreQuery, making the RerankQuery essentially a no-op? Asking just for my education.
  • It would not be possible (due to how Weight/Scorer are used) to limit the hits to a fixed k for the RerankQuery alone, before applying it conjunctively/disjunctively with other Queries (similar to how AbstractKnnVectorQuery works), would it? Then it might not be possible to apply it to this use case.

If we allowed a way to flexibly choose whether or not to trim the results from the RerankQuery, do you think that would be better and serve both use cases we have here?
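
For reference, an illustrative sketch of the hybrid query shape described above (field and term names are made up):

// Both clauses are score-flattened with ConstantScoreQuery; our own scorer
// ranks the union later, so the KNN clause must already be trimmed to its
// true top-k.
Query knn = new ConstantScoreQuery(
    new RerankKnnFloatVectorQuery(
        new KnnFloatVectorQuery("vec", target, 2 * k), // over-sampled k
        target, k));                                   // rerank, trim back to k
Query lexical = new ConstantScoreQuery(new TermQuery(new Term("title", "shoes")));
Query hybrid = new BooleanQuery.Builder()
    .add(knn, BooleanClause.Occur.SHOULD)
    .add(lexical, BooleanClause.Occur.SHOULD)
    .build();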

dungba88 avatar May 23 '25 06:05 dungba88

Is it this? https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/FunctionScoreQuery.java

@msokolov : Gosh, you're right! This is pretty much what we need. Just pass it a DoubleValuesSource for full-precision vector scoring, along with the query you want to wrap.

I noticed we already had DoubleValuesSources for vector similarity, but they didn't compute full-precision scores for quantized vectors. I've added that support here – #14708.

Perhaps we could also add some syntactic sugar on FunctionScoreQuery for this specific use case, maybe a static getter that uses these DoubleValuesSources to return a query that reranks by full-precision vector similarity.
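
A rough sketch of that usage (assuming the existing FloatVectorSimilarityValuesSource in the queries module, which takes the query vector and a field name):

// FunctionScoreQuery replaces the wrapped query's scores with values from
// the given DoubleValuesSource -- here, full-precision vector similarity.
Query knn = new KnnFloatVectorQuery("vec", target, oversampledK);
Query reranked = new FunctionScoreQuery(
    knn, new FloatVectorSimilarityValuesSource(target, "vec"));
TopDocs topDocs = searcher.search(reranked, k);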

vigyasharma avatar May 24 '25 08:05 vigyasharma

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions[bot] avatar Jun 08 '25 00:06 github-actions[bot]

I'm thinking that instead of using vectorValues(...), we can probably use a DoubleValuesSource, per @msokolov's suggestion, for a generic rescorer. Then we can use a scorer over another (vector) field for re-ranking. This could help with the use case of, e.g., using 1-bit quantization for matching and 7-bit quantization for re-ranking, by indexing an additional field.

Either way, we would not be able to reuse FunctionScoreQuery, because we are using ConstantScoreQuery and we need to trim the matches returned from KNN alone down to just k.

dungba88 avatar Jun 12 '25 09:06 dungba88

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

github-actions[bot] avatar Jun 13 '25 00:06 github-actions[bot]

I published another revision to support a custom DoubleValuesSource for re-ranking; we can reuse the FloatVectorSimilarityValuesSource in case we want to rescore with another field, or the FullPrecisionFloatVectorSimilarityValuesSource from https://github.com/apache/lucene/pull/14708.
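
A hedged usage sketch of the revised API (the constructor shape and the second vector field here are assumptions for illustration):

// Re-rank against vectors from a different field (e.g. a 7-bit quantized
// copy); the DoubleValuesSource decides where the rerank scores come from.
DoubleValuesSource rerankSource =
    new FloatVectorSimilarityValuesSource(target, "vec_7bit"); // hypothetical field
Query query = new RerankKnnFloatVectorQuery(knnQuery, rerankSource, k);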

I also checked the Rescorer in https://github.com/apache/lucene/pull/14776, but it seems that it only affects the final results, and may not work when there is a hybrid of keyword and semantic (KNN) matching and we only need to re-score the KNN part.

dungba88 avatar Jun 13 '25 01:06 dungba88

Thanks for the explanations, @dungba88. I suppose the scenario you're trying to solve for is when users want to change the matchset of a KnnVectorQuery using full-precision or other reranking. I'm open to this change if it's a valid requirement.

My perception is that it should be solvable by plumbing vector query results into a rescorer, then combining the top-k hits with other (lexical) hits. For example, is this also a problem for hybrid search in OpenSearch/Elasticsearch? I suspect they might have independent queries for both, with some way to combine results.

vigyasharma avatar Jun 20 '25 19:06 vigyasharma

is when users want to change the matchset of a KnnVectorQuery using full-precision or other reranking

Yes, that's correct, @vigyasharma. We are using a hybrid search where KnnFloatVectorQuery and TermQuery (among others) are combined into a single BooleanQuery. Thus it is important to be able to change the matchset of the KnnFloatVectorQuery individually.

For hybrid search in OpenSearch/Elasticsearch, I'm wondering if @jmazanec15 and @benwtrent have any input. I have a feeling it's quite common to combine lexical + KNN matching into a single BooleanQuery.

dungba88 avatar Jun 20 '25 22:06 dungba88

kNN queries are completed in the rewrite phase; if any rescoring needs to be done, it should be done during that phase.

I would expect the experience to be:

RescoreQueryWithVectorQueryThingy(KnnQuery), where the rescore occurs during rewrite (or at least provides a scorer that iterates the kNN query results, calculating the higher-fidelity scores).

kNN should be "just another query" and should be combinable with any other query. I realize this is a bit tricky, as kNN is unique in that it effectively "collects" its results up-front.

benwtrent avatar Jun 23 '25 12:06 benwtrent

Sorry for spamming the replies! I should have gone to the Files changed tab, which allows sending all replies in the same message.

dungba88 avatar Jun 24 '25 14:06 dungba88

For hybrid search in OpenSearch/Elasticsearch, I'm wondering if @jmazanec15 and @benwtrent have any input. I have a feeling it's quite common to combine lexical + KNN matching into a single BooleanQuery.

Not sure I'm following the discussion completely. It is common to combine lexical and k-NN in a boolean query, but I think there is a lot of variety in what/how users implement hybrid search, so flexibility is great to have. Like @benwtrent mentioned, result computation for k-NN is done upfront, but queries can also be used to re-score after the initial phase via the QueryRescorer (as @mikemccand mentioned a while ago), so I don't think a separate rescorer is necessary.

I also like the point around lazy iteration: "or at least provide a scorer that iterates the kNN query results calculating the higher fidelity scores". For expensive re-scoring (which I think multi-vector will be), this might be nice to have for hybrid/boolean queries too - I think this approach is taken in FloatVectorSimilarityQuery. But this can probably be left for future consideration.

jmazanec15 avatar Jun 24 '25 15:06 jmazanec15