lucene icon indicating copy to clipboard operation
lucene copied to clipboard

Index level caching policy is thrashed by segment-specific query rewrites

Open GovindBalaji-S-Glean opened this issue 5 months ago • 7 comments

Description

  1. Each IndexSearcher has its own UsageTrackingQueryCachingPolicy that is shared across all segments.
  2. This caching policy uses a 256-length ring buffer to keep track of recently used queries.
  3. A TermInSetQuery with rewriteMethod = MultiTermQuery.CONSTANT_SCORE_BLENDED_REWRITE yields a RewritingWeight.
  4. Getting a scorer from this RewritingWeight for a segment could involve rewriting to a BooleanQuery of multiple TermQuery with only the terms present in that particular segment - ref org.apache.lucene.search.AbstractMultiTermQueryConstantScoreWrapper.RewritingWeight#scorerSupplier
  5. Thus a single TermInSetQuery will end up thrashing the ring buffer as multiple distinct BooleanQuerys from different segments.
  6. This leads to a poor caching rate for indexes with a large number of segments.

We could verify this behavior with a new caching policy that delegates to UsageTrackingQueryCachingPolicy after logging the onUse() and shouldCache() calls.

Is there a good reason to not have this ring buffer tracking at a per segment level? That would fix this issue.

Version and environment details

Lucene 9.12.1

GovindBalaji-S-Glean avatar Jul 23 '25 10:07 GovindBalaji-S-Glean

Is there a good reason to not have this ring buffer tracking at a per segment level?

To your point, there are many segments, so this would increase heap usage. Furthermore, the typical case is that segments need to process the same query as the top-level, so this wouldn't help much in other cases than the one that your are describing.

Passing these per-segment rewrites through the cache feels wrong. The cache should never see these per-segment rewrites, only the TermInSetQuery? I believe that we could fix it by calling Query#createWeight rather than IndexSearcher#createWeight in AbstractMultiTermQueryConstantScoreWrapper.

jpountz avatar Jul 25 '25 13:07 jpountz

AbstractMultiTermQueryConstantScoreWrapper itself only calls Query#createWeight. However, ConstantScoreQuery#createWeight and BooleanQuery#createWeight call IndexSearcher#createWeight. We might have to change a lot of places this way.

To your point, there are many segments, so this would increase heap usage.

Since we are fixing length here to 256, UsageTrackingQueryCachingPolicy looks like it would be < 800 integers per instance. Is this a lot given we anyway have 1 org.apache.lucene.search.LRUQueryCache.LeafCache per segment?

GovindBalaji-S-Glean avatar Jul 28 '25 09:07 GovindBalaji-S-Glean

I agree that it may not be much. I still like fixing AbstractMultiTermQueryConstantScoreWrapper better?

jpountz avatar Jul 28 '25 12:07 jpountz

Thanks for looking into this. It looks like AbstractMultiTermQueryConstantScoreWrapper already does create the weights directly without involving the IndexSearcher - https://github.com/apache/lucene/blob/46a4fee65580a35a39b1fb4d04a5d87b3ec7d1f2/lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java#L166

However, the q here is a ConstantScoreQuery composing a BooleanQuery composing multiple TermQuery. So, even if AbstractMultiTermQueryConstantScoreWrapper does not involve the cache, ConstantScoreQuery and BooleanQuery end up doing: https://github.com/apache/lucene/blob/46a4fee65580a35a39b1fb4d04a5d87b3ec7d1f2/lucene/core/src/java/org/apache/lucene/search/ConstantScoreQuery.java#L136 and https://github.com/apache/lucene/blob/46a4fee65580a35a39b1fb4d04a5d87b3ec7d1f2/lucene/core/src/java/org/apache/lucene/search/BooleanWeight.java#L56-L60

The above two involving the cache doesn't look wrong to me since we expect subqueries of a boolean query also to be cached.

I am not sure if we can or want to pass any thing down to these ConstantScoreQuery (or BooleanQuery) such that they don't involve their wrapped (or subqueries) through the cache. Even then, it feels reasonable to allow caching subqueries of a segment specific rewrite too in the general case(In this case however, the subqueries are just TermQuery and it's ok if we don't cache)

Am I missing something else we can fix in AbstractMultiTermQueryConstantScoreWrapper here?

GovindBalaji-S-Glean avatar Jul 29 '25 11:07 GovindBalaji-S-Glean

One hack I can think of is that instead of searcher here, we send in a decorator of searcher with IndexSearcher#createWeight(query) just doing query.createWeight(). Then these per segment rewrites would not get cached. https://github.com/apache/lucene/blob/46a4fee65580a35a39b1fb4d04a5d87b3ec7d1f2/lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java#L166

What do you think? I am happy to implement.

GovindBalaji-S-Glean avatar Aug 01 '25 16:08 GovindBalaji-S-Glean

I was wondering about a variant of your idea that would consist of not reusing the provided IndexSearcher but creating a private one that doesn't cache (IndexSearcher#setQueryCache(null)). Let's discuss implementation details over a PR? This sounds like a viable direction.

jpountz avatar Aug 01 '25 18:08 jpountz