lucene icon indicating copy to clipboard operation
lucene copied to clipboard

GITHUB-11838 Add api to allow concurrent query rewrite

Open zhaih opened this issue 3 years ago • 7 comments

Description

Issue: #11838

Initial Proposal (discarded)

  • Added one public method to Query to allow passing in an executor, and default to just call single thread version
  • In IndexSearcher we always pass in the executor
  • I'm not sure whether we should add a parameter to control the behavior, I'm inclined to believe that users who passed in the executor are willing to make use of concurrent rewrite.
  • Rewritten ConstantScoreQuery so that the query executor will be passed into inner query.

Updated Proposal

  • Change signature of rewrite to rewrite(IndexSearcher)
  • How did I migrate the usage:
    • Use Intellij to do preliminary refactoring for me
    • For test usage, use searcher whenever is available, otherwise create one using newSearcher(reader)
    • For very few non-test classes which doesn't have IndexSearcher available but called rewrite, create a searcher using new IndexSearcher(reader), tried my best to avoid creating it recurrently (Especially in FieldQuery)
    • For queries who have implemented the rewrite and uses some part of reader's functionality, use shortcut method when possible, otherwise pull out the reader from indexSearcher.

zhaih avatar Oct 07 '22 18:10 zhaih

We already have one class that wraps an IndexReader and an Executor: IndexSearcher. Should this new rewrite method take an IndexSearcher instead of an IndexReader and an Executor?

jpountz avatar Oct 11 '22 07:10 jpountz

(and should it replace the existing rewrite method instead of just adding another one?)

jpountz avatar Oct 11 '22 07:10 jpountz

Hi @jpountz thank for taking a look!

We already have one class that wraps an IndexReader and an Executor: IndexSearcher. Should this new rewrite method take an IndexSearcher instead of an IndexReader and an Executor?

I think it'd be a little weird to know that query rewriting requires an IndexSearcher from a user point of view? Also for people who're not using IndexSearcher provided by lucene (like we're in LinkedIn) it will be a bit inconvenience to create an IndexSearcher just for query rewriting?

(and should it replace the existing rewrite method instead of just adding another one?)

Yes I think replace is better, I previously wanted to make it into 9.x so don't want to change the API, are we able to make it in 9.x even if with the API change?

zhaih avatar Oct 12 '22 07:10 zhaih

I don't think it'd be weird to require an IndexSearcher, Query#rewrite is essentially a way to improve caching and keep Query#createWeight simple. Given that Query#createWeight already creates an IndexSearcher, I'd assume that you already create an IndexSearcher even though you might not be using its search methods?

jpountz avatar Oct 12 '22 11:10 jpountz

@jpountz You're right, our case is a bit complex since currently we're not even using Lucene's Query (but we're planning to in the future!) so I totally forgot the createWeight also takes an IndexSearcher. Yeah now it makes sense to me to change that API to use IndexSearcher as well, I'll try to get one out these days. Thank you for the discussion!

zhaih avatar Oct 13 '22 05:10 zhaih

I think the general idea looks well. This should me Lucene 10 only, as the changes in query API may break lots of code downstream. We could theoretically add a backwards combatiility layer using VirtualMethod (says the policeman), if that's wanted, I can help. VirtualMethod utility class would be needed to figure out which of both methods (rewrite(IndexReader) or rewrite(IndexSearcher)) was overridden by a subclass downstream.

uschindler avatar Oct 14 '22 08:10 uschindler

Agreed on making this Lucene 10.

As far as bw compat is concerned, I was thinking of updating Lucene 9 this way:

  • introduce Query#rewrite(IndexSearcher) and make it delegate to Query#rewrite(IndexReader)
  • keep existing implementations as they are, except possibly the few ones where we'd like to take advantage of the executor
  • deprecate Query#rewrite(IndexReader)
  • make IndexSearcher#rewrite(Query) use Query#rewrite(IndexSearcher)

jpountz avatar Oct 14 '22 08:10 jpountz

Yeah I already made this one only on Lucene 10, I'll make a 9x PR according to Adrien's suggestion, it sounds not too complicated. (Thanks Uwe as well, good to know that we have a VirtualMethod class just for the back compatibility)

zhaih avatar Oct 15 '22 16:10 zhaih

I tried created one but then realized it seems not working, because query rewriting are rewriting query from top of the tree to the bottom so once we delegate to rewrite(IndexReader) we're not able to call the indexSearcher one in the children.

I wonder if people are ok with this only happens after Lucene 10? Or @uschindler can you help evaluate using VirtualMethod? It seems to me it is a tool detecting whether there's difference in implementation, but still we are not able to avoid losing IndexSearcher once we've called a delegated method?

zhaih avatar Oct 15 '22 17:10 zhaih

Hi @zhaih, the problem you describe is exactly why VirtualMethod is there. It allows to figure out which of both methods was overridden by a subclass at runtime. I can look into this next week. Actually it is long ago when we used VirtualMethod and there is currently no impementation actually using it. We used it mostly for Analysis when we did the big change from tokens to attributes and thats already 10 years ago. So I have to bring myself up to state, mainly because the whole thing is "not easy to understand" as there are various code paths that could be affected. To say it short: changing method signatures of protected methods where you have no chance to enforce downstream classes to use new API are not easy. JDK avoids this, we have VirtualMethod :-) We should maybe only merge this to main branch for now and then work separately on backport.

uschindler avatar Oct 16 '22 11:10 uschindler

It seems to me it is a tool detecting whether there's difference in implementation, but still we are not able to avoid losing IndexSearcher once we've called a delegated method?

Yes, basically all classes that delegate rewrite to subqueries need to change to new API, so we still need to backport everything to 9.x and only detect cases where somebody uses the old api.

Of course in the case that somebody writes a subclass that only implements the IndexReader variant of rewrite it won't use the searcher. If this user subclass then rewrites subqueries on its own it will call "rewrite(IndexReader)" and so it is subqueries won't use parallelization. This is not so dramatic as a new query in user's code will just never pass parallelization down its subqueries.

Let me think a few evening about it :-)

uschindler avatar Oct 16 '22 12:10 uschindler

Of course in the case that somebody writes a subclass that only implements the IndexReader variant of rewrite it won't use the searcher. If this user subclass then rewrites subqueries on its own it will call "rewrite(IndexReader)" and so it is subqueries won't use parallelization. This is not so dramatic as a new query in user's code will just never pass parallelization down its subqueries.

Using the same reasoning, I wonder if we'd solve 90% of the problem by propagating Query#rewrite(IndexSearcher) in BooleanQuery, ConstantScoreQuery and DisjunctionMaxQuery (the main queries that can be inner nodes in a query tree).

jpountz avatar Oct 18 '22 08:10 jpountz

About the backport to 9.x I will help soon, at moment I am not well due to COVID after a conference last week.

Thanks Uwe! No hurries and take care!

zhaih avatar Oct 19 '22 16:10 zhaih