lucene
lucene copied to clipboard
GITHUB-11838 Add api to allow concurrent query rewrite
Description
Issue: #11838
Initial Proposal (discarded)
- Added one public method to
Queryto allow passing in an executor, and default to just call single thread version - In
IndexSearcherwe always pass in the executor - I'm not sure whether we should add a parameter to control the behavior, I'm inclined to believe that users who passed in the executor are willing to make use of concurrent rewrite.
- Rewritten ConstantScoreQuery so that the query executor will be passed into inner query.
Updated Proposal
- Change signature of rewrite to
rewrite(IndexSearcher) - How did I migrate the usage:
- Use Intellij to do preliminary refactoring for me
- For test usage, use searcher whenever is available, otherwise create one using
newSearcher(reader) - For very few non-test classes which doesn't have IndexSearcher available but called rewrite, create a searcher using
new IndexSearcher(reader), tried my best to avoid creating it recurrently (Especially inFieldQuery) - For queries who have implemented the rewrite and uses some part of reader's functionality, use shortcut method when possible, otherwise pull out the reader from indexSearcher.
We already have one class that wraps an IndexReader and an Executor: IndexSearcher. Should this new rewrite method take an IndexSearcher instead of an IndexReader and an Executor?
(and should it replace the existing rewrite method instead of just adding another one?)
Hi @jpountz thank for taking a look!
We already have one class that wraps an IndexReader and an Executor: IndexSearcher. Should this new rewrite method take an IndexSearcher instead of an IndexReader and an Executor?
I think it'd be a little weird to know that query rewriting requires an IndexSearcher from a user point of view? Also for people who're not using IndexSearcher provided by lucene (like we're in LinkedIn) it will be a bit inconvenience to create an IndexSearcher just for query rewriting?
(and should it replace the existing rewrite method instead of just adding another one?)
Yes I think replace is better, I previously wanted to make it into 9.x so don't want to change the API, are we able to make it in 9.x even if with the API change?
I don't think it'd be weird to require an IndexSearcher, Query#rewrite is essentially a way to improve caching and keep Query#createWeight simple. Given that Query#createWeight already creates an IndexSearcher, I'd assume that you already create an IndexSearcher even though you might not be using its search methods?
@jpountz You're right, our case is a bit complex since currently we're not even using Lucene's Query (but we're planning to in the future!) so I totally forgot the createWeight also takes an IndexSearcher. Yeah now it makes sense to me to change that API to use IndexSearcher as well, I'll try to get one out these days. Thank you for the discussion!
I think the general idea looks well.
This should me Lucene 10 only, as the changes in query API may break lots of code downstream. We could theoretically add a backwards combatiility layer using VirtualMethod (says the policeman), if that's wanted, I can help. VirtualMethod utility class would be needed to figure out which of both methods (rewrite(IndexReader) or rewrite(IndexSearcher)) was overridden by a subclass downstream.
Agreed on making this Lucene 10.
As far as bw compat is concerned, I was thinking of updating Lucene 9 this way:
- introduce
Query#rewrite(IndexSearcher)and make it delegate toQuery#rewrite(IndexReader) - keep existing implementations as they are, except possibly the few ones where we'd like to take advantage of the executor
- deprecate
Query#rewrite(IndexReader) - make
IndexSearcher#rewrite(Query)useQuery#rewrite(IndexSearcher)
Yeah I already made this one only on Lucene 10, I'll make a 9x PR according to Adrien's suggestion, it sounds not too complicated. (Thanks Uwe as well, good to know that we have a VirtualMethod class just for the back compatibility)
I tried created one but then realized it seems not working, because query rewriting are rewriting query from top of the tree to the bottom so once we delegate to rewrite(IndexReader) we're not able to call the indexSearcher one in the children.
I wonder if people are ok with this only happens after Lucene 10? Or @uschindler can you help evaluate using VirtualMethod? It seems to me it is a tool detecting whether there's difference in implementation, but still we are not able to avoid losing IndexSearcher once we've called a delegated method?
Hi @zhaih, the problem you describe is exactly why VirtualMethod is there. It allows to figure out which of both methods was overridden by a subclass at runtime. I can look into this next week. Actually it is long ago when we used VirtualMethod and there is currently no impementation actually using it. We used it mostly for Analysis when we did the big change from tokens to attributes and thats already 10 years ago. So I have to bring myself up to state, mainly because the whole thing is "not easy to understand" as there are various code paths that could be affected. To say it short: changing method signatures of protected methods where you have no chance to enforce downstream classes to use new API are not easy. JDK avoids this, we have VirtualMethod :-) We should maybe only merge this to main branch for now and then work separately on backport.
It seems to me it is a tool detecting whether there's difference in implementation, but still we are not able to avoid losing IndexSearcher once we've called a delegated method?
Yes, basically all classes that delegate rewrite to subqueries need to change to new API, so we still need to backport everything to 9.x and only detect cases where somebody uses the old api.
Of course in the case that somebody writes a subclass that only implements the IndexReader variant of rewrite it won't use the searcher. If this user subclass then rewrites subqueries on its own it will call "rewrite(IndexReader)" and so it is subqueries won't use parallelization. This is not so dramatic as a new query in user's code will just never pass parallelization down its subqueries.
Let me think a few evening about it :-)
Of course in the case that somebody writes a subclass that only implements the IndexReader variant of rewrite it won't use the searcher. If this user subclass then rewrites subqueries on its own it will call "rewrite(IndexReader)" and so it is subqueries won't use parallelization. This is not so dramatic as a new query in user's code will just never pass parallelization down its subqueries.
Using the same reasoning, I wonder if we'd solve 90% of the problem by propagating Query#rewrite(IndexSearcher) in BooleanQuery, ConstantScoreQuery and DisjunctionMaxQuery (the main queries that can be inner nodes in a query tree).
About the backport to 9.x I will help soon, at moment I am not well due to COVID after a conference last week.
Thanks Uwe! No hurries and take care!