lucene
lucene copied to clipboard
LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager)
Description / Solution
Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager), with the following changes:
- Refactor out TopScoreDocCollectorManager, TopFieldCollectorManager, TotalHitCountCollectorManager and FixedBitSetCollector
- Refactor some tests to use the above collector manager
- Refactor DrillSideways to use extracted out collector managers
#11041
Tests
Passed updated tests with ./gradlew clean; ./gradlew check -Ptests.nightly=true
(currently there are nocommit messages that are failing precommit though)
Checklist
Please review the following and check all that apply:
- [x] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the
main
branch. - [x] I have run
./gradlew check
. - [ ] I have added tests for my changes.
Left a few comments. Looks good overall. Lots of little changes! Those test cases... :)
Thanks @gsmiller for the review! Yes there are indeed many small changes, and more will come :D
Hi @jpountz @gsmiller , just want to check back on this PR to see if you have any further feedback?
Sorry @zacharymorn I have not forgotten about it but the change is very large and it's hard to find enough adjacent time to review it. I'll do my best to find time in the coming week.
Sorry @zacharymorn I have not forgotten about it but the change is very large and it's hard to find enough adjacent time to review it. I'll do my best to find time in the coming week.
No worry @jpountz , this PR is indeed large and time consuming! Please take your time and I appreciate your help on the review!
@zacharymorn @gsmiller If we try to do everything in a single PR, I worry that this will become very hard to review. I wonder if we should split by replacing the deprecation warnings of IndexSearcher#search(Query,Collector)
and TopXXXCollector#create
by some words about how the CollectorManager variants are recommended as they make it possible to search using multiple threads. Then when this PR is merged we can work more iteratively on replacing usage of Collector with CollectorManager, and we could add the deprecation warnings in the end once we have migrated all our code?
I left some comments but the approach that you took so far looks good to me. I see that we still need to migrate some collectors like LargeNumHitsTopDocsCollector or the internal JoinUtil collectors to a CollectorManager?
Thanks @jpountz for the review! Indeed there are more that need to be migrated as well. In fact, I have been working on the 2nd PR to migrate more "easy" ones, and after that, we still have the following:
Explanation related
ExplanationAsserter
MatchesAsserter
Facet related
FacetsCollector
RandomSamplingFacetsCollector
Grouping related
GroupFacetCollector
TopGroupsCollector
BlockGroupingCollector
FirstPassGroupingCollector
AllGroupHeadsCollector
AllGroupsCollector
Global ordinal related
GlobalOrdinalsCollector
GlobalOrdinalsWithScoreCollector
Anonymous classes
Anonymous Collector
used in JoinUtil
Anonymous Collector
used in TestJoinUtil
Anonymous Collector
used in QueryUtils
Profiling related
ProfilerCollector
MemoryAccountingBitsetCollector
MonitorQueryCollector
Counting related
DocValuesStatsCollector
MyHitCollector
MatchCollector
Result diversification related
DistinctValuesCollector
DiversifiedTopDocsCollector
Other
CachingCollector
TerminateAfterCollector
TimeLimitingCollector
LargeNumHitsTopDocsCollector
As each of these categories / collectors will likely require more thoughts / time to understand the current logic, and come up with CollectorManager
implementation that's also thread-safe, I'm thinking to create follow-up Jira sub-tickets for each of these categories to track them. What do you think?
I wonder if we should split by replacing the deprecation warnings of IndexSearcher#search(Query,Collector) and TopXXXCollector#create by some words about how the CollectorManager variants are recommended as they make it possible to search using multiple threads. Then when this PR is merged we can work more iteratively on replacing usage of Collector with CollectorManager, and we could add the deprecation warnings in the end once we have migrated all our code?
Yeah I can definitely follow this strategy as well. The only concern I may have is that, as can be seen above, it may still take some time to migrate all existing collectors to collectorManagers. So using suggestion comment for now and adding deprecation warning at the end may likely see a few more collectors without collectorManagers added accidentally (i.e. folks may not notice the comment when they just code against the IndexSearch#search(Query, Collector)
API)?
hi @zacharymorn I looked at this issue a long while ago, before you started working on it, and I am now catching up. I see you made great progress on it! I also see that reviewing it as a single change is a bit of a challenge due to the size of the PR. Could I help out splitting this in smaller PRs so that we make progress towards getting the change in, step by step?
Hi @zacharymorn -- this change is awesome! The world of servers has rapidly become massively concurrent and Lucene has (generally) been slow to adopt it. I like this hardish switch to the concurrent-friendly CollectorManager
instead of the old single-threaded Collector
. There was tons of good discussion and benchmarking results + iterations to improve them ... should we refresh this PR, resolve conflicts and finish? Thanks @zacharymorn!
Thanks @mikemccand for reminding me on this PR, and sorry for missing your question earlier @javanna ! This has totally fallen out of my radar. @javanna Looking at the codebase, it seems you did make the suggested changes later on and I can probably close this PR ? If this PR is still needed, I can pick it up again.
heya @zacharymorn I worked quite a bit on this last year. I should have addresses all of this little by little, although we are still not very close on deprecating search(Query, Collector). There is still quite a bit of work left to do to remove usages of it.
Thanks @javanna for the quick confirmation! I will pick it back up in the next few days and see what still needs to be done then.
Hi @mikemccand @jpountz @javanna @gsmiller , I have updated this PR to pick up the latest from main
, as well as revert some changes to save them for follow-up PRs that address other collectors. This PR is now focused on the following:
- Refactor out TopScoreDocCollectorManager, TopFieldCollectorManager
- Refactor some tests to use the above collector manager
Could you please take a look, and let me know if you have any feedback?
We are still a ways away (from seeing Lucene fully utilize available hardware concurrency available at search time to reduce query latencies)
For example: query concurrency is still tied to segment geometry, which is insane. An "optimized" (forceMerge to 1 segment
) index loses all of its concurrency! Hideously, at my team (Amazon product search) we had to create a neutered merge policy that intentionally avoids "accidentally" making a lopsided index (where too many docs are in a single segment) because it harms our long pole query latencies (we run all queries concurrently, except close to red-line), due to this silly Lucene limitation.
We have a longstanding issue open for this: https://github.com/apache/lucene/issues/9721, and it ought not be so difficult to fix because Lucene already has strong support for one thread to search a "slice" of a big segment, i.e. we can make the thread work units slices of segments instead of whole segments or collections of segments (what IndexSearcher now calls slices, confusingly).
We are still a ways away (from seeing Lucene fully utilize available hardware concurrency available at search time to reduce query latencies)
For example: query concurrency is still tied to segment geometry, which is insane. An "optimized" (
forceMerge to 1 segment
) index loses all of its concurrency! Hideously, at my team (Amazon product search) we had to create a neutered merge policy that intentionally avoids "accidentally" making a lopsided index (where too many docs are in a single segment) because it harms our long pole query latencies (we run all queries concurrently, except close to red-line), due to this silly Lucene limitation.We have a longstanding issue open for this: #9721, and it ought not be so difficult to fix because Lucene already has strong support for one thread to search a "slice" of a big segment, i.e. we can make the thread work units slices of segments instead of whole segments or collections of segments (what IndexSearcher now calls slices, confusingly).
Thanks @mikemccand for sharing the additional insight and context! Looks like there's already a lot of good discussions in that ticket as well. Will definitely take a look at that and see what I can contribute, once most of the work for deprecating IndexSearch#search(Query, Collector)
has been completed.
Looks great, thanks @zacharymorn! I just think we should revert the
lucene/benchmark
change that breaks testing of a custom collector for this first step ...
Thanks @mikemccand for the review feedback and approval!
Hi @mikemccand @jpountz @javanna @gsmiller , I have updated this PR to pick up the latest from
main
, as well as revert some changes to save them for follow-up PRs that address other collectors. This PR is now focused on the following:
- Refactor out TopScoreDocCollectorManager, TopFieldCollectorManager
- Refactor some tests to use the above collector manager
Could you please take a look, and let me know if you have any feedback?
Hi @jpountz @javanna @gsmiller, just want to have a quick follow-up to see if you may have any further feedback on this PR? I plan to merge it in the next few days if there's no major concern.
Thanks for reviving this PR @zacharymorn ! the changes look good to me, having top score doc and top field collector managers sounds like a natural next step, and removes code duplication. I wish that we could avoid having the supportConcurrency flag in them, but that's ok for now, and I don't have a good alternative in mind to avoid unnecessary overhead when there's a single slice.
One thing that I wonder is whether we are ok already deprecating search(Query, Collector) given that we have a lot of usages still within Lucene. I was thinking that we may want to replace them all before deprecating, and then remove the method in main, but this is where I got distracted from this project last year and ended up dropping it after all. Should we consider getting your changes in, minus the deprecation, perhaps, and split the work on removing usages of the method?
Thanks @javanna for the feedback!
One thing that I wonder is whether we are ok already deprecating search(Query, Collector) given that we have a lot of usages still within Lucene. I was thinking that we may want to replace them all before deprecating, and then remove the method in main, but this is where I got distracted from this project last year and ended up dropping it after all. Should we consider getting your changes in, minus the deprecation, perhaps, and split the work on removing usages of the method?
My current preference is actually to get the deprecation annotation in early, so that while we take on the remaining work to migrate other collector usage (which could take some time due to the amount of it), other contributors / application developers could start to take note of those deprecation messages, and stay away from using search(Query, Collector)
in favor of search(Query, CollectorManager)
as early as possible. Otherwise, there may be more and more of those usage being added into lucene and we will always be playing catch-up to migrate them until we deprecate those methods. What do you think?
That is fine with me @zacharymorn . Indeed I have observed as well that there will be new usages introduced while we work on removing current usages, and deprecating early can help with that.
heya @zacharymorn given that this is a deprecation, I guess you meant on backporting to branch_9x as well?
Hi @javanna , I was actually thinking to have it for 10.0.0
(added an entry into that section in CHANGES.txt
), as deprecating IndexSearcher#search(Query, Collector)
has a rather large impact to lucene applications (not that it requires immediate changes, but the method itself is very frequently used and hence requires some attentions). But I can change it to branch_9x instead if that's preferred?
I think we would do the removal in 10, as that's the type of change that we would make for a major release. Or can decide to delay the removal if need be, but delaying the deprecation kind of defeats the purpose of deprecating?
I think we would do the removal in 10, as that's the type of change that we would make for a major release. Or can decide to delay the removal if need be, but delaying the deprecation kind of defeats the purpose of deprecating?
Do you know what's the timeline we have for releasing version 10? Based on my quick look over the last few days, most likely it will take a good few months for us to completely remove that usage in lucene codebase (I'm also a bit occupied at the moment), hence we may need to delay the removal if version 10 is planned to be released within like a year.
With the above said, like I shared earlier, I feel getting the deprecation warning message into our codebase as early as possible is critical for lucene contributors to start moving away from using that method, so that we won't be playing catch-up with ourselves and will be able to remove that method one day. Checking the deprecation warning into main
already achieved this purpose. It will be great for lucene application developer to see this as early as possible as well, but its not as urgent (and it will take them longer time to migrate anyway), hence I was having it for version 10 given its potential impact. But I could change it to be released for 9x if that's still preferred.
I think that it would be even better to get the deprecation out with 9x, regardless of whether we will effectively remove with 10. At least we let users know that they should use a different method. I don't see advantages around delaying a deprecation somehow. Thoughts @jpountz ?
+1 to backport the deprecation message/tags to 9.x, even if we cannot fully remove our own internal usage of the deprecated APIs before releasing 10.0. The two can / should be decoupled since it's so much work?
Let's also open a dedicated spinoff issue to "remove all deprecated IS.search usage / IS.search methods in 10.0", and mark it as Blocker for now? If 10.0 wants to get out and we haven't done this blocker we can revisit how/when to do it, whether to downgrade, etc.?
Thanks both @javanna @mikemccand . I have adjusted the change entry and opened a backport PR, as well as a new spinoff issue for 10.0.0
https://github.com/apache/lucene/issues/12892.