lucene icon indicating copy to clipboard operation
lucene copied to clipboard

LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager)

Open zacharymorn opened this issue 3 years ago • 7 comments

Description / Solution

Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager), with the following changes:

  1. Refactor out TopScoreDocCollectorManager, TopFieldCollectorManager, TotalHitCountCollectorManager and FixedBitSetCollector
  2. Refactor some tests to use the above collector manager
  3. Refactor DrillSideways to use extracted out collector managers

#11041

Tests

Passed updated tests with ./gradlew clean; ./gradlew check -Ptests.nightly=true (currently there are nocommit messages that are failing precommit though)

Checklist

Please review the following and check all that apply:

  • [x] I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • [x] I have created a Jira issue and added the issue ID to my pull request title.
  • [x] I have given Lucene maintainers access to contribute to my PR branch. (optional but recommended)
  • [x] I have developed this patch against the main branch.
  • [x] I have run ./gradlew check.
  • [ ] I have added tests for my changes.

zacharymorn avatar Aug 12 '21 05:08 zacharymorn

Left a few comments. Looks good overall. Lots of little changes! Those test cases... :)

Thanks @gsmiller for the review! Yes there are indeed many small changes, and more will come :D

zacharymorn avatar Aug 20 '21 07:08 zacharymorn

Hi @jpountz @gsmiller , just want to check back on this PR to see if you have any further feedback?

zacharymorn avatar Sep 10 '21 08:09 zacharymorn

Sorry @zacharymorn I have not forgotten about it but the change is very large and it's hard to find enough adjacent time to review it. I'll do my best to find time in the coming week.

jpountz avatar Sep 15 '21 16:09 jpountz

Sorry @zacharymorn I have not forgotten about it but the change is very large and it's hard to find enough adjacent time to review it. I'll do my best to find time in the coming week.

No worry @jpountz , this PR is indeed large and time consuming! Please take your time and I appreciate your help on the review!

zacharymorn avatar Sep 16 '21 06:09 zacharymorn

@zacharymorn @gsmiller If we try to do everything in a single PR, I worry that this will become very hard to review. I wonder if we should split by replacing the deprecation warnings of IndexSearcher#search(Query,Collector) and TopXXXCollector#create by some words about how the CollectorManager variants are recommended as they make it possible to search using multiple threads. Then when this PR is merged we can work more iteratively on replacing usage of Collector with CollectorManager, and we could add the deprecation warnings in the end once we have migrated all our code?

jpountz avatar Sep 16 '21 07:09 jpountz

I left some comments but the approach that you took so far looks good to me. I see that we still need to migrate some collectors like LargeNumHitsTopDocsCollector or the internal JoinUtil collectors to a CollectorManager?

Thanks @jpountz for the review! Indeed there are more that need to be migrated as well. In fact, I have been working on the 2nd PR to migrate more "easy" ones, and after that, we still have the following:

Explanation related

ExplanationAsserter MatchesAsserter

Facet related

FacetsCollector RandomSamplingFacetsCollector

Grouping related

GroupFacetCollector TopGroupsCollector BlockGroupingCollector FirstPassGroupingCollector AllGroupHeadsCollector AllGroupsCollector

Global ordinal related

GlobalOrdinalsCollector GlobalOrdinalsWithScoreCollector

Anonymous classes

Anonymous Collector used in JoinUtil Anonymous Collector used in TestJoinUtil Anonymous Collector used in QueryUtils

Profiling related

ProfilerCollector MemoryAccountingBitsetCollector MonitorQueryCollector

Counting related

DocValuesStatsCollector MyHitCollector MatchCollector

Result diversification related

DistinctValuesCollector DiversifiedTopDocsCollector

Other

CachingCollector TerminateAfterCollector TimeLimitingCollector LargeNumHitsTopDocsCollector

As each of these categories / collectors will likely require more thoughts / time to understand the current logic, and come up with CollectorManager implementation that's also thread-safe, I'm thinking to create follow-up Jira sub-tickets for each of these categories to track them. What do you think?

I wonder if we should split by replacing the deprecation warnings of IndexSearcher#search(Query,Collector) and TopXXXCollector#create by some words about how the CollectorManager variants are recommended as they make it possible to search using multiple threads. Then when this PR is merged we can work more iteratively on replacing usage of Collector with CollectorManager, and we could add the deprecation warnings in the end once we have migrated all our code?

Yeah I can definitely follow this strategy as well. The only concern I may have is that, as can be seen above, it may still take some time to migrate all existing collectors to collectorManagers. So using suggestion comment for now and adding deprecation warning at the end may likely see a few more collectors without collectorManagers added accidentally (i.e. folks may not notice the comment when they just code against the IndexSearch#search(Query, Collector) API)?

zacharymorn avatar Sep 17 '21 04:09 zacharymorn

hi @zacharymorn I looked at this issue a long while ago, before you started working on it, and I am now catching up. I see you made great progress on it! I also see that reviewing it as a single change is a bit of a challenge due to the size of the PR. Could I help out splitting this in smaller PRs so that we make progress towards getting the change in, step by step?

javanna avatar Jan 10 '22 10:01 javanna

Hi @zacharymorn -- this change is awesome! The world of servers has rapidly become massively concurrent and Lucene has (generally) been slow to adopt it. I like this hardish switch to the concurrent-friendly CollectorManager instead of the old single-threaded Collector. There was tons of good discussion and benchmarking results + iterations to improve them ... should we refresh this PR, resolve conflicts and finish? Thanks @zacharymorn!

mikemccand avatar Nov 02 '23 11:11 mikemccand

Thanks @mikemccand for reminding me on this PR, and sorry for missing your question earlier @javanna ! This has totally fallen out of my radar. @javanna Looking at the codebase, it seems you did make the suggested changes later on and I can probably close this PR ? If this PR is still needed, I can pick it up again.

zacharymorn avatar Nov 02 '23 16:11 zacharymorn

heya @zacharymorn I worked quite a bit on this last year. I should have addresses all of this little by little, although we are still not very close on deprecating search(Query, Collector). There is still quite a bit of work left to do to remove usages of it.

javanna avatar Nov 03 '23 14:11 javanna

Thanks @javanna for the quick confirmation! I will pick it back up in the next few days and see what still needs to be done then.

zacharymorn avatar Nov 03 '23 20:11 zacharymorn

Hi @mikemccand @jpountz @javanna @gsmiller , I have updated this PR to pick up the latest from main, as well as revert some changes to save them for follow-up PRs that address other collectors. This PR is now focused on the following:

  • Refactor out TopScoreDocCollectorManager, TopFieldCollectorManager
  • Refactor some tests to use the above collector manager

Could you please take a look, and let me know if you have any feedback?

zacharymorn avatar Nov 08 '23 07:11 zacharymorn

We are still a ways away (from seeing Lucene fully utilize available hardware concurrency available at search time to reduce query latencies)

For example: query concurrency is still tied to segment geometry, which is insane. An "optimized" (forceMerge to 1 segment) index loses all of its concurrency! Hideously, at my team (Amazon product search) we had to create a neutered merge policy that intentionally avoids "accidentally" making a lopsided index (where too many docs are in a single segment) because it harms our long pole query latencies (we run all queries concurrently, except close to red-line), due to this silly Lucene limitation.

We have a longstanding issue open for this: https://github.com/apache/lucene/issues/9721, and it ought not be so difficult to fix because Lucene already has strong support for one thread to search a "slice" of a big segment, i.e. we can make the thread work units slices of segments instead of whole segments or collections of segments (what IndexSearcher now calls slices, confusingly).

mikemccand avatar Nov 08 '23 10:11 mikemccand

We are still a ways away (from seeing Lucene fully utilize available hardware concurrency available at search time to reduce query latencies)

For example: query concurrency is still tied to segment geometry, which is insane. An "optimized" (forceMerge to 1 segment) index loses all of its concurrency! Hideously, at my team (Amazon product search) we had to create a neutered merge policy that intentionally avoids "accidentally" making a lopsided index (where too many docs are in a single segment) because it harms our long pole query latencies (we run all queries concurrently, except close to red-line), due to this silly Lucene limitation.

We have a longstanding issue open for this: #9721, and it ought not be so difficult to fix because Lucene already has strong support for one thread to search a "slice" of a big segment, i.e. we can make the thread work units slices of segments instead of whole segments or collections of segments (what IndexSearcher now calls slices, confusingly).

Thanks @mikemccand for sharing the additional insight and context! Looks like there's already a lot of good discussions in that ticket as well. Will definitely take a look at that and see what I can contribute, once most of the work for deprecating IndexSearch#search(Query, Collector) has been completed.

zacharymorn avatar Nov 09 '23 03:11 zacharymorn

Looks great, thanks @zacharymorn! I just think we should revert the lucene/benchmark change that breaks testing of a custom collector for this first step ...

Thanks @mikemccand for the review feedback and approval!

zacharymorn avatar Nov 14 '23 07:11 zacharymorn

Hi @mikemccand @jpountz @javanna @gsmiller , I have updated this PR to pick up the latest from main, as well as revert some changes to save them for follow-up PRs that address other collectors. This PR is now focused on the following:

  • Refactor out TopScoreDocCollectorManager, TopFieldCollectorManager
  • Refactor some tests to use the above collector manager

Could you please take a look, and let me know if you have any feedback?

Hi @jpountz @javanna @gsmiller, just want to have a quick follow-up to see if you may have any further feedback on this PR? I plan to merge it in the next few days if there's no major concern.

zacharymorn avatar Nov 15 '23 08:11 zacharymorn

Thanks for reviving this PR @zacharymorn ! the changes look good to me, having top score doc and top field collector managers sounds like a natural next step, and removes code duplication. I wish that we could avoid having the supportConcurrency flag in them, but that's ok for now, and I don't have a good alternative in mind to avoid unnecessary overhead when there's a single slice.

One thing that I wonder is whether we are ok already deprecating search(Query, Collector) given that we have a lot of usages still within Lucene. I was thinking that we may want to replace them all before deprecating, and then remove the method in main, but this is where I got distracted from this project last year and ended up dropping it after all. Should we consider getting your changes in, minus the deprecation, perhaps, and split the work on removing usages of the method?

javanna avatar Nov 16 '23 10:11 javanna

Thanks @javanna for the feedback!

One thing that I wonder is whether we are ok already deprecating search(Query, Collector) given that we have a lot of usages still within Lucene. I was thinking that we may want to replace them all before deprecating, and then remove the method in main, but this is where I got distracted from this project last year and ended up dropping it after all. Should we consider getting your changes in, minus the deprecation, perhaps, and split the work on removing usages of the method?

My current preference is actually to get the deprecation annotation in early, so that while we take on the remaining work to migrate other collector usage (which could take some time due to the amount of it), other contributors / application developers could start to take note of those deprecation messages, and stay away from using search(Query, Collector) in favor of search(Query, CollectorManager) as early as possible. Otherwise, there may be more and more of those usage being added into lucene and we will always be playing catch-up to migrate them until we deprecate those methods. What do you think?

zacharymorn avatar Nov 17 '23 00:11 zacharymorn

That is fine with me @zacharymorn . Indeed I have observed as well that there will be new usages introduced while we work on removing current usages, and deprecating early can help with that.

javanna avatar Nov 21 '23 12:11 javanna

heya @zacharymorn given that this is a deprecation, I guess you meant on backporting to branch_9x as well?

javanna avatar Nov 30 '23 19:11 javanna

Hi @javanna , I was actually thinking to have it for 10.0.0 (added an entry into that section in CHANGES.txt), as deprecating IndexSearcher#search(Query, Collector) has a rather large impact to lucene applications (not that it requires immediate changes, but the method itself is very frequently used and hence requires some attentions). But I can change it to branch_9x instead if that's preferred?

zacharymorn avatar Nov 30 '23 22:11 zacharymorn

I think we would do the removal in 10, as that's the type of change that we would make for a major release. Or can decide to delay the removal if need be, but delaying the deprecation kind of defeats the purpose of deprecating?

javanna avatar Dec 04 '23 19:12 javanna

I think we would do the removal in 10, as that's the type of change that we would make for a major release. Or can decide to delay the removal if need be, but delaying the deprecation kind of defeats the purpose of deprecating?

Do you know what's the timeline we have for releasing version 10? Based on my quick look over the last few days, most likely it will take a good few months for us to completely remove that usage in lucene codebase (I'm also a bit occupied at the moment), hence we may need to delay the removal if version 10 is planned to be released within like a year.

With the above said, like I shared earlier, I feel getting the deprecation warning message into our codebase as early as possible is critical for lucene contributors to start moving away from using that method, so that we won't be playing catch-up with ourselves and will be able to remove that method one day. Checking the deprecation warning into main already achieved this purpose. It will be great for lucene application developer to see this as early as possible as well, but its not as urgent (and it will take them longer time to migrate anyway), hence I was having it for version 10 given its potential impact. But I could change it to be released for 9x if that's still preferred.

zacharymorn avatar Dec 04 '23 20:12 zacharymorn

I think that it would be even better to get the deprecation out with 9x, regardless of whether we will effectively remove with 10. At least we let users know that they should use a different method. I don't see advantages around delaying a deprecation somehow. Thoughts @jpountz ?

javanna avatar Dec 06 '23 20:12 javanna

+1 to backport the deprecation message/tags to 9.x, even if we cannot fully remove our own internal usage of the deprecated APIs before releasing 10.0. The two can / should be decoupled since it's so much work?

Let's also open a dedicated spinoff issue to "remove all deprecated IS.search usage / IS.search methods in 10.0", and mark it as Blocker for now? If 10.0 wants to get out and we haven't done this blocker we can revisit how/when to do it, whether to downgrade, etc.?

mikemccand avatar Dec 07 '23 11:12 mikemccand

Thanks both @javanna @mikemccand . I have adjusted the change entry and opened a backport PR, as well as a new spinoff issue for 10.0.0 https://github.com/apache/lucene/issues/12892.

zacharymorn avatar Dec 07 '23 22:12 zacharymorn