Provide the LeafSlice to CollectorManager.newCollector to save memory on small index slices [LUCENE-8542]
I have an index consisting of 44 million documents spread across 60 segments. When I run a query against this index with a huge number of results requested (e.g. 5 million), the query uses more than 5 GB of heap if the IndexSearcher was configured to use an ExecutorService.
(I know this kind of query is fairly unusual and it would be better to use paging and searchAfter, but our architecture does not allow this at the moment.)
The reason for the huge memory requirement is that the search will create a TopScoreDocCollector for each segment, each one with numHits = 5 million. This is fine for the large segments, but many of those segments are fairly small and only contain several thousand documents. This wastes a huge amount of memory for queries with large values of numHits on indices with many segments.
Therefore, I propose to change the CollectorManager interface in the following way:
- Change the method newCollector to accept a LeafSlice parameter that can be used to determine the total number of documents in the slice.
- Maybe, in order to remain backwards compatible, this could be introduced as a new method with a default implementation that delegates to the old one; otherwise, it probably has to wait for Lucene 8?
- This can then be used to cap numHits for each TopScoreDocCollector to the size of its LeafSlice (see the sketch below).
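For illustration, here is a rough sketch of what the proposal could look like; the interface and class names, the default method, and the capping helper are assumptions made for this sketch, not the attached patch.

```java
import java.io.IOException;
import java.util.Collection;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopScoreDocCollector;

// Hypothetical slice-aware variant of CollectorManager (names are illustrative).
interface SliceAwareCollectorManager<C extends Collector, T> {

  // existing factory method, unchanged
  C newCollector() throws IOException;

  // proposed addition: the searcher passes the LeafSlice it is about to search,
  // so implementations can size per-slice data structures; the default keeps
  // the old behavior for backwards compatibility
  default C newCollector(IndexSearcher.LeafSlice slice) throws IOException {
    return newCollector();
  }

  T reduce(Collection<C> collectors) throws IOException;
}

// Example of how an implementation could cap numHits per slice.
final class CappedTopScoreDocCollectorFactory {

  static TopScoreDocCollector create(IndexSearcher.LeafSlice slice, int numHits) {
    int sliceDocCount = 0;
    for (LeafReaderContext ctx : slice.leaves) { // LeafSlice exposes its leaf contexts
      sliceDocCount += ctx.reader().maxDoc();
    }
    // a slice can never yield more hits than it has documents, so the per-slice
    // priority queue does not need to be larger than the slice itself
    return TopScoreDocCollector.create(Math.min(numHits, sliceDocCount));
  }
}
```

With such a hook, each per-slice collector allocates at most as many priority-queue entries as its slice has documents.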
If this is something that would make sense for you, I can try to provide a patch.
Migrated from LUCENE-8542 by Christoph Kaser, updated Mar 13 2019. Attachments: LUCENE-8542.patch
Christoph Kaser (migrated from JIRA)
I attached a patch - hopefully this shows more clearly what I meant.
Since CollectorManager is marked as experimental, I think it might be possible to backport this patch to Lucene 7 as well, without providing a default implementation of the new method or keeping the old one.
Adrien Grand (@jpountz) (migrated from JIRA)
This doesn't look like a great solution to your problem? If I were you, I'd probably rather fork TopScoreDocCollector to dynamically grow the heap.
Lucene/Solr QA (migrated from JIRA)
| ✔ +1 overall |
|---|
| Vote | Subsystem | Runtime | Comment |
|---|---|---|---|
| | Prechecks | | |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| | master Compile Tests | | |
| +1 | compile | 1m 38s | master passed |
| | Patch Compile Tests | | |
| +1 | compile | 0m 51s | the patch passed |
| +1 | javac | 0m 51s | the patch passed |
| +1 | Release audit (RAT) | 0m 30s | the patch passed |
| +1 | Check forbidden APIs | 0m 20s | the patch passed |
| +1 | Validate source patterns | 0m 20s | the patch passed |
| | Other Tests | | |
| +1 | unit | 30m 12s | core in the patch passed. |
| +1 | unit | 3m 10s | facet in the patch passed. |
| | Total | 40m 30s | |
| Subsystem | Report/Notes |
|---|---|
| JIRA Issue | LUCENE-8542 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12950019/LUCENE-8542.patch |
| Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns |
| uname | Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
| git revision | master / 75b1831 |
| ant | version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018 |
| Default Java | 1.8.0_191 |
| Test Results | https://builds.apache.org/job/PreCommit-LUCENE-Build/127/testReport/ |
| modules | C: lucene lucene/core lucene/facet U: lucene |
| Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/127/console |
| Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |
This message was automatically generated.
Christoph Kaser (migrated from JIRA)
I think it would be nice to have the option to grow the heap dynamically. However, the way TopScoreDocCollector and TopDocsCollector are currently built, for a Lucene user that would mean copying the complete source code of those classes and adapting them to use a java.util.PriorityQueue (probably with worse performance than org.apache.lucene.util.PriorityQueue).
This is certainly possible, but it would mean a lot of code duplication (from the perspective of a Lucene user, because the priority queue implementation cannot easily be swapped out).
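For illustration, here is a rough sketch of what such a fork could look like when built on java.util.PriorityQueue, assuming the Lucene 8 collector API (Scorable, ScoreMode); it is not the actual TopScoreDocCollector code, ties between equal scores are not broken by doc id, and the results would still need to be sorted into a TopDocs.

```java
import java.io.IOException;
import java.util.PriorityQueue;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.Scorable;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.SimpleCollector;

// Rough sketch of a top-hits collector whose queue grows on demand instead of
// pre-allocating numHits entries up front.
final class GrowingTopScoreCollector extends SimpleCollector {

  private final int numHits;
  // min-heap ordered by score: the head is the least competitive hit seen so far
  private final PriorityQueue<ScoreDoc> queue =
      new PriorityQueue<>((a, b) -> Float.compare(a.score, b.score));
  private Scorable scorer;
  private int docBase;

  GrowingTopScoreCollector(int numHits) {
    this.numHits = numHits;
  }

  @Override
  public void setScorer(Scorable scorer) {
    this.scorer = scorer;
  }

  @Override
  protected void doSetNextReader(LeafReaderContext context) {
    this.docBase = context.docBase;
  }

  @Override
  public void collect(int doc) throws IOException {
    float score = scorer.score();
    if (queue.size() < numHits) {
      queue.offer(new ScoreDoc(docBase + doc, score)); // queue grows lazily
    } else if (score > queue.peek().score) {
      queue.poll();                                    // one pop plus one push per competitive hit,
      queue.offer(new ScoreDoc(docBase + doc, score)); // instead of a single updateTop()
    }
  }

  @Override
  public ScoreMode scoreMode() {
    return ScoreMode.COMPLETE; // scores are needed and all hits are visited
  }

  ScoreDoc[] hitsInHeapOrder() {
    return queue.toArray(new ScoreDoc[0]); // unordered; callers would still sort by score
  }
}
```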
I think this patch makes sense anyway: segment sizes vary widely in a typical index, and there are usually many more small segments than large ones. Given that the default implementation of IndexSearcher.slices() returns one slice per segment, that means a lot of wasted memory for all queries whose numHits is greater than the typical size of a small segment. I don't think it has any negative impact on queries with a small value of numHits, because it only adds one Math.min per segment.
It also helps with my problem: for an index with 28 segments and 13,360,068 documents and a search with numHits=5,000,000, it makes the difference between creating priority queues with a combined size of 140,000,000 vs. 13,360,068 entries. As you can see in the following table, there are benefits for searches with a more reasonable numHits value as well (all against my index):
| numHits | Combined size w/o patch | Combined size with patch |
|---|---|---|
| 10,000,000 | 280,000,000 | 13,360,068 |
| 5,000,000 | 140,000,000 | 13,360,068 |
| 1,000,000 | 28,000,000 | 6,870,854 |
| 100,000 | 2,800,000 | 1,632,997 |
| 50,000 | 1,400,000 | 1,015,274 |
| 10,000 | 280,000 | 252,528 |
Christoph Kaser (migrated from JIRA)
Is there anything I can change or add to get this committed? Or do you think it makes no sense for the general use case of Lucene?
Adrien Grand (@jpountz) (migrated from JIRA)
Indeed, I think this goes against the use case that Lucene is designed to solve, which is to compute the top-k matches of a query for some reasonable value of k. I'd be OK with e.g. having a collector in the sandbox that works better when collecting thousands of hits, but I'd like to avoid making changes to the core only to support this use case.
Christoph Kaser (migrated from JIRA)
That's too bad, given that this is only a minor change to an experimental API (and does not cause extra work in the reasonable use case). But I understand your reasons.
I may try to build such a collector when I find the time (though I suspect this may involve quite a lot of code duplication if no changes to the core can be made). For now, we simply limit the number of concurrent queries with huge values of numHits so that they fit into the heap.
Michael McCandless (@mikemccand) (migrated from JIRA)
I think the core API change is quite minor and reasonable – letting CollectorManager.newCollector know which segments (slice) it will collect. E.g. we already pass the LeafReaderContext to Collector.getLeafCollector, so it is informed about which segment it's about to collect.
I agree the motivating use case here is somewhat abusive, and a custom Collector is probably needed anyway, but I think this API change could help non-abusive cases too.
Alternatively we could explore fixing our default top hits collectors to not pre-allocate the full topN for every slice ... that is really unexpected behavior, and users have tripped up on this multiple times in the past causing us to make some partial fixes for it.
Adrien Grand (@jpountz) (migrated from JIRA)
I don't think this change really helps, as the number of documents in a slice is a pretty bad upper bound on the number of documents that will be collected. If we want to better support this use case, I'd rather we look into making PriorityQueue growable.
Christoph Kaser (migrated from JIRA)
While it's true the slice size is a bad upper bound, the change does help: As you can see in the table in my comment, it reduces the heap requirement by 90% in my use case, due to the large number of small slices.
Making PriorityQueue growable would certainly be a better solution; however, it is much harder to do without affecting performance in the "sane" use case.
Michael McCandless (@mikemccand) (migrated from JIRA)
Maybe we should try swapping in the JDK's PriorityQueue and measure if this really hurts search throughput?
Adrien Grand (@jpountz) (migrated from JIRA)
Right, I get how it can help with small slices, but at the same time I see small slices as something that should be avoided in order to limit context switching, so I don't think we should design for small slices.
I think the JDK's PriorityQueue would be fine most of the time, except in the worst case where every new hit is competitive, which happens when more recently indexed documents are more likely to be competitive (e.g. when sorting by decreasing timestamp, which I see often): a single reordering via updateTop would need to be replaced with one pop and one push. Can we maybe make our own PriorityQueue growable? I'm not too concerned about performance, given that only add() would be affected, not top() and updateTop(), which are the ones that matter for performance.
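To make that trade-off concrete, here is a small standalone sketch (not Lucene's org.apache.lucene.util.PriorityQueue, just an assumed minimal heap over raw float scores) in which only add() ever pays a growth cost, while top() and updateTop() remain allocation-free.

```java
import java.util.Arrays;

// Minimal growable binary min-heap of float scores; slot 0 is unused (1-based layout).
final class GrowableScoreHeap {

  private float[] heap = new float[16];
  private int size;

  void add(float value) {
    if (size + 1 >= heap.length) {
      heap = Arrays.copyOf(heap, heap.length * 2); // growth cost only here
    }
    heap[++size] = value;
    upHeap(size);
  }

  float top() {
    return heap[1]; // no allocation, no growth
  }

  void updateTop(float newValue) {
    heap[1] = newValue;
    downHeap(1); // single reordering, as with Lucene's updateTop()
  }

  int size() {
    return size;
  }

  private void upHeap(int i) {
    float node = heap[i];
    for (int parent = i >>> 1; parent > 0 && node < heap[parent]; i = parent, parent >>>= 1) {
      heap[i] = heap[parent]; // shift larger parent down
    }
    heap[i] = node;
  }

  private void downHeap(int i) {
    float node = heap[i];
    for (int child = i << 1; child <= size; i = child, child <<= 1) {
      if (child + 1 <= size && heap[child + 1] < heap[child]) {
        child++; // pick the smaller child
      }
      if (heap[child] >= node) {
        break;
      }
      heap[i] = heap[child]; // shift smaller child up
    }
    heap[i] = node;
  }
}
```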
Lucene/Solr QA (migrated from JIRA)
| ✔ +1 overall |
|---|
| Vote | Subsystem | Runtime | Comment |
|---|---|---|---|
| | Prechecks | | |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
| | master Compile Tests | | |
| +1 | compile | 0m 38s | master passed |
| | Patch Compile Tests | | |
| +1 | compile | 0m 39s | the patch passed |
| +1 | javac | 0m 39s | the patch passed |
| +1 | Release audit (RAT) | 0m 32s | the patch passed |
| +1 | Check forbidden APIs | 0m 27s | the patch passed |
| +1 | Validate source patterns | 0m 27s | the patch passed |
| | Other Tests | | |
| +1 | unit | 15m 30s | core in the patch passed. |
| +1 | unit | 1m 2s | facet in the patch passed. |
| | Total | 19m 49s | |
| Subsystem | Report/Notes |
|---|---|
| JIRA Issue | LUCENE-8542 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12950019/LUCENE-8542.patch |
| Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns |
| uname | Linux lucene1-us-west 4.4.0-137-generic #163~14.04.1-Ubuntu SMP Mon Sep 24 17:14:57 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
| git revision | master / d8f2a02 |
| ant | version: Apache Ant(TM) version 1.9.3 compiled on July 24 2018 |
| Default Java | 1.8.0_191 |
| Test Results | https://builds.apache.org/job/PreCommit-LUCENE-Build/174/testReport/ |
| modules | C: lucene lucene/core lucene/facet U: lucene |
| Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/174/console |
| Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |
This message was automatically generated.
Christoph Kaser (migrated from JIRA)
> Right, I get how it can help with small slices, but at the same time I see small slices as something that should be avoided in order to limit context switching, so I don't think we should design for small slices.
Small slices are the default: the default implementation of IndexSearcher.slices() returns one slice per segment. Since the search runs in an Executor, this may not cause a lot of context switching, depending on the thread pool parameters. But you are right, the default implementation of slices() may not be optimal.
Adrien Grand (@jpountz) (migrated from JIRA)
> the default implementation of slices() may not be optimal.
Let's change it?
Michael McCandless (@mikemccand) (migrated from JIRA)
+1 to improve slices() to aggregate small slices together by default; that's what we are doing in our production service – we combine up to 5 segments, up to 250K docs in aggregate.
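For illustration, a hedged sketch of an IndexSearcher subclass that overrides slices() to group small segments; the thresholds (5 segments, 250K docs), the greedy grouping strategy, and the class name are assumptions, not the actual production code mentioned above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ExecutorService;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.IndexSearcher;

// Hypothetical IndexSearcher that groups small segments into shared slices.
class AggregatingSliceIndexSearcher extends IndexSearcher {

  private static final int MAX_SEGMENTS_PER_SLICE = 5;   // assumed threshold
  private static final int MAX_DOCS_PER_SLICE = 250_000; // assumed threshold

  AggregatingSliceIndexSearcher(IndexReader reader, ExecutorService executor) {
    super(reader, executor);
  }

  @Override
  protected LeafSlice[] slices(List<LeafReaderContext> leaves) {
    // visit the largest segments first so they tend to get slices of their own
    List<LeafReaderContext> sorted = new ArrayList<>(leaves);
    sorted.sort(Comparator.comparingInt((LeafReaderContext ctx) -> ctx.reader().maxDoc()).reversed());

    List<LeafSlice> slices = new ArrayList<>();
    List<LeafReaderContext> current = new ArrayList<>();
    int currentDocs = 0;
    for (LeafReaderContext ctx : sorted) {
      int docs = ctx.reader().maxDoc();
      // start a new slice once the current one is full by either measure
      if (!current.isEmpty()
          && (current.size() >= MAX_SEGMENTS_PER_SLICE || currentDocs + docs > MAX_DOCS_PER_SLICE)) {
        slices.add(new LeafSlice(current.toArray(new LeafReaderContext[0])));
        current = new ArrayList<>();
        currentDocs = 0;
      }
      current.add(ctx);
      currentDocs += docs;
    }
    if (!current.isEmpty()) {
      slices.add(new LeafSlice(current.toArray(new LeafReaderContext[0])));
    }
    return slices.toArray(new LeafSlice[0]);
  }
}
```

The exact grouping used in production may differ; the point is only that slices() is the extension point for aggregating small segments.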
I have once again run across this issue, with numHits == 4,000,000, even though the slices are now aggregated as Michael suggested.
I would still advocate passing the slice size to avoid allocating unnecessarily large buffers per slice when numHits is large. Any chance at all to get this merged?