Lucene's facets should tap into `IndexSearcher`'s `TaskExecutor` too?

Open mikemccand opened this issue 1 year ago • 2 comments

Description

Spinoff from the exciting discussion on https://github.com/apache/lucene/pull/13472:

Lucene has made great gains recently on intra-query concurrency: using multiple threads (where the unit of work is a "slice" of one or more segments) to reduce query latency. Besides the faster wall-clock time, running segments concurrently can also make Lucene more efficient: the more competitive hits from other segments arrive sooner, so each segment can terminate earlier or start skipping sooner, and less total CPU is spent getting the top hits for the query.

But I think Lucene's doc values and taxonomy facets do not use any concurrency? Even if you pass a `TaskExecutor` to `IndexSearcher`, facet counting will still run single-threaded. Can we fix this so that facet counting also gets faster (in net elapsed wall-clock time)? It's tricky because some facet counting implementations aggregate into data structures (like an `int[]` or an HPPC int->int map) that are not easily made thread-safe.

Note: we do have `ConcurrentSortedSetDocValuesFacetCounts`, which does use concurrency, but the other facet counting implementations (numeric ranges, taxonomy facets) do not. Also, `ConcurrentSortedSetDocValuesFacetCounts` takes its own `ExecutorService`, not a `TaskExecutor`.
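
One common way around the thread-safety problem is to give each slice its own private counts array and merge the arrays once all slices finish, so nothing is shared while counting. The following is only a minimal sketch of that idea in plain Java, not Lucene's API: `SliceFacetCountSketch`, `countSlice`, and `countConcurrently` are hypothetical names, segments are modeled as plain `int[]` arrays of facet ordinals, and a `java.util.concurrent.ExecutorService` stands in for whatever executor the searcher would provide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: none of these names exist in Lucene.
public class SliceFacetCountSketch {

    // Counts the ordinals of every segment in one slice into a private array.
    // The array is local to this call, so no synchronization is needed.
    static int[] countSlice(int[][] segmentsInSlice, int numOrds) {
        int[] counts = new int[numOrds];
        for (int[] segmentOrds : segmentsInSlice) {
            for (int ord : segmentOrds) {
                counts[ord]++;
            }
        }
        return counts;
    }

    // Runs one task per slice (not per segment), then merges the partial
    // counts on the calling thread.
    static int[] countConcurrently(List<int[][]> slices, int numOrds, ExecutorService executor)
            throws Exception {
        List<Future<int[]>> futures = new ArrayList<>();
        for (int[][] slice : slices) {
            Callable<int[]> task = () -> countSlice(slice, numOrds);
            futures.add(executor.submit(task));
        }
        int[] total = new int[numOrds];
        for (Future<int[]> future : futures) {
            int[] partial = future.get();
            for (int i = 0; i < numOrds; i++) {
                total[i] += partial[i];
            }
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        try {
            // Two slices: the first holds two segments, the second holds one.
            List<int[][]> slices = List.of(
                new int[][] {{0, 1, 1}, {2}},
                new int[][] {{1, 2, 2, 2}});
            System.out.println(Arrays.toString(countConcurrently(slices, 3, executor)));
        } finally {
            executor.shutdown();
        }
    }
}
```

The trade-off in this sketch is memory: each slice allocates a full `numOrds`-sized array, which could be costly for high-cardinality facet fields, so a sparse map per slice might be the better choice there.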

mikemccand avatar Jun 18 '24 13:06 mikemccand

> Also, ConcurrentSortedSetDocValuesFacetCounts takes its own ExecutorService

It also creates one task per segment instead of reusing the IndexSearcher slices, which would be nice to fix too.

(More generally, I wish the facet module behaved a bit more like a regular Lucene Collector, instead of first loading all hits into a bitset and doing the work in a second phase, which is memory-intensive and means that it doesn't automatically benefit from IndexSearcher's existing features like inter-segment concurrency, dynamic pruning, timeout support, or upcoming features like I/O concurrency and intra-segment concurrency.)

jpountz avatar Jun 19 '24 12:06 jpountz

> (More generally, I wish the facet module behaved a bit more like a regular Lucene Collector, instead of first loading all hits into a bitset and doing the work in a second phase, which is memory-intensive and means that it doesn't automatically benefit from IndexSearcher's existing features like inter-segment concurrency, dynamic pruning, timeout support, or upcoming features like I/O concurrency and intra-segment concurrency.)

+1 -- then facets would auto-magically get these awesome concurrency improvements in Lucene.

mikemccand avatar Jun 20 '24 12:06 mikemccand