Allow Collectors to re-order segments for non-exhaustive searches

Open romseygeek opened this issue 1 month ago • 3 comments

Top-k searches using result pruning can already skip segments if they know that the best possible match in that segment is uncompetitive. We can take this a step further by looking at the minimum and maximum values of a field in each segment, and then ordering the segments such that those with more competitive values in general come earlier in the search. This is particularly useful for adversarial cases such as a query sorted by timestamp over an index with an inverse timestamp sort.
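
As a rough illustration of the ordering criterion (not the actual implementation in this PR), a segment's competitiveness for a descending sort on a long field could be estimated from its point index. The helper below is hypothetical and assumes the sort field is indexed as a single-dimension LongPoint:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PointValues;

// Hypothetical helper: visit segments with the highest maximum value of the
// sort field first, so that a top-k query sorted descending on that field
// fills its queue with competitive hits early and can prune later segments.
final class CompetitiveSegmentOrder {

  static List<LeafReaderContext> byMaxValueDescending(IndexReader reader, String field) {
    List<LeafReaderContext> leaves = new ArrayList<>(reader.leaves());
    leaves.sort(Comparator.comparingLong((LeafReaderContext ctx) -> maxValue(ctx, field)).reversed());
    return leaves;
  }

  private static long maxValue(LeafReaderContext ctx, String field) {
    try {
      PointValues points = ctx.reader().getPointValues(field);
      if (points == null) {
        return Long.MIN_VALUE; // segment has no values for this field: search it last
      }
      // assumes the field is indexed as a single-dimension LongPoint
      return LongPoint.decodeDimension(points.getMaxPackedValue(), 0);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

For an ascending sort the comparison would use getMinPackedValue() instead.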

romseygeek avatar Nov 19 '25 16:11 romseygeek

Did you run any benchmarks to see how this impacts performance?

Not yet; I'm playing with luceneutil to see if I can add some adverse sort queries. I'll report numbers back here.

romseygeek avatar Nov 20 '25 09:11 romseygeek

I added a couple of sorted MatchAll queries to wikimedium.10M.tasks and tested this out on an index sorted by lastMod. In this case it basically doesn't make any difference at all:

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
        MatchAllDateTimeDescSort       29.30     (37.7%)       27.29     (23.6%)   -6.9% ( -49% -   87%) 0.490
           HighTermDayOfYearSort       42.40     (10.8%)       40.19     (10.7%)   -5.2% ( -24% -   18%) 0.126
            TermDateTimeDescSort      222.80      (4.0%)      219.28      (4.4%)   -1.6% (  -9% -    7%) 0.236
            HighTermTitleBDVSort        6.88      (4.0%)        6.87      (3.2%)   -0.1% (  -7% -    7%) 0.904
            MatchAllDateTimeSort        9.01     (11.3%)        9.04      (9.6%)    0.3% ( -18% -   23%) 0.921
                        PKLookup      130.26      (2.3%)      130.92      (2.2%)    0.5% (  -3% -    5%) 0.478
                      TermDTSort       52.50     (11.2%)       53.30     (15.5%)    1.5% ( -22% -   31%) 0.721
               HighTermMonthSort       37.38      (9.4%)       39.34      (9.2%)    5.2% ( -12% -   26%) 0.074

The lastMod values are fairly evenly distributed between segments, so segment sorting doesn't really have an effect. I think a more interesting experiment would be with something like time series data where the input is naturally close to sorted and so the sort values in segments are mostly disjoint. I'll see if I can mock something up and run these tests again.
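
For reference, one way such a time-series-style index could be mocked up (a sketch only, not the luceneutil harness; the field name, document count and commit cadence are arbitrary) is to ingest monotonically increasing timestamps and commit periodically with merging disabled, so that each segment covers a mostly disjoint range:

```java
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.store.FSDirectory;

// Ingest monotonically increasing timestamps and commit periodically without
// merging, so each segment covers a (mostly) disjoint range of "lastMod"
// values -- the shape of data where reordering segments should pay off.
public class TimeSeriesIndexMock {
  public static void main(String[] args) throws IOException {
    IndexWriterConfig config =
        new IndexWriterConfig().setMergePolicy(NoMergePolicy.INSTANCE);
    try (FSDirectory dir = FSDirectory.open(Paths.get("/tmp/timeseries-index"));
        IndexWriter writer = new IndexWriter(dir, config)) {
      long timestamp = 0;
      for (int i = 0; i < 1_000_000; i++) {
        Document doc = new Document();
        timestamp += 1; // near-sorted input, like live event ingestion
        doc.add(new LongPoint("lastMod", timestamp));
        doc.add(new NumericDocValuesField("lastMod", timestamp)); // for sorting
        writer.addDocument(doc);
        if ((i + 1) % 100_000 == 0) {
          writer.commit(); // flush a segment holding a new, higher range of values
        }
      }
    }
  }
}
```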

On the plus side, it seems that there isn't a noticeable penalty for doing this sorting, so the escape hatch may not be necessary. But I want to make sure that there are actual benefits as well!

romseygeek avatar Dec 01 '25 10:12 romseygeek

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions[bot] avatar Dec 16 '25 00:12 github-actions[bot]

When there is less concurrency available for searching, the optimization would probably be more impactful if the list of concurrent search tasks was also sorted before being submitted to the TaskExecutor. The slices() calculation already uses implicit sorting to evenly distribute the leaf slices by size, so the ordering could be added there instead, ordering both the partitions within each slice and the returned array of LeafSlices, so that the executor tries to search the slices in the optimal order.
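
A sketch of what that could look like, prototyped from outside IndexSearcher rather than inside the slices() calculation itself (LeafSlice internals differ between versions, so this only reorders the leaves handed to the default slicing logic; the field name and descending-sort assumption are placeholders):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Executor;

import org.apache.lucene.document.LongPoint;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PointValues;
import org.apache.lucene.search.IndexSearcher;

// Hypothetical searcher: hand the leaves to the default slicing logic in
// "most competitive first" order for a descending sort on SORT_FIELD, so the
// executor tends to pick up the most promising slices first.
public class OrderedSliceSearcher extends IndexSearcher {

  // placeholder field name; a static constant avoids relying on instance state
  // in case slices() is invoked during construction
  private static final String SORT_FIELD = "lastMod";

  public OrderedSliceSearcher(IndexReader reader, Executor executor) {
    super(reader, executor);
  }

  @Override
  protected LeafSlice[] slices(List<LeafReaderContext> leaves) {
    List<LeafReaderContext> ordered = new ArrayList<>(leaves);
    ordered.sort(Comparator.comparingLong(OrderedSliceSearcher::maxSortValue).reversed());
    return super.slices(ordered);
  }

  private static long maxSortValue(LeafReaderContext ctx) {
    try {
      PointValues points = ctx.reader().getPointValues(SORT_FIELD);
      return points == null
          ? Long.MIN_VALUE
          : LongPoint.decodeDimension(points.getMaxPackedValue(), 0);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Sorting the returned LeafSlice array by the best value of its leading leaf would be the complementary step, but that needs access to the slice's leaves or partitions, which is version-dependent.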

HoustonPutman avatar Dec 16 '25 21:12 HoustonPutman

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

github-actions[bot] avatar Dec 31 '25 00:12 github-actions[bot]