Allow Collectors to re-order segments for non-exhaustive searches
Top-k searches using result pruning can already skip segments if they know that the best possible match in that segment is uncompetitive. We can take this a step further by looking at the minimum and maximum values of a field in each segment, and then ordering the segments such that those with more competitive values in general come earlier in the search. This is particularly useful for adversarial cases such as a query sorted by timestamp over an index with an inverse timestamp sort.
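The ordering idea can be sketched without any Lucene internals: model each segment by its per-field min/max, and for a descending sort visit segments with the highest maximum first, so early segments are most likely to fill the top-k queue with competitive hits and let later segments be skipped. The `Segment` record and `orderForDescendingSort` helper below are hypothetical stand-ins, not Lucene APIs; in Lucene the bounds would come from the segment's own metadata.

```java
import java.util.*;

public class SegmentOrdering {
    // Hypothetical stand-in for a segment plus the min/max of the sort field
    // within it; Lucene would read these bounds from the segment itself.
    record Segment(String name, long minValue, long maxValue) {}

    // For a descending sort, the segment with the highest maximum value is
    // the most likely to contain competitive hits, so search it first.
    static List<Segment> orderForDescendingSort(List<Segment> segments) {
        List<Segment> ordered = new ArrayList<>(segments);
        ordered.sort(Comparator.comparingLong(Segment::maxValue).reversed());
        return ordered;
    }

    public static void main(String[] args) {
        // Adversarial case from above: index sorted by ascending timestamp,
        // query sorted by descending timestamp, so the flush order is the
        // worst possible visit order.
        List<Segment> segments = List.of(
            new Segment("_0", 0, 100),
            new Segment("_1", 101, 200),
            new Segment("_2", 201, 300));
        System.out.println(orderForDescendingSort(segments));
    }
}
```

With the segments reordered, a pruning collector can often mark `_1` and `_0` as uncompetitive after collecting `_2`, instead of collecting all three in turn.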
Did you run any benchmarks to see how this impacts performance?
Not yet, I'm playing with luceneutil to see if I can add some adverse sort queries. I'll report numbers back here.
I added a couple of sorted MatchAll queries to wikimedium.10M.tasks and tested this out on an index sorted by lastMod. In this case it basically doesn't make any difference at all:
| Task | QPS baseline | StdDev | QPS my_modified_version | StdDev | Pct diff | p-value |
|---|---:|---:|---:|---:|---:|---:|
| MatchAllDateTimeDescSort | 29.30 | (37.7%) | 27.29 | (23.6%) | -6.9% (-49% - 87%) | 0.490 |
| HighTermDayOfYearSort | 42.40 | (10.8%) | 40.19 | (10.7%) | -5.2% (-24% - 18%) | 0.126 |
| TermDateTimeDescSort | 222.80 | (4.0%) | 219.28 | (4.4%) | -1.6% (-9% - 7%) | 0.236 |
| HighTermTitleBDVSort | 6.88 | (4.0%) | 6.87 | (3.2%) | -0.1% (-7% - 7%) | 0.904 |
| MatchAllDateTimeSort | 9.01 | (11.3%) | 9.04 | (9.6%) | 0.3% (-18% - 23%) | 0.921 |
| PKLookup | 130.26 | (2.3%) | 130.92 | (2.2%) | 0.5% (-3% - 5%) | 0.478 |
| TermDTSort | 52.50 | (11.2%) | 53.30 | (15.5%) | 1.5% (-22% - 31%) | 0.721 |
| HighTermMonthSort | 37.38 | (9.4%) | 39.34 | (9.2%) | 5.2% (-12% - 26%) | 0.074 |
The lastMod values are fairly evenly distributed between segments, so segment sorting doesn't really have an effect. I think a more interesting experiment would be with something like time series data where the input is naturally close to sorted and so the sort values in segments are mostly disjoint. I'll see if I can mock something up and run these tests again.
On the plus side, there doesn't seem to be a noticeable penalty for doing this sorting, so the escape hatch may not be necessary. But I want to make sure there are actual benefits as well!
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
When searching with limited concurrency, the optimization would probably be more impactful if the list of concurrent search tasks were also sorted before being submitted to the TaskExecutor. The slices() calculation already uses implicit sorting to evenly distribute the size of the leaf slices, so this could be added there instead, ordering both the partitions within each slice and the array of LeafSlices returned, so that the executor tries to search the slices in the optimal order.
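The slice-ordering idea can be sketched the same way: order the leaves inside each slice, then order the slices by their best leaf, so that an executor with fewer threads than slices starts with the most competitive work. `Leaf`, `Slice`, and `orderSlices` below are hypothetical stand-ins (assuming a descending sort and non-empty slices), not the actual `IndexSearcher.slices()` code.

```java
import java.util.*;

public class SliceOrdering {
    // Hypothetical stand-ins: a leaf with the max sort-field value it holds,
    // and a slice grouping several leaves for one search task.
    record Leaf(String name, long maxValue) {}
    record Slice(List<Leaf> leaves) {}

    // Sort leaves within each slice, then sort the slices by their best
    // (first) leaf, descending, so submission order matches competitiveness.
    static List<Slice> orderSlices(List<Slice> slices) {
        List<Slice> ordered = new ArrayList<>();
        for (Slice s : slices) {
            List<Leaf> leaves = new ArrayList<>(s.leaves());
            leaves.sort(Comparator.comparingLong(Leaf::maxValue).reversed());
            ordered.add(new Slice(leaves));
        }
        ordered.sort(Comparator.comparingLong(s -> -s.leaves().get(0).maxValue()));
        return ordered;
    }
}
```

If the executor can only run one slice at a time, the first task it dequeues is then the one most likely to raise the pruning threshold early for all later tasks.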