lucene
lucene copied to clipboard
Reduce the number of comparisons when lowerPoint is equal to upperPoint
Description
When lowerPoint is equal to upperPoint. In fact, there is no need to compare lowerPoint and upperPoint at the same time. The number of comparisons can be reduced by half when collecting document ids to construct bitsets and deeply traversing the bkd tree. During my reading of Elasticsearch related code, I found that when executing term or terms queries on the date field, the query is rewritten, and a single term is rewritten as a rang query (lowerTerm is the same as upperTerm). The terms query uses BooleanQuery to encapsulate multiple range queries. Therefore, it is more suitable for this scenario, reducing the number of comparisons and improving performance when collecting a large number of documents.
Related #14267
Addressed review comments, and benchmark results below:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
IntSet 437.40 (12.5%) 420.59 (7.3%) -3.8% ( -21% - 18%) 0.235
BrowseMonthTaxoFacets 2.55 (7.1%) 2.48 (6.7%) -2.7% ( -15% - 11%) 0.219
PKLookup 170.61 (5.2%) 166.40 (9.8%) -2.5% ( -16% - 13%) 0.322
BrowseMonthSSDVFacets 3.63 (19.5%) 3.54 (13.1%) -2.4% ( -29% - 37%) 0.641
Fuzzy1 72.71 (7.1%) 71.37 (6.8%) -1.8% ( -14% - 13%) 0.403
AndHighMedDayTaxoFacets 77.25 (4.2%) 76.05 (5.1%) -1.5% ( -10% - 8%) 0.295
LowIntervalsOrdered 23.31 (7.5%) 23.01 (7.7%) -1.3% ( -15% - 15%) 0.593
MedTermDayTaxoFacets 17.96 (4.7%) 17.77 (5.3%) -1.1% ( -10% - 9%) 0.495
HighSpanNear 9.69 (4.6%) 9.60 (4.6%) -1.0% ( -9% - 8%) 0.506
AndHighHigh 84.54 (4.0%) 83.74 (3.6%) -1.0% ( -8% - 6%) 0.425
LowSpanNear 25.14 (7.8%) 24.91 (8.3%) -0.9% ( -15% - 16%) 0.721
HighTermMonthSort 1133.10 (6.1%) 1123.71 (4.9%) -0.8% ( -11% - 10%) 0.637
MedPhrase 281.29 (4.5%) 279.24 (5.0%) -0.7% ( -9% - 9%) 0.629
HighIntervalsOrdered 12.78 (7.1%) 12.72 (6.5%) -0.5% ( -13% - 14%) 0.827
OrHighMedDayTaxoFacets 4.93 (5.5%) 4.91 (8.0%) -0.5% ( -13% - 13%) 0.830
AndHighMed 137.39 (2.1%) 136.76 (2.4%) -0.5% ( -4% - 4%) 0.519
BrowseDateSSDVFacets 0.63 (10.7%) 0.62 (13.3%) -0.4% ( -22% - 26%) 0.914
OrNotHighMed 360.87 (5.1%) 359.45 (3.8%) -0.4% ( -8% - 8%) 0.782
HighSloppyPhrase 13.07 (5.7%) 13.03 (5.0%) -0.3% ( -10% - 10%) 0.864
HighPhrase 95.78 (3.2%) 95.60 (2.4%) -0.2% ( -5% - 5%) 0.836
OrNotHighLow 1009.27 (4.0%) 1007.45 (4.3%) -0.2% ( -8% - 8%) 0.891
BrowseDayOfYearSSDVFacets 3.36 (16.2%) 3.35 (8.5%) -0.1% ( -21% - 29%) 0.972
TermDTSort 145.69 (2.9%) 145.68 (3.5%) -0.0% ( -6% - 6%) 0.992
Prefix3 660.06 (4.4%) 661.23 (4.1%) 0.2% ( -8% - 9%) 0.896
Wildcard 112.87 (3.7%) 113.14 (3.1%) 0.2% ( -6% - 7%) 0.828
HighTermTitleBDVSort 16.47 (5.8%) 16.51 (3.8%) 0.3% ( -8% - 10%) 0.860
OrHighLow 472.58 (3.0%) 473.97 (3.1%) 0.3% ( -5% - 6%) 0.760
IntNRQ 471.18 (3.0%) 472.58 (3.1%) 0.3% ( -5% - 6%) 0.760
BrowseRandomLabelTaxoFacets 1.89 (4.4%) 1.90 (4.2%) 0.3% ( -7% - 9%) 0.815
AndHighHighDayTaxoFacets 20.08 (4.3%) 20.16 (3.6%) 0.4% ( -7% - 8%) 0.769
BrowseDateTaxoFacets 2.30 (9.1%) 2.31 (10.7%) 0.4% ( -17% - 22%) 0.891
Respell 45.34 (5.1%) 45.55 (6.9%) 0.5% ( -11% - 13%) 0.805
Fuzzy2 45.72 (6.7%) 45.94 (6.2%) 0.5% ( -11% - 14%) 0.814
HighTermDayOfYearSort 152.97 (3.0%) 153.83 (2.1%) 0.6% ( -4% - 5%) 0.494
AndHighLow 1031.24 (4.7%) 1037.08 (4.0%) 0.6% ( -7% - 9%) 0.683
LowTerm 576.86 (2.6%) 580.90 (3.8%) 0.7% ( -5% - 7%) 0.493
BrowseDayOfYearTaxoFacets 2.31 (9.1%) 2.33 (9.5%) 0.8% ( -16% - 21%) 0.784
OrNotHighHigh 343.60 (4.9%) 346.55 (6.0%) 0.9% ( -9% - 12%) 0.617
MedIntervalsOrdered 6.80 (5.8%) 6.88 (6.9%) 1.2% ( -10% - 14%) 0.555
OrHighHigh 80.13 (7.8%) 81.08 (7.3%) 1.2% ( -12% - 17%) 0.618
LowPhrase 107.91 (14.1%) 109.76 (16.3%) 1.7% ( -25% - 37%) 0.723
OrHighMed 204.76 (3.0%) 208.43 (2.6%) 1.8% ( -3% - 7%) 0.044
HighTermTitleSort 59.78 (3.9%) 61.12 (3.2%) 2.2% ( -4% - 9%) 0.048
MedTerm 564.17 (4.8%) 577.08 (5.2%) 2.3% ( -7% - 12%) 0.150
BrowseRandomLabelSSDVFacets 2.14 (5.2%) 2.19 (5.4%) 2.3% ( -7% - 13%) 0.169
LowSloppyPhrase 9.31 (9.1%) 9.53 (8.4%) 2.4% ( -13% - 21%) 0.392
HighTerm 405.20 (3.1%) 415.04 (5.3%) 2.4% ( -5% - 11%) 0.075
OrHighNotLow 561.56 (4.4%) 577.04 (5.0%) 2.8% ( -6% - 12%) 0.064
MedSpanNear 48.54 (9.5%) 49.97 (8.1%) 3.0% ( -13% - 22%) 0.290
OrHighNotHigh 303.50 (4.7%) 313.29 (4.6%) 3.2% ( -5% - 13%) 0.028
MedSloppyPhrase 22.60 (9.1%) 23.41 (6.9%) 3.6% ( -11% - 21%) 0.164
range 1881.74 (9.2%) 1951.73 (11.7%) 3.7% ( -15% - 27%) 0.264
OrHighNotMed 337.02 (6.3%) 350.53 (7.1%) 4.0% ( -8% - 18%) 0.059
PointRangeQuery bottlenecks of adding doc IDs to a bit set rather than on comparisons, so I wonder if this optimization really helps in practice. On your benchmark run, I see no query with a speedup and a low p-value?
On your benchmark run, I see no query with a speedup and a low p-value?
I primarily wanted to ensure there is no regression. Do we have any benchmark queries with low == high?
PointRangeQuery bottlenecks of adding doc IDs to a bit set rather than on comparisons, so I wonder if this optimization really helps in practice.
I can see that it is hard to measure the impact of this change. Will try adding some low level benchmark and see if this really helps.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!