lucene
Try using Murmurhash 3 for bloom filters
Description
We currently use MurmurHash 2 (MurmurHash64.java) in the bloom filter implementation in Lucene, even though Lucene also has MurmurHash 3, the latest in the MurmurHash family of hash functions, which provides better performance, a better avalanche effect, etc. It also provides a 128-bit variant, which MurmurHash 2 doesn't. This PR aims to use MurmurHash 3 with bloom filters and see if that helps. Note: we are already using MurmurHash 3 in the BytesRefHash implementation (#6666).
Next steps :
- [DONE] Run luceneutil benchmarks and share the results
Below are the luceneutil benchmark results for wikimediumall. Looks all flat and good to me.
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
LowIntervalsOrdered 6.08 (5.0%) 6.00 (6.3%) -1.3% ( -12% - 10%) 0.463
TermDTSort 97.64 (4.6%) 96.36 (5.6%) -1.3% ( -11% - 9%) 0.417
OrHighHigh 28.51 (5.7%) 28.28 (5.7%) -0.8% ( -11% - 11%) 0.651
HighTermMonthSort 2287.85 (2.9%) 2270.76 (3.3%) -0.7% ( -6% - 5%) 0.447
MedTermDayTaxoFacets 14.70 (4.6%) 14.60 (4.0%) -0.7% ( -8% - 8%) 0.620
BrowseDayOfYearTaxoFacets 3.94 (28.1%) 3.91 (28.0%) -0.7% ( -44% - 77%) 0.941
Respell 54.80 (1.8%) 54.45 (2.2%) -0.6% ( -4% - 3%) 0.307
BrowseDateTaxoFacets 3.92 (27.5%) 3.89 (27.4%) -0.6% ( -43% - 74%) 0.942
BrowseMonthSSDVFacets 4.45 (10.1%) 4.43 (9.7%) -0.6% ( -18% - 21%) 0.845
MedSpanNear 91.41 (2.6%) 90.86 (3.7%) -0.6% ( -6% - 5%) 0.554
MedIntervalsOrdered 6.61 (3.7%) 6.57 (4.4%) -0.5% ( -8% - 7%) 0.671
HighIntervalsOrdered 6.73 (3.8%) 6.69 (4.2%) -0.5% ( -8% - 7%) 0.691
HighTermTitleSort 121.44 (4.0%) 120.97 (4.6%) -0.4% ( -8% - 8%) 0.778
HighSpanNear 5.31 (2.6%) 5.29 (3.2%) -0.4% ( -6% - 5%) 0.701
LowSpanNear 50.68 (2.6%) 50.51 (3.3%) -0.3% ( -6% - 5%) 0.731
OrHighMed 46.96 (3.3%) 46.83 (3.2%) -0.3% ( -6% - 6%) 0.788
OrHighNotLow 235.93 (5.2%) 235.59 (6.5%) -0.1% ( -11% - 12%) 0.937
HighSloppyPhrase 4.70 (4.7%) 4.70 (5.6%) -0.1% ( -9% - 10%) 0.931
BrowseMonthTaxoFacets 4.15 (35.3%) 4.15 (35.3%) -0.1% ( -52% - 108%) 0.991
Fuzzy2 54.22 (1.1%) 54.15 (1.3%) -0.1% ( -2% - 2%) 0.745
OrNotHighLow 517.03 (1.6%) 516.41 (1.6%) -0.1% ( -3% - 3%) 0.816
Fuzzy1 38.59 (1.2%) 38.56 (1.3%) -0.1% ( -2% - 2%) 0.808
BrowseRandomLabelSSDVFacets 2.77 (7.5%) 2.77 (7.5%) -0.0% ( -14% - 16%) 0.986
AndHighHigh 29.06 (3.2%) 29.06 (3.1%) -0.0% ( -6% - 6%) 0.991
AndHighMed 55.89 (2.9%) 55.91 (2.7%) 0.0% ( -5% - 5%) 0.970
OrHighLow 319.86 (2.2%) 320.00 (2.0%) 0.0% ( -4% - 4%) 0.945
IntNRQ 21.94 (2.5%) 21.96 (3.4%) 0.1% ( -5% - 6%) 0.905
LowPhrase 136.01 (4.1%) 136.20 (3.8%) 0.1% ( -7% - 8%) 0.908
OrNotHighMed 208.82 (3.6%) 209.20 (3.9%) 0.2% ( -7% - 8%) 0.879
OrNotHighHigh 241.45 (3.8%) 241.96 (4.9%) 0.2% ( -8% - 9%) 0.880
OrHighNotHigh 191.17 (4.6%) 191.65 (5.7%) 0.3% ( -9% - 11%) 0.878
AndHighLow 300.40 (3.0%) 301.26 (2.7%) 0.3% ( -5% - 6%) 0.754
BrowseDateSSDVFacets 0.90 (12.4%) 0.91 (10.9%) 0.3% ( -20% - 26%) 0.933
BrowseRandomLabelTaxoFacets 3.34 (24.3%) 3.36 (24.6%) 0.4% ( -39% - 65%) 0.963
OrHighNotMed 215.89 (5.5%) 216.69 (6.3%) 0.4% ( -10% - 12%) 0.843
Wildcard 61.64 (2.0%) 61.91 (2.2%) 0.4% ( -3% - 4%) 0.501
HighTermTitleBDVSort 5.43 (4.3%) 5.46 (4.5%) 0.5% ( -8% - 9%) 0.729
PKLookup 136.14 (1.9%) 136.97 (1.9%) 0.6% ( -3% - 4%) 0.310
HighTermDayOfYearSort 195.65 (4.1%) 196.91 (4.6%) 0.6% ( -7% - 9%) 0.641
MedTerm 455.25 (5.3%) 458.30 (5.8%) 0.7% ( -9% - 12%) 0.702
LowSloppyPhrase 30.11 (2.0%) 30.36 (1.8%) 0.8% ( -2% - 4%) 0.171
HighTerm 367.19 (6.1%) 370.21 (6.3%) 0.8% ( -10% - 14%) 0.675
LowTerm 449.59 (4.1%) 453.37 (4.8%) 0.8% ( -7% - 10%) 0.549
HighPhrase 20.31 (7.3%) 20.50 (6.4%) 0.9% ( -11% - 15%) 0.675
MedPhrase 22.76 (6.9%) 22.98 (6.1%) 1.0% ( -11% - 14%) 0.643
AndHighMedDayTaxoFacets 47.42 (1.8%) 47.88 (2.0%) 1.0% ( -2% - 4%) 0.109
OrHighMedDayTaxoFacets 1.83 (3.6%) 1.85 (3.9%) 1.0% ( -6% - 8%) 0.390
AndHighHighDayTaxoFacets 14.08 (2.4%) 14.27 (2.6%) 1.3% ( -3% - 6%) 0.084
MedSloppyPhrase 2.77 (4.0%) 2.81 (4.1%) 1.4% ( -6% - 9%) 0.282
BrowseDayOfYearSSDVFacets 3.72 (8.6%) 3.78 (11.9%) 1.8% ( -17% - 24%) 0.576
Prefix3 70.45 (8.3%) 72.84 (4.0%) 3.4% ( -8% - 17%) 0.099
@mikemccand rightly pointed out that luceneutil doesn't use bloom filter postings format by default and we should enable it for id field and rerun the benchmarks to see the impact. Will rerun and share the results here.
Maybe we should take advantage of this change to simplify this postings format, by no longer making the hash function configurable, removing the abstraction for hash functions, and cutting over to StringHelper to compute hashes instead of introducing a new implementation?
@jpountz Makes sense, +1 to remove the abstraction. Though I have a couple of points/questions here:
- Do you mean to change the StringHelper class to add support for a 128-bit hash, because it currently creates a 32-bit hash with Murmur 3? Or maybe moving BytesRefHash to also use a 128-bit hash?
- StringHelper doesn't seem like the most intuitive place for a hash function implementation like this. Do you think we should instead have or copy to something like HashHelper or HashUtil?
So I ran the luceneutil benchmarks with -idFieldPostingsFormat BloomFilter, but it failed because there was no delegate postings format and it couldn't find the right postings format class using SPI. I tweaked this code (pasted below) to use BloomFilteringPostingsFormat for the id field and also to use the codecs jar (similar to how it's done for core), and then everything worked.
public PostingsFormat getPostingsFormatForField(String field) {
  PostingsFormat pf = PostingsFormat.forName(defaultPostingsFormat);
  if (field.equals("id")) {
    return new BloomFilteringPostingsFormat(pf);
  }
  return pf;
}
Below are the wikimediumall benchmark results (run twice to get more confidence), which show a ~7-9% performance improvement for PKLookup with a p-value of 0.000.
Run # 1
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
BrowseRandomLabelSSDVFacets 2.77 (7.0%) 2.72 (5.1%) -1.7% ( -12% - 11%) 0.374
BrowseRandomLabelTaxoFacets 3.35 (21.8%) 3.30 (18.1%) -1.6% ( -34% - 48%) 0.803
LowSloppyPhrase 6.11 (2.9%) 6.05 (2.5%) -1.0% ( -6% - 4%) 0.248
HighTermTitleSort 128.04 (3.7%) 126.78 (3.1%) -1.0% ( -7% - 6%) 0.365
HighSloppyPhrase 13.43 (2.7%) 13.31 (2.6%) -0.9% ( -6% - 4%) 0.254
MedSloppyPhrase 4.72 (3.4%) 4.68 (2.6%) -0.8% ( -6% - 5%) 0.393
OrHighMedDayTaxoFacets 2.99 (5.2%) 2.97 (3.5%) -0.8% ( -9% - 8%) 0.582
BrowseDateTaxoFacets 3.85 (19.0%) 3.82 (17.7%) -0.8% ( -31% - 44%) 0.896
BrowseDayOfYearTaxoFacets 3.86 (19.0%) 3.83 (17.9%) -0.7% ( -31% - 44%) 0.900
OrHighMed 72.86 (2.0%) 72.58 (2.4%) -0.4% ( -4% - 4%) 0.586
OrHighHigh 20.04 (3.1%) 19.98 (4.3%) -0.3% ( -7% - 7%) 0.801
HighPhrase 24.66 (6.2%) 24.59 (6.6%) -0.3% ( -12% - 13%) 0.882
Prefix3 72.76 (5.0%) 72.59 (4.3%) -0.2% ( -9% - 9%) 0.873
MedTerm 379.26 (3.1%) 378.51 (5.1%) -0.2% ( -8% - 8%) 0.882
MedPhrase 18.99 (5.5%) 18.95 (5.9%) -0.2% ( -10% - 11%) 0.915
BrowseDateSSDVFacets 0.90 (7.7%) 0.89 (7.4%) -0.2% ( -14% - 16%) 0.944
LowPhrase 63.34 (2.4%) 63.27 (2.8%) -0.1% ( -5% - 5%) 0.887
HighSpanNear 6.18 (3.1%) 6.18 (3.2%) 0.0% ( -6% - 6%) 0.993
Fuzzy1 64.49 (1.1%) 64.52 (1.4%) 0.0% ( -2% - 2%) 0.919
LowSpanNear 15.60 (2.3%) 15.61 (2.5%) 0.1% ( -4% - 5%) 0.916
HighTermTitleBDVSort 4.90 (4.2%) 4.90 (3.8%) 0.1% ( -7% - 8%) 0.938
MedTermDayTaxoFacets 9.26 (5.6%) 9.27 (4.1%) 0.1% ( -9% - 10%) 0.948
AndHighMedDayTaxoFacets 13.99 (1.7%) 14.02 (1.5%) 0.2% ( -3% - 3%) 0.764
AndHighHighDayTaxoFacets 5.26 (2.7%) 5.27 (2.4%) 0.2% ( -4% - 5%) 0.819
Respell 29.96 (1.1%) 30.03 (1.8%) 0.2% ( -2% - 3%) 0.656
LowTerm 398.65 (2.5%) 399.67 (2.9%) 0.3% ( -5% - 5%) 0.765
MedSpanNear 38.84 (2.6%) 38.96 (3.0%) 0.3% ( -5% - 6%) 0.729
Wildcard 60.70 (1.6%) 60.89 (1.1%) 0.3% ( -2% - 3%) 0.463
HighTermDayOfYearSort 200.04 (2.5%) 200.70 (3.7%) 0.3% ( -5% - 6%) 0.740
OrHighLow 252.74 (2.0%) 253.62 (2.4%) 0.3% ( -3% - 4%) 0.613
OrNotHighHigh 157.04 (4.6%) 157.63 (4.2%) 0.4% ( -8% - 9%) 0.789
AndHighHigh 29.31 (2.4%) 29.42 (3.5%) 0.4% ( -5% - 6%) 0.678
OrNotHighLow 290.56 (2.1%) 291.81 (1.7%) 0.4% ( -3% - 4%) 0.475
OrNotHighMed 221.77 (3.5%) 222.84 (2.9%) 0.5% ( -5% - 7%) 0.633
OrHighNotHigh 167.01 (4.8%) 167.85 (4.5%) 0.5% ( -8% - 10%) 0.731
HighTerm 279.21 (4.3%) 280.66 (6.5%) 0.5% ( -9% - 11%) 0.767
AndHighLow 374.84 (1.9%) 377.10 (1.8%) 0.6% ( -3% - 4%) 0.308
HighTermMonthSort 2378.06 (3.6%) 2392.49 (4.1%) 0.6% ( -6% - 8%) 0.618
LowIntervalsOrdered 12.78 (2.4%) 12.86 (2.8%) 0.6% ( -4% - 6%) 0.443
OrHighNotLow 269.59 (4.8%) 271.34 (4.9%) 0.6% ( -8% - 10%) 0.672
IntNRQ 18.20 (5.9%) 18.32 (5.4%) 0.7% ( -10% - 12%) 0.709
AndHighMed 37.67 (2.5%) 37.93 (3.5%) 0.7% ( -5% - 6%) 0.459
MedIntervalsOrdered 1.80 (3.5%) 1.82 (3.7%) 0.8% ( -6% - 8%) 0.456
OrHighNotMed 248.23 (4.7%) 250.34 (4.5%) 0.9% ( -7% - 10%) 0.556
Fuzzy2 35.10 (1.2%) 35.42 (1.2%) 0.9% ( -1% - 3%) 0.016
BrowseMonthTaxoFacets 4.13 (30.7%) 4.17 (34.8%) 1.0% ( -49% - 96%) 0.924
BrowseMonthSSDVFacets 4.37 (9.6%) 4.45 (8.9%) 1.9% ( -15% - 22%) 0.506
HighIntervalsOrdered 1.58 (4.6%) 1.61 (5.6%) 2.0% ( -7% - 12%) 0.228
TermDTSort 96.97 (3.2%) 98.92 (5.0%) 2.0% ( -6% - 10%) 0.132
BrowseDayOfYearSSDVFacets 3.76 (9.6%) 3.85 (6.5%) 2.3% ( -12% - 20%) 0.366
PKLookup 106.69 (1.5%) 114.71 (1.5%) 7.5% ( 4% - 10%) 0.000
Run # 2
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
BrowseMonthTaxoFacets 4.08 (31.7%) 3.80 (1.8%) -6.7% ( -30% - 39%) 0.342
BrowseRandomLabelTaxoFacets 3.30 (21.9%) 3.16 (2.3%) -4.2% ( -23% - 25%) 0.393
BrowseDateTaxoFacets 3.85 (20.0%) 3.72 (6.1%) -3.5% ( -24% - 28%) 0.456
BrowseDayOfYearTaxoFacets 3.86 (20.0%) 3.73 (6.2%) -3.3% ( -24% - 28%) 0.476
OrHighHigh 26.35 (6.9%) 25.88 (6.8%) -1.8% ( -14% - 12%) 0.404
HighTermDayOfYearSort 209.99 (3.7%) 206.48 (4.1%) -1.7% ( -9% - 6%) 0.177
OrHighLow 273.41 (2.3%) 270.80 (3.0%) -1.0% ( -6% - 4%) 0.252
HighTermMonthSort 2346.36 (3.2%) 2326.36 (3.6%) -0.9% ( -7% - 6%) 0.427
HighTermTitleBDVSort 4.83 (3.7%) 4.80 (3.4%) -0.7% ( -7% - 6%) 0.522
Prefix3 584.55 (3.3%) 580.95 (3.1%) -0.6% ( -6% - 5%) 0.541
OrHighMed 78.07 (2.7%) 77.60 (2.9%) -0.6% ( -6% - 5%) 0.489
TermDTSort 92.32 (4.1%) 91.89 (4.4%) -0.5% ( -8% - 8%) 0.732
HighTermTitleSort 138.64 (2.4%) 138.05 (2.5%) -0.4% ( -5% - 4%) 0.580
AndHighHigh 20.00 (5.1%) 19.93 (4.3%) -0.4% ( -9% - 9%) 0.791
BrowseDateSSDVFacets 0.90 (9.5%) 0.90 (8.0%) -0.4% ( -16% - 18%) 0.891
Fuzzy2 39.56 (1.2%) 39.41 (1.4%) -0.4% ( -2% - 2%) 0.362
AndHighHighDayTaxoFacets 2.08 (4.1%) 2.07 (4.0%) -0.3% ( -8% - 8%) 0.798
Fuzzy1 66.33 (1.0%) 66.17 (1.1%) -0.2% ( -2% - 1%) 0.471
Wildcard 40.31 (3.9%) 40.22 (3.7%) -0.2% ( -7% - 7%) 0.856
HighSloppyPhrase 11.01 (1.8%) 11.00 (2.2%) -0.1% ( -4% - 3%) 0.852
OrHighNotLow 219.55 (7.5%) 219.38 (7.3%) -0.1% ( -13% - 15%) 0.974
Respell 50.51 (1.6%) 50.48 (1.5%) -0.1% ( -3% - 3%) 0.915
IntNRQ 18.46 (8.9%) 18.46 (9.1%) -0.0% ( -16% - 19%) 0.988
AndHighMed 82.85 (3.1%) 82.84 (2.5%) -0.0% ( -5% - 5%) 0.982
OrNotHighLow 512.93 (2.0%) 512.86 (1.9%) -0.0% ( -3% - 3%) 0.982
LowSpanNear 64.37 (2.4%) 64.44 (2.7%) 0.1% ( -4% - 5%) 0.886
OrNotHighHigh 278.80 (6.4%) 279.28 (6.0%) 0.2% ( -11% - 13%) 0.931
LowTerm 351.93 (4.1%) 352.53 (4.4%) 0.2% ( -7% - 9%) 0.898
OrNotHighMed 201.78 (5.3%) 202.14 (5.1%) 0.2% ( -9% - 11%) 0.913
OrHighNotHigh 196.39 (6.5%) 196.74 (6.5%) 0.2% ( -11% - 14%) 0.930
LowSloppyPhrase 4.06 (4.1%) 4.07 (4.6%) 0.2% ( -8% - 9%) 0.865
AndHighMedDayTaxoFacets 29.95 (1.5%) 30.04 (1.7%) 0.3% ( -2% - 3%) 0.577
OrHighMedDayTaxoFacets 3.47 (5.7%) 3.48 (4.3%) 0.3% ( -9% - 10%) 0.857
MedIntervalsOrdered 7.68 (6.1%) 7.71 (6.6%) 0.4% ( -11% - 13%) 0.858
MedTerm 462.78 (5.3%) 464.47 (6.6%) 0.4% ( -10% - 12%) 0.847
AndHighLow 274.17 (2.2%) 275.22 (2.5%) 0.4% ( -4% - 5%) 0.606
HighSpanNear 3.88 (3.9%) 3.90 (4.8%) 0.5% ( -7% - 9%) 0.738
BrowseDayOfYearSSDVFacets 3.60 (8.3%) 3.62 (9.9%) 0.5% ( -16% - 20%) 0.863
OrHighNotMed 286.50 (6.4%) 287.93 (6.3%) 0.5% ( -11% - 14%) 0.803
LowIntervalsOrdered 4.82 (3.9%) 4.85 (3.8%) 0.5% ( -6% - 8%) 0.678
MedSpanNear 4.81 (3.1%) 4.84 (4.0%) 0.6% ( -6% - 7%) 0.616
HighIntervalsOrdered 4.37 (4.8%) 4.40 (5.0%) 0.6% ( -8% - 10%) 0.700
MedTermDayTaxoFacets 11.79 (2.7%) 11.86 (3.1%) 0.6% ( -5% - 6%) 0.507
MedSloppyPhrase 36.65 (5.2%) 36.95 (4.6%) 0.8% ( -8% - 11%) 0.592
HighTerm 318.60 (6.5%) 322.10 (7.5%) 1.1% ( -12% - 16%) 0.621
LowPhrase 31.99 (4.0%) 32.35 (2.0%) 1.1% ( -4% - 7%) 0.255
BrowseMonthSSDVFacets 4.35 (13.5%) 4.41 (14.8%) 1.3% ( -23% - 34%) 0.771
MedPhrase 54.81 (3.5%) 55.56 (2.4%) 1.4% ( -4% - 7%) 0.147
BrowseRandomLabelSSDVFacets 2.68 (8.5%) 2.73 (8.2%) 1.6% ( -13% - 19%) 0.556
HighPhrase 3.07 (6.6%) 3.15 (4.3%) 2.5% ( -7% - 14%) 0.150
PKLookup 105.96 (1.1%) 115.03 (1.4%) 8.6% ( 5% - 11%) 0.000
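As a sanity check on the Pct diff column, here is a minimal sketch that recomputes the PKLookup figures from the two wikimediumall runs above. It assumes the column is the plain relative change between the two mean QPS values; luceneutil's own computation is not shown in this thread, and the class and method names are hypothetical.

```java
import java.util.Locale;

final class PctDiffSketch {
    // Relative change of the candidate QPS versus the baseline QPS, in percent.
    // Assumption: this is how the "Pct diff" column is derived.
    static double pctDiff(double baselineQps, double candidateQps) {
        return 100.0 * (candidateQps - baselineQps) / baselineQps;
    }

    public static void main(String[] args) {
        // PKLookup rows copied from Run # 1 and Run # 2 above.
        System.out.println(String.format(Locale.ROOT, "Run 1: %.1f%%", pctDiff(106.69, 114.71)));
        System.out.println(String.format(Locale.ROOT, "Run 2: %.1f%%", pctDiff(105.96, 115.03)));
    }
}
```

Under that assumption the two rows come out at roughly 7.5% and 8.6%, matching the reported values.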
I also ran the benchmarks for wikibigall. Here too we see a similarly consistent ~7% improvement for PKLookup with a p-value of 0.000. Below are the benchmark results:
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
BrowseMonthSSDVFacets 23.50 (9.9%) 22.82 (10.6%) -2.9% ( -21% - 19%) 0.374
LowSloppyPhrase 7.89 (6.1%) 7.84 (7.2%) -0.7% ( -13% - 13%) 0.737
TermDTSort 169.47 (5.8%) 168.46 (5.6%) -0.6% ( -11% - 11%) 0.740
HighTermTitleBDVSort 10.66 (6.2%) 10.60 (6.2%) -0.6% ( -12% - 12%) 0.770
BrowseDayOfYearSSDVFacets 20.09 (14.5%) 19.98 (17.1%) -0.6% ( -28% - 36%) 0.910
OrHighMed 184.16 (3.2%) 183.28 (3.7%) -0.5% ( -7% - 6%) 0.660
OrHighHigh 51.72 (5.6%) 51.48 (5.8%) -0.5% ( -11% - 11%) 0.796
BrowseRandomLabelTaxoFacets 11.19 (5.1%) 11.14 (4.5%) -0.4% ( -9% - 9%) 0.772
MedSloppyPhrase 17.91 (3.5%) 17.84 (4.2%) -0.4% ( -7% - 7%) 0.752
LowPhrase 13.42 (5.3%) 13.37 (6.3%) -0.4% ( -11% - 11%) 0.848
AndHighLow 529.38 (2.2%) 527.55 (2.9%) -0.3% ( -5% - 4%) 0.674
BrowseDateSSDVFacets 4.47 (11.1%) 4.46 (10.4%) -0.3% ( -19% - 23%) 0.924
BrowseDateTaxoFacets 12.59 (4.0%) 12.55 (3.5%) -0.3% ( -7% - 7%) 0.810
OrHighMedDayTaxoFacets 10.06 (4.9%) 10.04 (4.7%) -0.2% ( -9% - 9%) 0.914
HighTermDayOfYearSort 314.20 (1.9%) 313.90 (1.7%) -0.1% ( -3% - 3%) 0.867
BrowseMonthTaxoFacets 12.34 (3.4%) 12.35 (2.8%) 0.1% ( -5% - 6%) 0.903
AndHighMed 164.82 (4.1%) 165.05 (4.3%) 0.1% ( -7% - 8%) 0.916
HighPhrase 40.51 (4.2%) 40.57 (5.6%) 0.1% ( -9% - 10%) 0.926
Prefix3 793.33 (2.5%) 794.74 (2.8%) 0.2% ( -5% - 5%) 0.833
OrNotHighLow 731.50 (2.6%) 733.25 (2.4%) 0.2% ( -4% - 5%) 0.760
OrHighLow 394.15 (4.5%) 395.24 (4.5%) 0.3% ( -8% - 9%) 0.845
MedTermDayTaxoFacets 27.09 (5.4%) 27.17 (4.6%) 0.3% ( -9% - 10%) 0.857
Respell 37.26 (1.9%) 37.40 (1.9%) 0.4% ( -3% - 4%) 0.540
MedPhrase 68.95 (3.2%) 69.21 (4.5%) 0.4% ( -7% - 8%) 0.764
HighTermTitleSort 103.69 (2.8%) 104.13 (3.2%) 0.4% ( -5% - 6%) 0.659
HighSpanNear 2.72 (5.7%) 2.74 (3.9%) 0.5% ( -8% - 10%) 0.755
HighSloppyPhrase 7.67 (5.8%) 7.71 (6.5%) 0.5% ( -11% - 13%) 0.801
Wildcard 18.57 (2.9%) 18.66 (2.8%) 0.5% ( -5% - 6%) 0.581
Fuzzy1 73.74 (1.2%) 74.11 (1.4%) 0.5% ( -2% - 3%) 0.231
AndHighHighDayTaxoFacets 15.25 (2.1%) 15.32 (2.8%) 0.5% ( -4% - 5%) 0.522
AndHighMedDayTaxoFacets 20.82 (2.1%) 20.93 (2.8%) 0.5% ( -4% - 5%) 0.524
AndHighHigh 22.11 (5.7%) 22.23 (5.4%) 0.5% ( -10% - 12%) 0.759
HighTerm 255.66 (5.9%) 257.09 (4.9%) 0.6% ( -9% - 12%) 0.744
Fuzzy2 62.45 (1.1%) 62.81 (1.4%) 0.6% ( -1% - 3%) 0.149
BrowseRandomLabelSSDVFacets 15.32 (6.9%) 15.42 (10.6%) 0.7% ( -15% - 19%) 0.816
MedTerm 286.47 (6.0%) 288.39 (4.6%) 0.7% ( -9% - 11%) 0.689
IntNRQ 119.53 (3.3%) 120.36 (5.0%) 0.7% ( -7% - 9%) 0.600
LowTerm 461.65 (5.4%) 464.97 (3.9%) 0.7% ( -8% - 10%) 0.631
HighTermMonthSort 2546.87 (3.3%) 2566.25 (2.4%) 0.8% ( -4% - 6%) 0.403
LowIntervalsOrdered 18.69 (3.6%) 18.84 (3.4%) 0.8% ( -5% - 8%) 0.468
BrowseDayOfYearTaxoFacets 13.53 (6.0%) 13.64 (6.3%) 0.8% ( -10% - 13%) 0.669
OrNotHighMed 114.22 (3.0%) 115.51 (2.5%) 1.1% ( -4% - 6%) 0.195
OrHighNotMed 278.93 (4.7%) 282.15 (4.8%) 1.2% ( -7% - 11%) 0.443
OrNotHighHigh 81.05 (5.1%) 82.09 (4.5%) 1.3% ( -7% - 11%) 0.402
OrHighNotHigh 151.36 (5.3%) 153.39 (4.7%) 1.3% ( -8% - 12%) 0.398
MedIntervalsOrdered 17.52 (5.0%) 17.78 (4.4%) 1.4% ( -7% - 11%) 0.333
HighIntervalsOrdered 3.42 (4.7%) 3.48 (4.4%) 1.8% ( -7% - 11%) 0.223
OrHighNotLow 231.81 (5.6%) 236.03 (5.4%) 1.8% ( -8% - 13%) 0.297
MedSpanNear 11.01 (8.1%) 11.23 (5.5%) 2.0% ( -10% - 16%) 0.367
LowSpanNear 4.54 (11.5%) 4.67 (7.3%) 3.0% ( -14% - 24%) 0.326
PKLookup 122.20 (1.3%) 130.59 (2.0%) 6.9% ( 3% - 10%) 0.000
Do you mean to change StringHelper class to add support for 128 bit hash because currently it creates 32-bit hash with Murmur 3?
Sorry, I had overlooked that StringHelper only used 32 bits for its hash. If it's not a good fit, I'm good with hardcoding murmur3 in this postings format.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
@shubhamvishu -- I think the PKLookup gains here are compelling, and there is consensus to make this improvement only to the BloomFilterPostingsFormat? But understanding why supposedly equivalent expressions yield such a different benchmark result remains ...
I think the PKLookup gains here are compelling, and there is consensus to make this improvement only to the BloomFilterPostingsFormat?
Sure @mikemccand, I'll update this PR to address the comments Adrien had earlier. I'll also post new results shortly (this week), as it's been some time since we last ran the benchmarks.
But understanding why supposedly equivalent expressions yield such a different benchmark result remains ...
For this I think we can open another issue to understand why we see different results.
But understanding why supposedly equivalent expressions yield such a different benchmark result remains ...
The expression ((int) A) >>> 1 + ((int) B) >>> 1 is equivalent to (((int) A) >>> (1 + ((int) B))) >>> 1 once operator precedence is taken into account. This seems to make the hash values sometimes bigger and sometimes smaller (I manually tried a couple of them), whereas the simple ((A/2) + (B/2)) does not achieve that and produces large hashes every time. I think this small tweak generates hashes that are spread more uniformly across the range, which probably leads to fewer collisions. Another supporting point: we should have seen a regression if this were generating garbage or near-identical hashes, but we consistently saw an improvement on both wikimediumall and wikibigall earlier. I'll share the latest benchmarks as well when they are completed. Any thoughts? @mikemccand @jpountz
But understanding why supposedly equivalent expressions yield such a different benchmark result remains ...
The expression ((int) A) >>> 1 + ((int) B) >>> 1 is equivalent to (((int) A) >>> (1 + ((int) B))) >>> 1 when we see the operator precedence
OK I see. Yes indeed the expression is buggy (does not match murmur3 hash function) without the added ( ... ) thanks to java's operator precedence.
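To make the precedence point concrete, here is a minimal, self-contained sketch. In Java, + binds tighter than >>>, and >>> is left-associative, so the unparenthesized form groups differently from "halve each operand, then add". The class and method names are hypothetical, and plain int arguments stand in for the (int) A and (int) B casts discussed above.

```java
final class ShiftPrecedenceDemo {
    // Written without extra parentheses, as in the expression under discussion.
    static int unparenthesized(int a, int b) {
        return a >>> 1 + b >>> 1;
    }

    // The grouping the compiler applies to the expression above, spelled out.
    static int asParsed(int a, int b) {
        return (a >>> (1 + b)) >>> 1;
    }

    // Halve each operand first, then add (the form with the added parentheses).
    static int parenthesized(int a, int b) {
        return (a >>> 1) + (b >>> 1);
    }

    public static void main(String[] args) {
        int a = 1000, b = 2;
        System.out.println(unparenthesized(a, b)); // (1000 >>> 3) >>> 1 = 62
        System.out.println(asParsed(a, b));        // 62, same grouping
        System.out.println(parenthesized(a, b));   // 500 + 1 = 501
    }
}
```

Note that for int operands the shift distance is taken modulo 32, so in the unparenthesized form the value of b mostly acts as a variable shift amount applied to a.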
It would indeed be surprising if you accidentally discovered a better hashing function than murmur3 with this bug. It is curious that you see consistently better PKLookup performance ... do you see any gains vs main with the bug fixed (added parens)?
Could we maybe add a random test to assert that your impl in BloomPostingsFormat matches StringUtil's murmur3?
do you see any gains vs main with the bug fixed (added parens)?
I ran the luceneutil benchmarks with this PR again, and also with the expression fixed using "()". It required a few changes in luceneutil to enable BloomFilteringPostingsFormat for the id field (they are in the commit below).
- Required luceneutil changes - https://github.com/shubhamvishu/luceneutil/commit/f675c6611a678f932360a89883b22f9ba9b75a8b
Here are the latest results that I'm seeing with this PR:
1. Run # 1 : Result when using new expression i.e. this PR
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
OrHighNotMed 151.52 (3.2%) 105.32 (2.6%) -30.5% ( -35% - -25%) 0.000
MedIntervalsOrdered 35.15 (8.4%) 30.00 (7.7%) -14.7% ( -28% - 1%) 0.000
HighPhrase 43.82 (4.9%) 37.83 (3.4%) -13.7% ( -20% - -5%) 0.000
MedPhrase 36.18 (11.7%) 31.64 (4.6%) -12.5% ( -25% - 4%) 0.000
HighIntervalsOrdered 12.19 (8.6%) 10.79 (6.4%) -11.5% ( -24% - 3%) 0.000
BrowseDayOfYearSSDVFacets 3.49 (13.7%) 3.11 (12.2%) -10.8% ( -32% - 17%) 0.008
HighTermMonthSort 1160.29 (4.2%) 1064.20 (2.8%) -8.3% ( -14% - -1%) 0.000
BrowseMonthSSDVFacets 3.49 (15.0%) 3.27 (18.0%) -6.5% ( -34% - 31%) 0.216
AndHighHigh 90.31 (5.8%) 85.38 (5.4%) -5.5% ( -15% - 6%) 0.002
BrowseRandomLabelSSDVFacets 2.17 (5.1%) 2.06 (7.9%) -5.4% ( -17% - 7%) 0.010
MedTermDayTaxoFacets 8.82 (6.5%) 8.35 (5.6%) -5.4% ( -16% - 7%) 0.005
IntNRQ 60.02 (9.2%) 57.07 (13.7%) -4.9% ( -25% - 19%) 0.184
HighTermDayOfYearSort 75.06 (2.8%) 71.92 (4.4%) -4.2% ( -11% - 3%) 0.000
MedSpanNear 22.04 (5.1%) 21.25 (4.2%) -3.6% ( -12% - 6%) 0.015
Prefix3 417.51 (3.3%) 402.97 (3.5%) -3.5% ( -9% - 3%) 0.001
Fuzzy1 66.49 (3.1%) 64.58 (2.8%) -2.9% ( -8% - 3%) 0.002
Fuzzy2 60.77 (3.9%) 59.11 (5.2%) -2.7% ( -11% - 6%) 0.061
AndHighLow 750.92 (3.5%) 738.33 (3.8%) -1.7% ( -8% - 5%) 0.150
OrHighMedDayTaxoFacets 3.19 (6.3%) 3.14 (6.8%) -1.5% ( -13% - 12%) 0.481
AndHighMedDayTaxoFacets 7.77 (3.7%) 7.68 (4.6%) -1.2% ( -9% - 7%) 0.368
Respell 42.71 (4.1%) 42.45 (4.5%) -0.6% ( -8% - 8%) 0.659
AndHighHighDayTaxoFacets 2.68 (4.3%) 2.66 (4.1%) -0.6% ( -8% - 8%) 0.673
HighSloppyPhrase 18.32 (4.5%) 18.26 (6.0%) -0.3% ( -10% - 10%) 0.851
OrNotHighLow 947.96 (4.4%) 946.42 (2.7%) -0.2% ( -7% - 7%) 0.889
LowSloppyPhrase 24.72 (4.3%) 24.80 (5.8%) 0.3% ( -9% - 10%) 0.834
OrHighHigh 63.44 (8.1%) 64.04 (6.8%) 0.9% ( -12% - 17%) 0.694
LowIntervalsOrdered 47.35 (6.2%) 47.85 (7.0%) 1.1% ( -11% - 15%) 0.608
AndHighMed 175.58 (3.1%) 177.59 (3.6%) 1.1% ( -5% - 8%) 0.286
OrHighNotHigh 57.34 (2.5%) 58.02 (2.3%) 1.2% ( -3% - 6%) 0.125
LowSpanNear 54.84 (3.3%) 55.69 (4.3%) 1.6% ( -5% - 9%) 0.201
OrHighLow 366.06 (3.8%) 372.02 (2.7%) 1.6% ( -4% - 8%) 0.119
MedSloppyPhrase 24.57 (4.0%) 24.99 (5.0%) 1.7% ( -6% - 11%) 0.224
Wildcard 161.09 (2.6%) 165.71 (3.1%) 2.9% ( -2% - 8%) 0.002
OrHighMed 116.69 (6.4%) 120.28 (4.6%) 3.1% ( -7% - 15%) 0.082
HighTermTitleBDVSort 7.68 (2.7%) 7.92 (4.0%) 3.1% ( -3% - 10%) 0.004
LowPhrase 21.75 (6.3%) 22.62 (7.7%) 4.0% ( -9% - 19%) 0.072
LowTerm 197.26 (2.9%) 210.16 (2.8%) 6.5% ( 0% - 12%) 0.000
HighTermTitleSort 14.21 (4.2%) 15.16 (3.7%) 6.7% ( -1% - 15%) 0.000
HighSpanNear 4.45 (3.5%) 4.80 (2.6%) 8.0% ( 1% - 14%) 0.000
HighTerm 111.71 (3.9%) 121.04 (2.8%) 8.3% ( 1% - 15%) 0.000
OrNotHighMed 134.75 (3.2%) 146.33 (2.7%) 8.6% ( 2% - 14%) 0.000
PKLookup 110.99 (4.4%) 124.24 (6.0%) 11.9% ( 1% - 23%) 0.000
OrNotHighHigh 76.76 (3.0%) 91.73 (4.0%) 19.5% ( 12% - 27%) 0.000
BrowseDateSSDVFacets 0.57 (13.9%) 0.72 (19.1%) 25.6% ( -6% - 68%) 0.000
BrowseRandomLabelTaxoFacets 1.98 (6.4%) 2.77 (36.1%) 39.9% ( -2% - 88%) 0.000
MedTerm 101.03 (4.3%) 145.90 (5.6%) 44.4% ( 33% - 56%) 0.000
BrowseDateTaxoFacets 2.50 (9.8%) 3.63 (41.4%) 45.2% ( -5% - 106%) 0.000
BrowseDayOfYearTaxoFacets 2.51 (9.3%) 3.70 (42.2%) 47.8% ( -3% - 109%) 0.000
OrHighNotLow 100.87 (3.5%) 189.07 (7.6%) 87.4% ( 73% - 102%) 0.000
TermDTSort 22.89 (2.9%) 54.10 (15.4%) 136.4% ( 114% - 159%) 0.000
BrowseMonthTaxoFacets 2.41 (10.9%) 6.66 (93.8%) 176.2% ( 64% - 315%) 0.000
2. Run # 2 : Result when using new expression i.e. this PR
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
LowSpanNear 76.20 (2.5%) 67.29 (5.3%) -11.7% ( -19% - -3%) 0.000
LowIntervalsOrdered 51.58 (5.5%) 47.22 (6.0%) -8.5% ( -18% - 3%) 0.000
BrowseMonthSSDVFacets 3.49 (20.6%) 3.20 (22.0%) -8.1% ( -42% - 43%) 0.232
BrowseDayOfYearSSDVFacets 3.17 (8.9%) 2.94 (13.4%) -7.2% ( -27% - 16%) 0.045
BrowseRandomLabelSSDVFacets 2.18 (12.8%) 2.03 (11.1%) -7.0% ( -27% - 19%) 0.065
AndHighLow 738.26 (4.8%) 687.21 (5.3%) -6.9% ( -16% - 3%) 0.000
LowSloppyPhrase 91.80 (4.0%) 85.81 (4.9%) -6.5% ( -14% - 2%) 0.000
OrHighHigh 72.55 (9.6%) 68.80 (8.5%) -5.2% ( -21% - 14%) 0.073
MedTermDayTaxoFacets 6.88 (6.5%) 6.57 (8.1%) -4.6% ( -17% - 10%) 0.050
MedPhrase 73.44 (2.7%) 70.35 (5.9%) -4.2% ( -12% - 4%) 0.004
Respell 26.61 (4.0%) 25.53 (5.6%) -4.1% ( -13% - 5%) 0.008
LowTerm 202.01 (3.5%) 193.99 (2.6%) -4.0% ( -9% - 2%) 0.000
MedSloppyPhrase 35.32 (5.4%) 33.94 (7.1%) -3.9% ( -15% - 9%) 0.050
Prefix3 406.66 (6.5%) 392.60 (5.2%) -3.5% ( -14% - 8%) 0.063
AndHighMedDayTaxoFacets 19.45 (4.9%) 18.92 (5.3%) -2.8% ( -12% - 7%) 0.087
AndHighHighDayTaxoFacets 2.77 (5.5%) 2.69 (6.7%) -2.8% ( -14% - 10%) 0.156
OrHighMedDayTaxoFacets 3.34 (5.0%) 3.25 (7.6%) -2.7% ( -14% - 10%) 0.184
Wildcard 768.29 (3.6%) 751.35 (3.9%) -2.2% ( -9% - 5%) 0.060
Fuzzy2 44.04 (5.4%) 43.09 (5.2%) -2.1% ( -12% - 8%) 0.201
OrNotHighLow 853.76 (3.0%) 835.98 (2.6%) -2.1% ( -7% - 3%) 0.019
AndHighHigh 85.25 (7.5%) 83.76 (6.2%) -1.7% ( -14% - 12%) 0.422
HighTermMonthSort 1068.18 (3.6%) 1050.07 (3.9%) -1.7% ( -8% - 6%) 0.153
HighSloppyPhrase 5.96 (3.8%) 5.94 (3.6%) -0.4% ( -7% - 7%) 0.737
HighTermTitleBDVSort 7.39 (3.9%) 7.36 (2.9%) -0.3% ( -6% - 6%) 0.785
HighTerm 83.27 (3.3%) 83.02 (3.5%) -0.3% ( -6% - 6%) 0.781
LowPhrase 92.82 (3.7%) 92.58 (2.1%) -0.3% ( -5% - 5%) 0.791
OrHighLow 266.30 (2.0%) 268.48 (2.3%) 0.8% ( -3% - 5%) 0.229
HighIntervalsOrdered 5.48 (5.7%) 5.53 (5.5%) 0.9% ( -9% - 12%) 0.621
MedIntervalsOrdered 26.98 (5.8%) 27.26 (6.6%) 1.0% ( -10% - 14%) 0.597
MedSpanNear 4.02 (2.6%) 4.09 (3.5%) 1.6% ( -4% - 7%) 0.104
OrHighMed 65.75 (7.5%) 66.80 (6.5%) 1.6% ( -11% - 16%) 0.472
HighPhrase 120.64 (4.8%) 122.75 (4.2%) 1.8% ( -6% - 11%) 0.222
HighSpanNear 11.14 (3.6%) 11.34 (2.7%) 1.8% ( -4% - 8%) 0.073
OrHighNotHigh 83.12 (2.8%) 84.65 (2.3%) 1.8% ( -3% - 7%) 0.022
OrNotHighMed 147.67 (2.2%) 150.48 (2.6%) 1.9% ( -2% - 6%) 0.011
AndHighMed 168.61 (3.1%) 173.38 (3.4%) 2.8% ( -3% - 9%) 0.006
Fuzzy1 75.52 (7.4%) 77.77 (6.1%) 3.0% ( -9% - 17%) 0.165
OrHighNotMed 55.39 (3.3%) 60.07 (3.7%) 8.4% ( 1% - 16%) 0.000
PKLookup 108.78 (5.8%) 119.09 (6.0%) 9.5% ( -2% - 22%) 0.000
OrNotHighHigh 208.07 (3.5%) 235.47 (3.4%) 13.2% ( 6% - 20%) 0.000
HighTermTitleSort 7.70 (3.9%) 9.25 (3.4%) 20.1% ( 12% - 28%) 0.000
HighTermDayOfYearSort 61.64 (2.7%) 75.16 (4.6%) 21.9% ( 14% - 30%) 0.000
BrowseDateSSDVFacets 0.56 (15.7%) 0.69 (18.9%) 22.9% ( -10% - 68%) 0.000
BrowseDateTaxoFacets 2.30 (4.9%) 3.14 (8.5%) 37.0% ( 22% - 52%) 0.000
BrowseRandomLabelTaxoFacets 1.85 (3.5%) 2.55 (5.6%) 38.0% ( 27% - 48%) 0.000
BrowseDayOfYearTaxoFacets 2.28 (4.7%) 3.25 (6.7%) 42.5% ( 29% - 56%) 0.000
OrHighNotLow 165.24 (2.9%) 248.52 (4.7%) 50.4% ( 41% - 59%) 0.000
MedTerm 92.99 (2.9%) 144.38 (5.5%) 55.3% ( 45% - 65%) 0.000
IntNRQ 31.39 (29.9%) 50.82 (9.1%) 61.9% ( 17% - 144%) 0.000
BrowseMonthTaxoFacets 2.36 (11.1%) 6.79 (93.6%) 188.0% ( 75% - 329%) 0.000
TermDTSort 12.60 (4.2%) 68.61 (72.4%) 444.5% ( 352% - 544%) 0.000
3. Run # 3 : Result when NOT using new expression
Task  QPS baseline  StdDev  QPS my_modified_version  StdDev  Pct diff  p-value
HighTerm 171.08 (3.0%) 101.57 (2.5%) -40.6% ( -44% - -36%) 0.000
HighTermTitleSort 15.14 (3.9%) 12.25 (3.1%) -19.1% ( -25% - -12%) 0.000
LowSpanNear 170.82 (9.6%) 144.83 (13.7%) -15.2% ( -35% - 8%) 0.000
MedPhrase 148.79 (4.2%) 130.79 (3.5%) -12.1% ( -19% - -4%) 0.000
HighSloppyPhrase 31.34 (7.7%) 27.66 (7.1%) -11.7% ( -24% - 3%) 0.000
LowSloppyPhrase 57.48 (8.4%) 51.22 (7.6%) -10.9% ( -24% - 5%) 0.000
MedIntervalsOrdered 48.65 (7.5%) 43.49 (5.1%) -10.6% ( -21% - 2%) 0.000
BrowseMonthSSDVFacets 3.61 (16.0%) 3.33 (14.5%) -7.7% ( -32% - 27%) 0.108
LowPhrase 21.09 (6.6%) 19.57 (4.9%) -7.2% ( -17% - 4%) 0.000
BrowseRandomLabelSSDVFacets 2.21 (4.3%) 2.08 (5.1%) -5.8% ( -14% - 3%) 0.000
BrowseDayOfYearSSDVFacets 3.38 (11.7%) 3.19 (14.9%) -5.4% ( -28% - 24%) 0.203
HighIntervalsOrdered 6.48 (5.8%) 6.16 (5.9%) -5.0% ( -15% - 7%) 0.007
OrNotHighLow 1349.40 (3.9%) 1284.48 (4.1%) -4.8% ( -12% - 3%) 0.000
HighSpanNear 7.79 (2.6%) 7.42 (3.5%) -4.8% ( -10% - 1%) 0.000
PKLookup 113.95 (3.7%) 108.74 (4.6%) -4.6% ( -12% - 3%) 0.001
AndHighMedDayTaxoFacets 44.93 (3.5%) 43.23 (6.7%) -3.8% ( -13% - 6%) 0.025
Respell 35.71 (3.2%) 34.37 (5.6%) -3.7% ( -12% - 5%) 0.009
AndHighMed 96.91 (6.8%) 93.83 (3.9%) -3.2% ( -13% - 8%) 0.070
OrHighHigh 65.98 (8.2%) 64.12 (12.0%) -2.8% ( -21% - 18%) 0.387
AndHighLow 1356.33 (3.6%) 1319.07 (5.2%) -2.7% ( -11% - 6%) 0.052
AndHighHigh 66.89 (7.6%) 65.17 (6.9%) -2.6% ( -15% - 12%) 0.263
OrHighMedDayTaxoFacets 2.70 (7.1%) 2.63 (5.6%) -2.5% ( -14% - 11%) 0.225
HighPhrase 80.60 (4.9%) 79.11 (4.6%) -1.8% ( -10% - 8%) 0.221
AndHighHighDayTaxoFacets 10.10 (4.0%) 9.91 (4.4%) -1.8% ( -9% - 6%) 0.175
Wildcard 139.44 (2.7%) 137.27 (3.0%) -1.6% ( -7% - 4%) 0.083
HighTermTitleBDVSort 8.90 (4.0%) 8.76 (3.2%) -1.6% ( -8% - 5%) 0.174
MedTermDayTaxoFacets 16.02 (4.7%) 15.78 (3.2%) -1.5% ( -9% - 6%) 0.225
Fuzzy1 58.33 (3.4%) 57.52 (4.5%) -1.4% ( -8% - 6%) 0.274
Prefix3 495.32 (2.3%) 489.23 (1.9%) -1.2% ( -5% - 3%) 0.063
MedSloppyPhrase 36.19 (4.0%) 35.89 (2.5%) -0.8% ( -7% - 5%) 0.428
HighTermDayOfYearSort 60.62 (4.4%) 61.15 (4.2%) 0.9% ( -7% - 9%) 0.520
MedSpanNear 16.26 (4.8%) 16.55 (6.5%) 1.7% ( -9% - 13%) 0.341
LowIntervalsOrdered 57.41 (5.4%) 58.49 (4.2%) 1.9% ( -7% - 12%) 0.224
OrHighMed 130.42 (5.2%) 133.04 (5.5%) 2.0% ( -8% - 13%) 0.238
IntNRQ 70.41 (3.8%) 71.85 (5.2%) 2.0% ( -6% - 11%) 0.153
HighTermMonthSort 1101.95 (3.8%) 1129.04 (3.2%) 2.5% ( -4% - 9%) 0.025
OrHighNotMed 56.97 (3.3%) 58.71 (3.1%) 3.1% ( -3% - 9%) 0.003
Fuzzy2 49.17 (3.5%) 51.36 (4.3%) 4.5% ( -3% - 12%) 0.000
OrHighLow 290.33 (3.0%) 303.84 (2.3%) 4.7% ( 0% - 10%) 0.000
OrHighNotHigh 88.66 (3.7%) 93.58 (2.9%) 5.6% ( 0% - 12%) 0.000
LowTerm 209.03 (2.4%) 228.72 (2.9%) 9.4% ( 4% - 15%) 0.000
MedTerm 73.56 (3.1%) 86.92 (3.7%) 18.2% ( 10% - 25%) 0.000
OrNotHighMed 218.55 (2.1%) 267.79 (3.2%) 22.5% ( 16% - 28%) 0.000
OrNotHighHigh 45.70 (2.9%) 57.84 (4.7%) 26.6% ( 18% - 35%) 0.000
BrowseDateSSDVFacets 0.60 (13.1%) 0.77 (17.4%) 29.2% ( -1% - 68%) 0.000
OrHighNotLow 163.54 (2.5%) 214.57 (2.9%) 31.2% ( 25% - 37%) 0.000
TermDTSort 29.64 (4.0%) 39.09 (6.2%) 31.9% ( 20% - 43%) 0.000
BrowseRandomLabelTaxoFacets 1.92 (2.5%) 2.88 (46.6%) 49.7% ( 0% - 101%) 0.000
BrowseDateTaxoFacets 2.39 (6.1%) 3.79 (63.8%) 59.0% ( -10% - 137%) 0.000
BrowseDayOfYearTaxoFacets 2.37 (5.4%) 3.85 (67.2%) 62.4% ( -9% - 142%) 0.000
BrowseMonthTaxoFacets 2.49 (3.7%) 7.08 (93.6%) 183.8% ( 83% - 291%) 0.000
I'm seeing some crazy speedups for some tasks in the benchmarks (including PKLookup; a few got a little slower) when using the new expression. Looking for your thoughts on this: are my luceneutil changes right, and could we expect this PR to affect tasks other than PKLookup, as above? The reported gains are so high that I'm a bit suspicious of them. I'd need an extra pair of eyes here.
@mikemccand @jpountz Any thoughts ?
do you see any gains vs main with the bug fixed (added parens)?
When not using the new expression, i.e. using "()" as we normally expect (code changes below), I still see high speedups for some tasks in the benchmarks (earlier comment), but this time PKLookup regresses ~4-5% with this change.
@@ -151,8 +151,8 @@ public class FuzzySet implements Accountable {
public ContainsResult contains(BytesRef value) {
long[] hash = StringHelper.murmurhash3_x64_128(value);
- int msb = ((int) hash[0] >>> Integer.SIZE) >>> 1 + ((int) hash[1] >>> Integer.SIZE) >>> 1;
- int lsb = ((int) hash[0]) >>> 1 + ((int) hash[1]) >>> 1;
+ int msb = (((int) hash[0] >>> Integer.SIZE) >>> 1) + (((int) hash[1] >>> Integer.SIZE) >>> 1);
+ int lsb = (((int) hash[0]) >>> 1) + (((int) hash[1]) >>> 1);
for (int i = 0; i < hashCount; i++) {
int bloomPos = (lsb + i * msb);
if (!mayContainValue(bloomPos)) {
@@ -219,8 +219,8 @@ public class FuzzySet implements Accountable {
*/
public void addValue(BytesRef value) {
long[] hash = StringHelper.murmurhash3_x64_128(value);
- int msb = ((int) hash[0] >>> Integer.SIZE) >>> 1 + ((int) hash[1] >>> Integer.SIZE) >>> 1;
- int lsb = ((int) hash[0]) >>> 1 + ((int) hash[1]) >>> 1;
+ int msb = (((int) hash[0] >>> Integer.SIZE) >>> 1) + (((int) hash[1] >>> Integer.SIZE) >>> 1);
+ int lsb = (((int) hash[0]) >>> 1) + (((int) hash[1]) >>> 1);
for (int i = 0; i < hashCount; i++) {
// Bitmasking using bloomSize is effectively a modulo operation.
int bloomPos = (lsb + i * msb) & bloomSize;
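The loop in the diff above derives several probe positions from two hash values and reduces each one with a bitmask. The following is a minimal sketch of that double-hashing pattern, not the FuzzySet implementation: the class name, the BitSet backing store, and the constructor parameters are assumptions made for illustration, and the two int hashes are taken as given rather than derived from a 128-bit MurmurHash 3 value.

```java
import java.util.BitSet;

final class DoubleHashingBloomSketch {
    private final BitSet bits;
    private final int mask;      // size - 1, where size is a power of two
    private final int hashCount; // number of probe positions per value

    // Assumption: log2Size is between 1 and 30 so that the size fits in a positive int.
    DoubleHashingBloomSketch(int log2Size, int hashCount) {
        int size = 1 << log2Size;
        this.bits = new BitSet(size);
        this.mask = size - 1;
        this.hashCount = hashCount;
    }

    // i-th probe position: (h1 + i * h2) & mask. Because mask is non-negative,
    // the result is a valid index even when the int arithmetic overflows.
    private int position(int h1, int h2, int i) {
        return (h1 + i * h2) & mask;
    }

    void add(int h1, int h2) {
        for (int i = 0; i < hashCount; i++) {
            bits.set(position(h1, h2, i));
        }
    }

    // false means "definitely absent"; true means "possibly present".
    boolean mayContain(int h1, int h2) {
        for (int i = 0; i < hashCount; i++) {
            if (!bits.get(position(h1, h2, i))) {
                return false;
            }
        }
        return true;
    }
}
```

With a power-of-two size, the mask acts as a modulo, which is what the comment in the diff refers to; how well the positions spread depends entirely on the quality of the two input hashes.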
Could we maybe add a random test to assert that your impl in BloomPostingsFormat matches StringUtil's murmur3?
@mikemccand Do you mean to also change the expression in StringHelper#murmurhash3_x86_32 to the new one and assert that? Or do you mean we should use a test to assert that the new expression in FuzzySet is not altered later?
@mikemccand @jpountz Bumping this up. I don't know if there are any major concerns here? If not, it'd be awesome to include this in 10.0 as well. Looking for your thoughts. Thanks!
Thanks for the reminder @shubhamvishu!
I'm seeing some crazy speedups for some tasks in the benchmarks (including PKLookup; a few got little slower) when using the new expression.
Hmm did you post the full results somewhere?
I'm seeing some crazy speedups for some tasks in the benchmarks (including PKLookup; a few got little slower) when using the new expression.
Hmm did you post the full results somewhere?
OK sorry I see them now. I have to click on those sophisticated arrows to expand the results, heh.
@jpountz I have addressed your comments now and kept the bit mixing logic simple as proposed initially. Let me know if the change looks good now. Thanks!
@jpountz I simplified the expression now. Let me know if the change looks good? Thanks!
BTW, why is StringHelper an abstract class? Can we make it final?
I think yes, we can make it final, as it has no abstract methods, so there is no need for it to be abstract. Git blame says it was made abstract >20 years ago by Doug, so maybe it has just stayed like this since.
Thanks @shubhamvishu, I opened https://github.com/apache/lucene/pull/13928.