lucene Try using Murmurhash 3 for bloom filters

Description

We are currently using MurmurHash 2 (MurmurHash64.java) in the bloom filter implementation in Lucene, even though we also have MurmurHash 3, the latest in the MurmurHash family of hash functions, which provides better performance, a better avalanche effect, etc. It also provides a 128-bit variant, which MurmurHash 2 doesn't. This PR aims to use MurmurHash 3 with bloom filters and see if that helps. Note: we are already using MurmurHash 3 in the BytesRefHash implementation (#6666).

Next steps :

[DONE] Run luceneutil benchmarks and share the results

Dec 01 '23 16:12 shubhamvishu

Below are the luceneutil benchmark results for wikimediumall. Looks all flat and good to me.

                        Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
         LowIntervalsOrdered        6.08      (5.0%)        6.00      (6.3%)   -1.3% ( -12% -   10%) 0.463
                  TermDTSort       97.64      (4.6%)       96.36      (5.6%)   -1.3% ( -11% -    9%) 0.417
                  OrHighHigh       28.51      (5.7%)       28.28      (5.7%)   -0.8% ( -11% -   11%) 0.651
           HighTermMonthSort     2287.85      (2.9%)     2270.76      (3.3%)   -0.7% (  -6% -    5%) 0.447
        MedTermDayTaxoFacets       14.70      (4.6%)       14.60      (4.0%)   -0.7% (  -8% -    8%) 0.620
   BrowseDayOfYearTaxoFacets        3.94     (28.1%)        3.91     (28.0%)   -0.7% ( -44% -   77%) 0.941
                     Respell       54.80      (1.8%)       54.45      (2.2%)   -0.6% (  -4% -    3%) 0.307
        BrowseDateTaxoFacets        3.92     (27.5%)        3.89     (27.4%)   -0.6% ( -43% -   74%) 0.942
       BrowseMonthSSDVFacets        4.45     (10.1%)        4.43      (9.7%)   -0.6% ( -18% -   21%) 0.845
                 MedSpanNear       91.41      (2.6%)       90.86      (3.7%)   -0.6% (  -6% -    5%) 0.554
         MedIntervalsOrdered        6.61      (3.7%)        6.57      (4.4%)   -0.5% (  -8% -    7%) 0.671
        HighIntervalsOrdered        6.73      (3.8%)        6.69      (4.2%)   -0.5% (  -8% -    7%) 0.691
           HighTermTitleSort      121.44      (4.0%)      120.97      (4.6%)   -0.4% (  -8% -    8%) 0.778
                HighSpanNear        5.31      (2.6%)        5.29      (3.2%)   -0.4% (  -6% -    5%) 0.701
                 LowSpanNear       50.68      (2.6%)       50.51      (3.3%)   -0.3% (  -6% -    5%) 0.731
                   OrHighMed       46.96      (3.3%)       46.83      (3.2%)   -0.3% (  -6% -    6%) 0.788
                OrHighNotLow      235.93      (5.2%)      235.59      (6.5%)   -0.1% ( -11% -   12%) 0.937
            HighSloppyPhrase        4.70      (4.7%)        4.70      (5.6%)   -0.1% (  -9% -   10%) 0.931
       BrowseMonthTaxoFacets        4.15     (35.3%)        4.15     (35.3%)   -0.1% ( -52% -  108%) 0.991
                      Fuzzy2       54.22      (1.1%)       54.15      (1.3%)   -0.1% (  -2% -    2%) 0.745
                OrNotHighLow      517.03      (1.6%)      516.41      (1.6%)   -0.1% (  -3% -    3%) 0.816
                      Fuzzy1       38.59      (1.2%)       38.56      (1.3%)   -0.1% (  -2% -    2%) 0.808
 BrowseRandomLabelSSDVFacets        2.77      (7.5%)        2.77      (7.5%)   -0.0% ( -14% -   16%) 0.986
                 AndHighHigh       29.06      (3.2%)       29.06      (3.1%)   -0.0% (  -6% -    6%) 0.991
                  AndHighMed       55.89      (2.9%)       55.91      (2.7%)    0.0% (  -5% -    5%) 0.970
                   OrHighLow      319.86      (2.2%)      320.00      (2.0%)    0.0% (  -4% -    4%) 0.945
                      IntNRQ       21.94      (2.5%)       21.96      (3.4%)    0.1% (  -5% -    6%) 0.905
                   LowPhrase      136.01      (4.1%)      136.20      (3.8%)    0.1% (  -7% -    8%) 0.908
                OrNotHighMed      208.82      (3.6%)      209.20      (3.9%)    0.2% (  -7% -    8%) 0.879
               OrNotHighHigh      241.45      (3.8%)      241.96      (4.9%)    0.2% (  -8% -    9%) 0.880
               OrHighNotHigh      191.17      (4.6%)      191.65      (5.7%)    0.3% (  -9% -   11%) 0.878
                  AndHighLow      300.40      (3.0%)      301.26      (2.7%)    0.3% (  -5% -    6%) 0.754
        BrowseDateSSDVFacets        0.90     (12.4%)        0.91     (10.9%)    0.3% ( -20% -   26%) 0.933
 BrowseRandomLabelTaxoFacets        3.34     (24.3%)        3.36     (24.6%)    0.4% ( -39% -   65%) 0.963
                OrHighNotMed      215.89      (5.5%)      216.69      (6.3%)    0.4% ( -10% -   12%) 0.843
                    Wildcard       61.64      (2.0%)       61.91      (2.2%)    0.4% (  -3% -    4%) 0.501
        HighTermTitleBDVSort        5.43      (4.3%)        5.46      (4.5%)    0.5% (  -8% -    9%) 0.729
                    PKLookup      136.14      (1.9%)      136.97      (1.9%)    0.6% (  -3% -    4%) 0.310
       HighTermDayOfYearSort      195.65      (4.1%)      196.91      (4.6%)    0.6% (  -7% -    9%) 0.641
                     MedTerm      455.25      (5.3%)      458.30      (5.8%)    0.7% (  -9% -   12%) 0.702
             LowSloppyPhrase       30.11      (2.0%)       30.36      (1.8%)    0.8% (  -2% -    4%) 0.171
                    HighTerm      367.19      (6.1%)      370.21      (6.3%)    0.8% ( -10% -   14%) 0.675
                     LowTerm      449.59      (4.1%)      453.37      (4.8%)    0.8% (  -7% -   10%) 0.549
                  HighPhrase       20.31      (7.3%)       20.50      (6.4%)    0.9% ( -11% -   15%) 0.675
                   MedPhrase       22.76      (6.9%)       22.98      (6.1%)    1.0% ( -11% -   14%) 0.643
     AndHighMedDayTaxoFacets       47.42      (1.8%)       47.88      (2.0%)    1.0% (  -2% -    4%) 0.109
      OrHighMedDayTaxoFacets        1.83      (3.6%)        1.85      (3.9%)    1.0% (  -6% -    8%) 0.390
    AndHighHighDayTaxoFacets       14.08      (2.4%)       14.27      (2.6%)    1.3% (  -3% -    6%) 0.084
             MedSloppyPhrase        2.77      (4.0%)        2.81      (4.1%)    1.4% (  -6% -    9%) 0.282
   BrowseDayOfYearSSDVFacets        3.72      (8.6%)        3.78     (11.9%)    1.8% ( -17% -   24%) 0.576
                     Prefix3       70.45      (8.3%)       72.84      (4.0%)    3.4% (  -8% -   17%) 0.099

Dec 01 '23 17:12 shubhamvishu

@mikemccand rightly pointed out that luceneutil doesn't use the bloom filter postings format by default, and that we should enable it for the id field and rerun the benchmarks to see the impact. Will rerun and share the results here.

Dec 02 '23 05:12 shubhamvishu

Maybe we should take advantage of this change to simplify this postings format, by no longer making the hash function configurable, removing the abstraction for hash functions, and cutting over to StringHelper to compute hashes instead of introducing a new implementation?

Dec 02 '23 12:12 jpountz

Maybe we should take advantage of this change to simplify this postings format, by no longer making the hash function configurable, removing the abstraction for hash functions, and cutting over to StringHelper to compute hashes instead of introducing a new implementation?

@jpountz Makes sense, +1 to remove the abstraction. Though I have a couple of points/questions here:

Do you mean to change StringHelper class to add support for 128 bit hash because currently it creates 32-bit hash with Murmur 3? or maybe moving BytesRefHash to also use 128-bit hash?
StringHelper doesn't seem like the most intuitive place for a hash function implementation like this. Do you think we should instead have or copy to something like HashHelper or HashUtil?

Dec 04 '23 08:12 shubhamvishu

So I ran the luceneutil benchmarks with -idFieldPostingsFormat BloomFilter, but it was failing because there was no delegate postings format and it wasn't able to find the right postings format class using SPI. I tweaked this code (pasted below) to use the BloomFilteringPostingsFormat for the id field and also to use the codecs jar (similar to how it's done for core), and then it all worked.

public PostingsFormat getPostingsFormatForField(String field) {
  PostingsFormat pf = PostingsFormat.forName(defaultPostingsFormat);
  // Wrap the default postings format in a bloom filter for the id field only.
  if (field.equals("id")) {
    return new BloomFilteringPostingsFormat(pf);
  }
  return pf;
}

Below are the wikimediumall benchmark results (run twice for more confidence), which show a ~7-9% performance improvement for PKLookup with a p-value of 0.000.

Run # 1

                        Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
 BrowseRandomLabelSSDVFacets        2.77      (7.0%)        2.72      (5.1%)   -1.7% ( -12% -   11%) 0.374
 BrowseRandomLabelTaxoFacets        3.35     (21.8%)        3.30     (18.1%)   -1.6% ( -34% -   48%) 0.803
             LowSloppyPhrase        6.11      (2.9%)        6.05      (2.5%)   -1.0% (  -6% -    4%) 0.248
           HighTermTitleSort      128.04      (3.7%)      126.78      (3.1%)   -1.0% (  -7% -    6%) 0.365
            HighSloppyPhrase       13.43      (2.7%)       13.31      (2.6%)   -0.9% (  -6% -    4%) 0.254
             MedSloppyPhrase        4.72      (3.4%)        4.68      (2.6%)   -0.8% (  -6% -    5%) 0.393
      OrHighMedDayTaxoFacets        2.99      (5.2%)        2.97      (3.5%)   -0.8% (  -9% -    8%) 0.582
        BrowseDateTaxoFacets        3.85     (19.0%)        3.82     (17.7%)   -0.8% ( -31% -   44%) 0.896
   BrowseDayOfYearTaxoFacets        3.86     (19.0%)        3.83     (17.9%)   -0.7% ( -31% -   44%) 0.900
                   OrHighMed       72.86      (2.0%)       72.58      (2.4%)   -0.4% (  -4% -    4%) 0.586
                  OrHighHigh       20.04      (3.1%)       19.98      (4.3%)   -0.3% (  -7% -    7%) 0.801
                  HighPhrase       24.66      (6.2%)       24.59      (6.6%)   -0.3% ( -12% -   13%) 0.882
                     Prefix3       72.76      (5.0%)       72.59      (4.3%)   -0.2% (  -9% -    9%) 0.873
                     MedTerm      379.26      (3.1%)      378.51      (5.1%)   -0.2% (  -8% -    8%) 0.882
                   MedPhrase       18.99      (5.5%)       18.95      (5.9%)   -0.2% ( -10% -   11%) 0.915
        BrowseDateSSDVFacets        0.90      (7.7%)        0.89      (7.4%)   -0.2% ( -14% -   16%) 0.944
                   LowPhrase       63.34      (2.4%)       63.27      (2.8%)   -0.1% (  -5% -    5%) 0.887
                HighSpanNear        6.18      (3.1%)        6.18      (3.2%)    0.0% (  -6% -    6%) 0.993
                      Fuzzy1       64.49      (1.1%)       64.52      (1.4%)    0.0% (  -2% -    2%) 0.919
                 LowSpanNear       15.60      (2.3%)       15.61      (2.5%)    0.1% (  -4% -    5%) 0.916
        HighTermTitleBDVSort        4.90      (4.2%)        4.90      (3.8%)    0.1% (  -7% -    8%) 0.938
        MedTermDayTaxoFacets        9.26      (5.6%)        9.27      (4.1%)    0.1% (  -9% -   10%) 0.948
     AndHighMedDayTaxoFacets       13.99      (1.7%)       14.02      (1.5%)    0.2% (  -3% -    3%) 0.764
    AndHighHighDayTaxoFacets        5.26      (2.7%)        5.27      (2.4%)    0.2% (  -4% -    5%) 0.819
                     Respell       29.96      (1.1%)       30.03      (1.8%)    0.2% (  -2% -    3%) 0.656
                     LowTerm      398.65      (2.5%)      399.67      (2.9%)    0.3% (  -5% -    5%) 0.765
                 MedSpanNear       38.84      (2.6%)       38.96      (3.0%)    0.3% (  -5% -    6%) 0.729
                    Wildcard       60.70      (1.6%)       60.89      (1.1%)    0.3% (  -2% -    3%) 0.463
       HighTermDayOfYearSort      200.04      (2.5%)      200.70      (3.7%)    0.3% (  -5% -    6%) 0.740
                   OrHighLow      252.74      (2.0%)      253.62      (2.4%)    0.3% (  -3% -    4%) 0.613
               OrNotHighHigh      157.04      (4.6%)      157.63      (4.2%)    0.4% (  -8% -    9%) 0.789
                 AndHighHigh       29.31      (2.4%)       29.42      (3.5%)    0.4% (  -5% -    6%) 0.678
                OrNotHighLow      290.56      (2.1%)      291.81      (1.7%)    0.4% (  -3% -    4%) 0.475
                OrNotHighMed      221.77      (3.5%)      222.84      (2.9%)    0.5% (  -5% -    7%) 0.633
               OrHighNotHigh      167.01      (4.8%)      167.85      (4.5%)    0.5% (  -8% -   10%) 0.731
                    HighTerm      279.21      (4.3%)      280.66      (6.5%)    0.5% (  -9% -   11%) 0.767
                  AndHighLow      374.84      (1.9%)      377.10      (1.8%)    0.6% (  -3% -    4%) 0.308
           HighTermMonthSort     2378.06      (3.6%)     2392.49      (4.1%)    0.6% (  -6% -    8%) 0.618
         LowIntervalsOrdered       12.78      (2.4%)       12.86      (2.8%)    0.6% (  -4% -    6%) 0.443
                OrHighNotLow      269.59      (4.8%)      271.34      (4.9%)    0.6% (  -8% -   10%) 0.672
                      IntNRQ       18.20      (5.9%)       18.32      (5.4%)    0.7% ( -10% -   12%) 0.709
                  AndHighMed       37.67      (2.5%)       37.93      (3.5%)    0.7% (  -5% -    6%) 0.459
         MedIntervalsOrdered        1.80      (3.5%)        1.82      (3.7%)    0.8% (  -6% -    8%) 0.456
                OrHighNotMed      248.23      (4.7%)      250.34      (4.5%)    0.9% (  -7% -   10%) 0.556
                      Fuzzy2       35.10      (1.2%)       35.42      (1.2%)    0.9% (  -1% -    3%) 0.016
       BrowseMonthTaxoFacets        4.13     (30.7%)        4.17     (34.8%)    1.0% ( -49% -   96%) 0.924
       BrowseMonthSSDVFacets        4.37      (9.6%)        4.45      (8.9%)    1.9% ( -15% -   22%) 0.506
        HighIntervalsOrdered        1.58      (4.6%)        1.61      (5.6%)    2.0% (  -7% -   12%) 0.228
                  TermDTSort       96.97      (3.2%)       98.92      (5.0%)    2.0% (  -6% -   10%) 0.132
   BrowseDayOfYearSSDVFacets        3.76      (9.6%)        3.85      (6.5%)    2.3% ( -12% -   20%) 0.366
                    PKLookup      106.69      (1.5%)      114.71      (1.5%)    7.5% (   4% -   10%) 0.000

Run # 2

                        Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
      BrowseMonthTaxoFacets        4.08     (31.7%)        3.80      (1.8%)   -6.7% ( -30% -   39%) 0.342
BrowseRandomLabelTaxoFacets        3.30     (21.9%)        3.16      (2.3%)   -4.2% ( -23% -   25%) 0.393
       BrowseDateTaxoFacets        3.85     (20.0%)        3.72      (6.1%)   -3.5% ( -24% -   28%) 0.456
  BrowseDayOfYearTaxoFacets        3.86     (20.0%)        3.73      (6.2%)   -3.3% ( -24% -   28%) 0.476
                 OrHighHigh       26.35      (6.9%)       25.88      (6.8%)   -1.8% ( -14% -   12%) 0.404
      HighTermDayOfYearSort      209.99      (3.7%)      206.48      (4.1%)   -1.7% (  -9% -    6%) 0.177
                  OrHighLow      273.41      (2.3%)      270.80      (3.0%)   -1.0% (  -6% -    4%) 0.252
          HighTermMonthSort     2346.36      (3.2%)     2326.36      (3.6%)   -0.9% (  -7% -    6%) 0.427
       HighTermTitleBDVSort        4.83      (3.7%)        4.80      (3.4%)   -0.7% (  -7% -    6%) 0.522
                    Prefix3      584.55      (3.3%)      580.95      (3.1%)   -0.6% (  -6% -    5%) 0.541
                  OrHighMed       78.07      (2.7%)       77.60      (2.9%)   -0.6% (  -6% -    5%) 0.489
                 TermDTSort       92.32      (4.1%)       91.89      (4.4%)   -0.5% (  -8% -    8%) 0.732
          HighTermTitleSort      138.64      (2.4%)      138.05      (2.5%)   -0.4% (  -5% -    4%) 0.580
                AndHighHigh       20.00      (5.1%)       19.93      (4.3%)   -0.4% (  -9% -    9%) 0.791
       BrowseDateSSDVFacets        0.90      (9.5%)        0.90      (8.0%)   -0.4% ( -16% -   18%) 0.891
                     Fuzzy2       39.56      (1.2%)       39.41      (1.4%)   -0.4% (  -2% -    2%) 0.362
   AndHighHighDayTaxoFacets        2.08      (4.1%)        2.07      (4.0%)   -0.3% (  -8% -    8%) 0.798
                     Fuzzy1       66.33      (1.0%)       66.17      (1.1%)   -0.2% (  -2% -    1%) 0.471
                   Wildcard       40.31      (3.9%)       40.22      (3.7%)   -0.2% (  -7% -    7%) 0.856
           HighSloppyPhrase       11.01      (1.8%)       11.00      (2.2%)   -0.1% (  -4% -    3%) 0.852
               OrHighNotLow      219.55      (7.5%)      219.38      (7.3%)   -0.1% ( -13% -   15%) 0.974
                    Respell       50.51      (1.6%)       50.48      (1.5%)   -0.1% (  -3% -    3%) 0.915
                     IntNRQ       18.46      (8.9%)       18.46      (9.1%)   -0.0% ( -16% -   19%) 0.988
                 AndHighMed       82.85      (3.1%)       82.84      (2.5%)   -0.0% (  -5% -    5%) 0.982
               OrNotHighLow      512.93      (2.0%)      512.86      (1.9%)   -0.0% (  -3% -    3%) 0.982
                LowSpanNear       64.37      (2.4%)       64.44      (2.7%)    0.1% (  -4% -    5%) 0.886
              OrNotHighHigh      278.80      (6.4%)      279.28      (6.0%)    0.2% ( -11% -   13%) 0.931
                    LowTerm      351.93      (4.1%)      352.53      (4.4%)    0.2% (  -7% -    9%) 0.898
               OrNotHighMed      201.78      (5.3%)      202.14      (5.1%)    0.2% (  -9% -   11%) 0.913
              OrHighNotHigh      196.39      (6.5%)      196.74      (6.5%)    0.2% ( -11% -   14%) 0.930
            LowSloppyPhrase        4.06      (4.1%)        4.07      (4.6%)    0.2% (  -8% -    9%) 0.865
    AndHighMedDayTaxoFacets       29.95      (1.5%)       30.04      (1.7%)    0.3% (  -2% -    3%) 0.577
     OrHighMedDayTaxoFacets        3.47      (5.7%)        3.48      (4.3%)    0.3% (  -9% -   10%) 0.857
        MedIntervalsOrdered        7.68      (6.1%)        7.71      (6.6%)    0.4% ( -11% -   13%) 0.858
                    MedTerm      462.78      (5.3%)      464.47      (6.6%)    0.4% ( -10% -   12%) 0.847
                 AndHighLow      274.17      (2.2%)      275.22      (2.5%)    0.4% (  -4% -    5%) 0.606
               HighSpanNear        3.88      (3.9%)        3.90      (4.8%)    0.5% (  -7% -    9%) 0.738
  BrowseDayOfYearSSDVFacets        3.60      (8.3%)        3.62      (9.9%)    0.5% ( -16% -   20%) 0.863
               OrHighNotMed      286.50      (6.4%)      287.93      (6.3%)    0.5% ( -11% -   14%) 0.803
        LowIntervalsOrdered        4.82      (3.9%)        4.85      (3.8%)    0.5% (  -6% -    8%) 0.678
                MedSpanNear        4.81      (3.1%)        4.84      (4.0%)    0.6% (  -6% -    7%) 0.616
       HighIntervalsOrdered        4.37      (4.8%)        4.40      (5.0%)    0.6% (  -8% -   10%) 0.700
       MedTermDayTaxoFacets       11.79      (2.7%)       11.86      (3.1%)    0.6% (  -5% -    6%) 0.507
            MedSloppyPhrase       36.65      (5.2%)       36.95      (4.6%)    0.8% (  -8% -   11%) 0.592
                   HighTerm      318.60      (6.5%)      322.10      (7.5%)    1.1% ( -12% -   16%) 0.621
                  LowPhrase       31.99      (4.0%)       32.35      (2.0%)    1.1% (  -4% -    7%) 0.255
      BrowseMonthSSDVFacets        4.35     (13.5%)        4.41     (14.8%)    1.3% ( -23% -   34%) 0.771
                  MedPhrase       54.81      (3.5%)       55.56      (2.4%)    1.4% (  -4% -    7%) 0.147
BrowseRandomLabelSSDVFacets        2.68      (8.5%)        2.73      (8.2%)    1.6% ( -13% -   19%) 0.556
                 HighPhrase        3.07      (6.6%)        3.15      (4.3%)    2.5% (  -7% -   14%) 0.150
                   PKLookup      105.96      (1.1%)      115.03      (1.4%)    8.6% (   5% -   11%) 0.000

Dec 04 '23 08:12 shubhamvishu

I also ran the benchmarks for wikibigall. Here too we see a similar, consistent ~7% improvement for PKLookup with a p-value of 0.000. Below are the benchmark results:

                        Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
      BrowseMonthSSDVFacets       23.50      (9.9%)       22.82     (10.6%)   -2.9% ( -21% -   19%) 0.374
            LowSloppyPhrase        7.89      (6.1%)        7.84      (7.2%)   -0.7% ( -13% -   13%) 0.737
                 TermDTSort      169.47      (5.8%)      168.46      (5.6%)   -0.6% ( -11% -   11%) 0.740
       HighTermTitleBDVSort       10.66      (6.2%)       10.60      (6.2%)   -0.6% ( -12% -   12%) 0.770
  BrowseDayOfYearSSDVFacets       20.09     (14.5%)       19.98     (17.1%)   -0.6% ( -28% -   36%) 0.910
                  OrHighMed      184.16      (3.2%)      183.28      (3.7%)   -0.5% (  -7% -    6%) 0.660
                 OrHighHigh       51.72      (5.6%)       51.48      (5.8%)   -0.5% ( -11% -   11%) 0.796
BrowseRandomLabelTaxoFacets       11.19      (5.1%)       11.14      (4.5%)   -0.4% (  -9% -    9%) 0.772
            MedSloppyPhrase       17.91      (3.5%)       17.84      (4.2%)   -0.4% (  -7% -    7%) 0.752
                  LowPhrase       13.42      (5.3%)       13.37      (6.3%)   -0.4% ( -11% -   11%) 0.848
                 AndHighLow      529.38      (2.2%)      527.55      (2.9%)   -0.3% (  -5% -    4%) 0.674
       BrowseDateSSDVFacets        4.47     (11.1%)        4.46     (10.4%)   -0.3% ( -19% -   23%) 0.924
       BrowseDateTaxoFacets       12.59      (4.0%)       12.55      (3.5%)   -0.3% (  -7% -    7%) 0.810
     OrHighMedDayTaxoFacets       10.06      (4.9%)       10.04      (4.7%)   -0.2% (  -9% -    9%) 0.914
      HighTermDayOfYearSort      314.20      (1.9%)      313.90      (1.7%)   -0.1% (  -3% -    3%) 0.867
      BrowseMonthTaxoFacets       12.34      (3.4%)       12.35      (2.8%)    0.1% (  -5% -    6%) 0.903
                 AndHighMed      164.82      (4.1%)      165.05      (4.3%)    0.1% (  -7% -    8%) 0.916
                 HighPhrase       40.51      (4.2%)       40.57      (5.6%)    0.1% (  -9% -   10%) 0.926
                    Prefix3      793.33      (2.5%)      794.74      (2.8%)    0.2% (  -5% -    5%) 0.833
               OrNotHighLow      731.50      (2.6%)      733.25      (2.4%)    0.2% (  -4% -    5%) 0.760
                  OrHighLow      394.15      (4.5%)      395.24      (4.5%)    0.3% (  -8% -    9%) 0.845
       MedTermDayTaxoFacets       27.09      (5.4%)       27.17      (4.6%)    0.3% (  -9% -   10%) 0.857
                    Respell       37.26      (1.9%)       37.40      (1.9%)    0.4% (  -3% -    4%) 0.540
                  MedPhrase       68.95      (3.2%)       69.21      (4.5%)    0.4% (  -7% -    8%) 0.764
          HighTermTitleSort      103.69      (2.8%)      104.13      (3.2%)    0.4% (  -5% -    6%) 0.659
               HighSpanNear        2.72      (5.7%)        2.74      (3.9%)    0.5% (  -8% -   10%) 0.755
           HighSloppyPhrase        7.67      (5.8%)        7.71      (6.5%)    0.5% ( -11% -   13%) 0.801
                   Wildcard       18.57      (2.9%)       18.66      (2.8%)    0.5% (  -5% -    6%) 0.581
                     Fuzzy1       73.74      (1.2%)       74.11      (1.4%)    0.5% (  -2% -    3%) 0.231
   AndHighHighDayTaxoFacets       15.25      (2.1%)       15.32      (2.8%)    0.5% (  -4% -    5%) 0.522
    AndHighMedDayTaxoFacets       20.82      (2.1%)       20.93      (2.8%)    0.5% (  -4% -    5%) 0.524
                AndHighHigh       22.11      (5.7%)       22.23      (5.4%)    0.5% ( -10% -   12%) 0.759
                   HighTerm      255.66      (5.9%)      257.09      (4.9%)    0.6% (  -9% -   12%) 0.744
                     Fuzzy2       62.45      (1.1%)       62.81      (1.4%)    0.6% (  -1% -    3%) 0.149
BrowseRandomLabelSSDVFacets       15.32      (6.9%)       15.42     (10.6%)    0.7% ( -15% -   19%) 0.816
                    MedTerm      286.47      (6.0%)      288.39      (4.6%)    0.7% (  -9% -   11%) 0.689
                     IntNRQ      119.53      (3.3%)      120.36      (5.0%)    0.7% (  -7% -    9%) 0.600
                    LowTerm      461.65      (5.4%)      464.97      (3.9%)    0.7% (  -8% -   10%) 0.631
          HighTermMonthSort     2546.87      (3.3%)     2566.25      (2.4%)    0.8% (  -4% -    6%) 0.403
        LowIntervalsOrdered       18.69      (3.6%)       18.84      (3.4%)    0.8% (  -5% -    8%) 0.468
  BrowseDayOfYearTaxoFacets       13.53      (6.0%)       13.64      (6.3%)    0.8% ( -10% -   13%) 0.669
               OrNotHighMed      114.22      (3.0%)      115.51      (2.5%)    1.1% (  -4% -    6%) 0.195
               OrHighNotMed      278.93      (4.7%)      282.15      (4.8%)    1.2% (  -7% -   11%) 0.443
              OrNotHighHigh       81.05      (5.1%)       82.09      (4.5%)    1.3% (  -7% -   11%) 0.402
              OrHighNotHigh      151.36      (5.3%)      153.39      (4.7%)    1.3% (  -8% -   12%) 0.398
        MedIntervalsOrdered       17.52      (5.0%)       17.78      (4.4%)    1.4% (  -7% -   11%) 0.333
       HighIntervalsOrdered        3.42      (4.7%)        3.48      (4.4%)    1.8% (  -7% -   11%) 0.223
               OrHighNotLow      231.81      (5.6%)      236.03      (5.4%)    1.8% (  -8% -   13%) 0.297
                MedSpanNear       11.01      (8.1%)       11.23      (5.5%)    2.0% ( -10% -   16%) 0.367
                LowSpanNear        4.54     (11.5%)        4.67      (7.3%)    3.0% ( -14% -   24%) 0.326
                   PKLookup      122.20      (1.3%)      130.59      (2.0%)    6.9% (   3% -   10%) 0.000

Dec 04 '23 08:12 shubhamvishu

Do you mean to change StringHelper class to add support for 128 bit hash because currently it creates 32-bit hash with Murmur 3?

Sorry, I had overlooked that StringHelper only used 32 bits for its hash. If it's not a good fit, I'm good with hardcoding murmur3 in this postings format.

Dec 04 '23 21:12 jpountz

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

Jan 08 '24 12:01 github-actions[bot]

@shubhamvishu -- I think the PKLookup gains here are compelling, and there is consensus to make this improvement only to the BloomFilterPostingsFormat? But understanding why supposedly equivalent expressions yield such a different benchmark result remains ...

Jun 24 '24 23:06 mikemccand

I think the PKLookup gains here are compelling, and there is consensus to make this improvement only to the BloomFilterPostingsFormat?

Sure @mikemccand, I'll update this PR to address the comments Adrien had earlier. I'll also post new results shortly (this week), as it's been some time since we last ran the benchmarks.

But understanding why supposedly equivalent expressions yield such a different benchmark result remains ...

For this I think we can open another issue to understand why we see different results.

Jun 25 '24 11:06 shubhamvishu

But understanding why supposedly equivalent expressions yield such a different benchmark result remains ...

The expression ((int) A) >>> 1 + ((int) B) >>> 1 is equivalent to (((int) A) >>> (1 + ((int) B))) >>> 1 when we see the operator precedence, and this seems to cause the hash values to be sometimes bigger and sometimes smaller (I manually tried a couple of them), as opposed to the simple ((A/2) + (B/2)), which does not achieve that and produces large hashes every time. I think this small tweak generates hashes that are spread more uniformly over the range, which probably leads to fewer collisions. Another point in its favor is that we should ideally have seen a regression if this were generating garbage or similar hashes, but we consistently saw an improvement in both wikimediumall and wikibigall earlier. I'll share the latest benchmarks as well when they are completed. Any thoughts? @mikemccand @jpountz
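To make the precedence point concrete, here is a minimal, self-contained sketch. It is not the PR's code: the class name, method names, and sample inputs are hypothetical, and A and B are simply treated as two long values. It relies only on Java's rules that + binds tighter than >>>, that >>> is left-associative, and that an int shift uses only the low five bits of the shift distance.

```java
final class ShiftPrecedenceDemo {
  // The expression exactly as written in the discussion, without extra parentheses.
  static int asWritten(long a, long b) {
    return ((int) a) >>> 1 + ((int) b) >>> 1;
  }

  // How Java parses it: '+' binds tighter than '>>>', and '>>>' is left-associative.
  // Note that the shift distance (1 + (int) b) is masked to its low five bits for int shifts.
  static int asParsed(long a, long b) {
    return (((int) a) >>> (1 + ((int) b))) >>> 1;
  }

  // The presumably intended form: halve each operand, then add.
  static int intended(long a, long b) {
    return (((int) a) >>> 1) + (((int) b) >>> 1);
  }

  public static void main(String[] args) {
    // Hypothetical sample inputs, chosen only for illustration.
    long[][] samples = {{0x12345678L, 3L}, {-1L, 0L}, {0x7fffffffL, 0x40000000L}};
    for (long[] s : samples) {
      System.out.printf(
          "a=%d b=%d asWritten=%d asParsed=%d intended=%d%n",
          s[0], s[1], asWritten(s[0], s[1]), asParsed(s[0], s[1]), intended(s[0], s[1]));
    }
  }
}
```

For every input, asWritten and asParsed agree, while intended generally differs. This shows only that B ends up acting as a shift distance rather than being added; it says nothing about the quality of the resulting hash distribution.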

Jul 01 '24 03:07 shubhamvishu

But understanding why supposedly equivalent expressions yield such a different benchmark result remains ...

The expression ((int) A) >>> 1 + ((int) B) >>> 1 is equivalent to (((int) A) >>> (1 + ((int) B))) >>> 1 when we see the operator precedence

OK I see. Yes indeed, the expression is buggy (does not match the murmur3 hash function) without the added ( ... ), thanks to Java's operator precedence.

It would indeed be surprising if you accidentally discovered a better hashing function than murmur3 with this bug. It is curious that you see consistently better PKLookup performance ... do you see any gains vs main with the bug fixed (added parens)?

Could we maybe add a random test to assert that your impl in BloomPostingsFormat matches StringHelper's murmur3?
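A rough sketch of what such a randomized comparison could look like is below. It is self-contained and does not touch Lucene: both implementations are stand-ins written for this illustration (32-bit murmur3, one assembling blocks by hand and one reading them through a little-endian ByteBuffer), and all class and method names are hypothetical. In a real test, one side would be the postings format's hash and the other the reference implementation.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Random;

final class Murmur3RandomTestSketch {
  private static final int C1 = 0xcc9e2d51;
  private static final int C2 = 0x1b873593;

  // Per-block mixing step of 32-bit murmur3.
  private static int mixK1(int k1) {
    k1 *= C1;
    k1 = Integer.rotateLeft(k1, 15);
    return k1 * C2;
  }

  // Final avalanche step of 32-bit murmur3.
  private static int fmix(int h) {
    h ^= h >>> 16;
    h *= 0x85ebca6b;
    h ^= h >>> 13;
    h *= 0xc2b2ae35;
    return h ^ (h >>> 16);
  }

  // Stand-in implementation A: assembles little-endian blocks by hand.
  static int hashManual(byte[] data, int seed) {
    int h1 = seed;
    int fullBlocksEnd = data.length & ~3;
    for (int i = 0; i < fullBlocksEnd; i += 4) {
      int k1 = (data[i] & 0xff) | ((data[i + 1] & 0xff) << 8)
          | ((data[i + 2] & 0xff) << 16) | ((data[i + 3] & 0xff) << 24);
      h1 ^= mixK1(k1);
      h1 = Integer.rotateLeft(h1, 13) * 5 + 0xe6546b64;
    }
    int k1 = 0;
    switch (data.length & 3) { // cases fall through on purpose
      case 3: k1 ^= (data[fullBlocksEnd + 2] & 0xff) << 16;
      case 2: k1 ^= (data[fullBlocksEnd + 1] & 0xff) << 8;
      case 1: k1 ^= data[fullBlocksEnd] & 0xff; h1 ^= mixK1(k1);
      default: break;
    }
    return fmix(h1 ^ data.length);
  }

  // Stand-in implementation B: reads blocks through a little-endian ByteBuffer.
  static int hashBuffered(byte[] data, int seed) {
    ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
    int h1 = seed;
    while (buf.remaining() >= 4) {
      h1 ^= mixK1(buf.getInt());
      h1 = Integer.rotateLeft(h1, 13) * 5 + 0xe6546b64;
    }
    int k1 = 0;
    for (int shift = 0; buf.hasRemaining(); shift += 8) {
      k1 |= (buf.get() & 0xff) << shift;
    }
    h1 ^= mixK1(k1); // mixK1(0) == 0, so an empty tail leaves h1 unchanged
    return fmix(h1 ^ data.length);
  }

  // Hashes random inputs with both implementations and counts disagreements.
  static int countMismatches(int iterations, long randomSeed) {
    Random random = new Random(randomSeed);
    int mismatches = 0;
    for (int i = 0; i < iterations; i++) {
      byte[] data = new byte[random.nextInt(64)];
      random.nextBytes(data);
      int seed = random.nextInt();
      if (hashManual(data, seed) != hashBuffered(data, seed)) {
        mismatches++;
      }
    }
    return mismatches;
  }

  public static void main(String[] args) {
    System.out.println("mismatches: " + countMismatches(10_000, 42L));
  }
}
```

Random lengths up to 63 bytes cover every tail size (0 to 3 leftover bytes), which is where hand-written murmur3 variants most often diverge.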

Jul 01 '24 10:07 mikemccand

do you see any gains vs main with the bug fixed (added parens)?

I ran the luceneutil benchmarks with this PR again, and also with the expression fixed using "()". It required a few changes in luceneutil to enable BloomFilteringPostingsFormat for the id field (they are in the commit below).

Required luceneutil changes - https://github.com/shubhamvishu/luceneutil/commit/f675c6611a678f932360a89883b22f9ba9b75a8b

Here are the latest results that I'm seeing with this PR:

1. Run # 1 : Result when using new expression i.e. this PR

                            Task  QPS baseline      StdDev  QPS my_modified_version   StdDev  Pct diff      p-value
                   OrHighNotMed      151.52      (3.2%)      105.32      (2.6%)  -30.5% ( -35% -  -25%) 0.000
            MedIntervalsOrdered       35.15      (8.4%)       30.00      (7.7%)  -14.7% ( -28% -    1%) 0.000
                     HighPhrase       43.82      (4.9%)       37.83      (3.4%)  -13.7% ( -20% -   -5%) 0.000
                      MedPhrase       36.18     (11.7%)       31.64      (4.6%)  -12.5% ( -25% -    4%) 0.000
           HighIntervalsOrdered       12.19      (8.6%)       10.79      (6.4%)  -11.5% ( -24% -    3%) 0.000
      BrowseDayOfYearSSDVFacets        3.49     (13.7%)        3.11     (12.2%)  -10.8% ( -32% -   17%) 0.008
              HighTermMonthSort     1160.29      (4.2%)     1064.20      (2.8%)   -8.3% ( -14% -   -1%) 0.000
          BrowseMonthSSDVFacets        3.49     (15.0%)        3.27     (18.0%)   -6.5% ( -34% -   31%) 0.216
                    AndHighHigh       90.31      (5.8%)       85.38      (5.4%)   -5.5% ( -15% -    6%) 0.002
    BrowseRandomLabelSSDVFacets        2.17      (5.1%)        2.06      (7.9%)   -5.4% ( -17% -    7%) 0.010
           MedTermDayTaxoFacets        8.82      (6.5%)        8.35      (5.6%)   -5.4% ( -16% -    7%) 0.005
                         IntNRQ       60.02      (9.2%)       57.07     (13.7%)   -4.9% ( -25% -   19%) 0.184
          HighTermDayOfYearSort       75.06      (2.8%)       71.92      (4.4%)   -4.2% ( -11% -    3%) 0.000
                    MedSpanNear       22.04      (5.1%)       21.25      (4.2%)   -3.6% ( -12% -    6%) 0.015
                        Prefix3      417.51      (3.3%)      402.97      (3.5%)   -3.5% (  -9% -    3%) 0.001
                         Fuzzy1       66.49      (3.1%)       64.58      (2.8%)   -2.9% (  -8% -    3%) 0.002
                         Fuzzy2       60.77      (3.9%)       59.11      (5.2%)   -2.7% ( -11% -    6%) 0.061
                     AndHighLow      750.92      (3.5%)      738.33      (3.8%)   -1.7% (  -8% -    5%) 0.150
         OrHighMedDayTaxoFacets        3.19      (6.3%)        3.14      (6.8%)   -1.5% ( -13% -   12%) 0.481
        AndHighMedDayTaxoFacets        7.77      (3.7%)        7.68      (4.6%)   -1.2% (  -9% -    7%) 0.368
                        Respell       42.71      (4.1%)       42.45      (4.5%)   -0.6% (  -8% -    8%) 0.659
       AndHighHighDayTaxoFacets        2.68      (4.3%)        2.66      (4.1%)   -0.6% (  -8% -    8%) 0.673
               HighSloppyPhrase       18.32      (4.5%)       18.26      (6.0%)   -0.3% ( -10% -   10%) 0.851
                   OrNotHighLow      947.96      (4.4%)      946.42      (2.7%)   -0.2% (  -7% -    7%) 0.889
                LowSloppyPhrase       24.72      (4.3%)       24.80      (5.8%)    0.3% (  -9% -   10%) 0.834
                     OrHighHigh       63.44      (8.1%)       64.04      (6.8%)    0.9% ( -12% -   17%) 0.694
            LowIntervalsOrdered       47.35      (6.2%)       47.85      (7.0%)    1.1% ( -11% -   15%) 0.608
                     AndHighMed      175.58      (3.1%)      177.59      (3.6%)    1.1% (  -5% -    8%) 0.286
                  OrHighNotHigh       57.34      (2.5%)       58.02      (2.3%)    1.2% (  -3% -    6%) 0.125
                    LowSpanNear       54.84      (3.3%)       55.69      (4.3%)    1.6% (  -5% -    9%) 0.201
                      OrHighLow      366.06      (3.8%)      372.02      (2.7%)    1.6% (  -4% -    8%) 0.119
                MedSloppyPhrase       24.57      (4.0%)       24.99      (5.0%)    1.7% (  -6% -   11%) 0.224
                       Wildcard      161.09      (2.6%)      165.71      (3.1%)    2.9% (  -2% -    8%) 0.002
                      OrHighMed      116.69      (6.4%)      120.28      (4.6%)    3.1% (  -7% -   15%) 0.082
           HighTermTitleBDVSort        7.68      (2.7%)        7.92      (4.0%)    3.1% (  -3% -   10%) 0.004
                      LowPhrase       21.75      (6.3%)       22.62      (7.7%)    4.0% (  -9% -   19%) 0.072
                        LowTerm      197.26      (2.9%)      210.16      (2.8%)    6.5% (   0% -   12%) 0.000
              HighTermTitleSort       14.21      (4.2%)       15.16      (3.7%)    6.7% (  -1% -   15%) 0.000
                   HighSpanNear        4.45      (3.5%)        4.80      (2.6%)    8.0% (   1% -   14%) 0.000
                       HighTerm      111.71      (3.9%)      121.04      (2.8%)    8.3% (   1% -   15%) 0.000
                   OrNotHighMed      134.75      (3.2%)      146.33      (2.7%)    8.6% (   2% -   14%) 0.000
                       PKLookup      110.99      (4.4%)      124.24      (6.0%)   11.9% (   1% -   23%) 0.000
                  OrNotHighHigh       76.76      (3.0%)       91.73      (4.0%)   19.5% (  12% -   27%) 0.000
           BrowseDateSSDVFacets        0.57     (13.9%)        0.72     (19.1%)   25.6% (  -6% -   68%) 0.000
    BrowseRandomLabelTaxoFacets        1.98      (6.4%)        2.77     (36.1%)   39.9% (  -2% -   88%) 0.000
                        MedTerm      101.03      (4.3%)      145.90      (5.6%)   44.4% (  33% -   56%) 0.000
           BrowseDateTaxoFacets        2.50      (9.8%)        3.63     (41.4%)   45.2% (  -5% -  106%) 0.000
      BrowseDayOfYearTaxoFacets        2.51      (9.3%)        3.70     (42.2%)   47.8% (  -3% -  109%) 0.000
                   OrHighNotLow      100.87      (3.5%)      189.07      (7.6%)   87.4% (  73% -  102%) 0.000
                     TermDTSort       22.89      (2.9%)       54.10     (15.4%)  136.4% ( 114% -  159%) 0.000
          BrowseMonthTaxoFacets        2.41     (10.9%)        6.66     (93.8%)  176.2% (  64% -  315%) 0.000

2. Run # 2 : Result when using new expression i.e. this PR

                           Task  QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff  p-value
                    LowSpanNear       76.20      (2.5%)       67.29      (5.3%)  -11.7% ( -19% -   -3%) 0.000
            LowIntervalsOrdered       51.58      (5.5%)       47.22      (6.0%)   -8.5% ( -18% -    3%) 0.000
          BrowseMonthSSDVFacets        3.49     (20.6%)        3.20     (22.0%)   -8.1% ( -42% -   43%) 0.232
      BrowseDayOfYearSSDVFacets        3.17      (8.9%)        2.94     (13.4%)   -7.2% ( -27% -   16%) 0.045
    BrowseRandomLabelSSDVFacets        2.18     (12.8%)        2.03     (11.1%)   -7.0% ( -27% -   19%) 0.065
                     AndHighLow      738.26      (4.8%)      687.21      (5.3%)   -6.9% ( -16% -    3%) 0.000
                LowSloppyPhrase       91.80      (4.0%)       85.81      (4.9%)   -6.5% ( -14% -    2%) 0.000
                     OrHighHigh       72.55      (9.6%)       68.80      (8.5%)   -5.2% ( -21% -   14%) 0.073
           MedTermDayTaxoFacets        6.88      (6.5%)        6.57      (8.1%)   -4.6% ( -17% -   10%) 0.050
                      MedPhrase       73.44      (2.7%)       70.35      (5.9%)   -4.2% ( -12% -    4%) 0.004
                        Respell       26.61      (4.0%)       25.53      (5.6%)   -4.1% ( -13% -    5%) 0.008
                        LowTerm      202.01      (3.5%)      193.99      (2.6%)   -4.0% (  -9% -    2%) 0.000
                MedSloppyPhrase       35.32      (5.4%)       33.94      (7.1%)   -3.9% ( -15% -    9%) 0.050
                        Prefix3      406.66      (6.5%)      392.60      (5.2%)   -3.5% ( -14% -    8%) 0.063
        AndHighMedDayTaxoFacets       19.45      (4.9%)       18.92      (5.3%)   -2.8% ( -12% -    7%) 0.087
       AndHighHighDayTaxoFacets        2.77      (5.5%)        2.69      (6.7%)   -2.8% ( -14% -   10%) 0.156
         OrHighMedDayTaxoFacets        3.34      (5.0%)        3.25      (7.6%)   -2.7% ( -14% -   10%) 0.184
                       Wildcard      768.29      (3.6%)      751.35      (3.9%)   -2.2% (  -9% -    5%) 0.060
                         Fuzzy2       44.04      (5.4%)       43.09      (5.2%)   -2.1% ( -12% -    8%) 0.201
                   OrNotHighLow      853.76      (3.0%)      835.98      (2.6%)   -2.1% (  -7% -    3%) 0.019
                    AndHighHigh       85.25      (7.5%)       83.76      (6.2%)   -1.7% ( -14% -   12%) 0.422
              HighTermMonthSort     1068.18      (3.6%)     1050.07      (3.9%)   -1.7% (  -8% -    6%) 0.153
               HighSloppyPhrase        5.96      (3.8%)        5.94      (3.6%)   -0.4% (  -7% -    7%) 0.737
           HighTermTitleBDVSort        7.39      (3.9%)        7.36      (2.9%)   -0.3% (  -6% -    6%) 0.785
                       HighTerm       83.27      (3.3%)       83.02      (3.5%)   -0.3% (  -6% -    6%) 0.781
                      LowPhrase       92.82      (3.7%)       92.58      (2.1%)   -0.3% (  -5% -    5%) 0.791
                      OrHighLow      266.30      (2.0%)      268.48      (2.3%)    0.8% (  -3% -    5%) 0.229
           HighIntervalsOrdered        5.48      (5.7%)        5.53      (5.5%)    0.9% (  -9% -   12%) 0.621
            MedIntervalsOrdered       26.98      (5.8%)       27.26      (6.6%)    1.0% ( -10% -   14%) 0.597
                    MedSpanNear        4.02      (2.6%)        4.09      (3.5%)    1.6% (  -4% -    7%) 0.104
                      OrHighMed       65.75      (7.5%)       66.80      (6.5%)    1.6% ( -11% -   16%) 0.472
                     HighPhrase      120.64      (4.8%)      122.75      (4.2%)    1.8% (  -6% -   11%) 0.222
                   HighSpanNear       11.14      (3.6%)       11.34      (2.7%)    1.8% (  -4% -    8%) 0.073
                  OrHighNotHigh       83.12      (2.8%)       84.65      (2.3%)    1.8% (  -3% -    7%) 0.022
                   OrNotHighMed      147.67      (2.2%)      150.48      (2.6%)    1.9% (  -2% -    6%) 0.011
                     AndHighMed      168.61      (3.1%)      173.38      (3.4%)    2.8% (  -3% -    9%) 0.006
                         Fuzzy1       75.52      (7.4%)       77.77      (6.1%)    3.0% (  -9% -   17%) 0.165
                   OrHighNotMed       55.39      (3.3%)       60.07      (3.7%)    8.4% (   1% -   16%) 0.000
                       PKLookup      108.78      (5.8%)      119.09      (6.0%)    9.5% (  -2% -   22%) 0.000
                  OrNotHighHigh      208.07      (3.5%)      235.47      (3.4%)   13.2% (   6% -   20%) 0.000
              HighTermTitleSort        7.70      (3.9%)        9.25      (3.4%)   20.1% (  12% -   28%) 0.000
          HighTermDayOfYearSort       61.64      (2.7%)       75.16      (4.6%)   21.9% (  14% -   30%) 0.000
           BrowseDateSSDVFacets        0.56     (15.7%)        0.69     (18.9%)   22.9% ( -10% -   68%) 0.000
           BrowseDateTaxoFacets        2.30      (4.9%)        3.14      (8.5%)   37.0% (  22% -   52%) 0.000
    BrowseRandomLabelTaxoFacets        1.85      (3.5%)        2.55      (5.6%)   38.0% (  27% -   48%) 0.000
      BrowseDayOfYearTaxoFacets        2.28      (4.7%)        3.25      (6.7%)   42.5% (  29% -   56%) 0.000
                   OrHighNotLow      165.24      (2.9%)      248.52      (4.7%)   50.4% (  41% -   59%) 0.000
                        MedTerm       92.99      (2.9%)      144.38      (5.5%)   55.3% (  45% -   65%) 0.000
                         IntNRQ       31.39     (29.9%)       50.82      (9.1%)   61.9% (  17% -  144%) 0.000
          BrowseMonthTaxoFacets        2.36     (11.1%)        6.79     (93.6%)  188.0% (  75% -  329%) 0.000
                     TermDTSort       12.60      (4.2%)       68.61     (72.4%)  444.5% ( 352% -  544%) 0.000

3. Run # 3 : Result when NOT using new expression

                           Task  QPS baseline   StdDev  QPS my_modified_version   StdDev  Pct diff  p-value
                       HighTerm      171.08      (3.0%)      101.57      (2.5%)  -40.6% ( -44% -  -36%) 0.000
              HighTermTitleSort       15.14      (3.9%)       12.25      (3.1%)  -19.1% ( -25% -  -12%) 0.000
                    LowSpanNear      170.82      (9.6%)      144.83     (13.7%)  -15.2% ( -35% -    8%) 0.000
                      MedPhrase      148.79      (4.2%)      130.79      (3.5%)  -12.1% ( -19% -   -4%) 0.000
               HighSloppyPhrase       31.34      (7.7%)       27.66      (7.1%)  -11.7% ( -24% -    3%) 0.000
                LowSloppyPhrase       57.48      (8.4%)       51.22      (7.6%)  -10.9% ( -24% -    5%) 0.000
            MedIntervalsOrdered       48.65      (7.5%)       43.49      (5.1%)  -10.6% ( -21% -    2%) 0.000
          BrowseMonthSSDVFacets        3.61     (16.0%)        3.33     (14.5%)   -7.7% ( -32% -   27%) 0.108
                      LowPhrase       21.09      (6.6%)       19.57      (4.9%)   -7.2% ( -17% -    4%) 0.000
    BrowseRandomLabelSSDVFacets        2.21      (4.3%)        2.08      (5.1%)   -5.8% ( -14% -    3%) 0.000
      BrowseDayOfYearSSDVFacets        3.38     (11.7%)        3.19     (14.9%)   -5.4% ( -28% -   24%) 0.203
           HighIntervalsOrdered        6.48      (5.8%)        6.16      (5.9%)   -5.0% ( -15% -    7%) 0.007
                   OrNotHighLow     1349.40      (3.9%)     1284.48      (4.1%)   -4.8% ( -12% -    3%) 0.000
                   HighSpanNear        7.79      (2.6%)        7.42      (3.5%)   -4.8% ( -10% -    1%) 0.000
                       PKLookup      113.95      (3.7%)      108.74      (4.6%)   -4.6% ( -12% -    3%) 0.001
        AndHighMedDayTaxoFacets       44.93      (3.5%)       43.23      (6.7%)   -3.8% ( -13% -    6%) 0.025
                        Respell       35.71      (3.2%)       34.37      (5.6%)   -3.7% ( -12% -    5%) 0.009
                     AndHighMed       96.91      (6.8%)       93.83      (3.9%)   -3.2% ( -13% -    8%) 0.070
                     OrHighHigh       65.98      (8.2%)       64.12     (12.0%)   -2.8% ( -21% -   18%) 0.387
                     AndHighLow     1356.33      (3.6%)     1319.07      (5.2%)   -2.7% ( -11% -    6%) 0.052
                    AndHighHigh       66.89      (7.6%)       65.17      (6.9%)   -2.6% ( -15% -   12%) 0.263
         OrHighMedDayTaxoFacets        2.70      (7.1%)        2.63      (5.6%)   -2.5% ( -14% -   11%) 0.225
                     HighPhrase       80.60      (4.9%)       79.11      (4.6%)   -1.8% ( -10% -    8%) 0.221
       AndHighHighDayTaxoFacets       10.10      (4.0%)        9.91      (4.4%)   -1.8% (  -9% -    6%) 0.175
                       Wildcard      139.44      (2.7%)      137.27      (3.0%)   -1.6% (  -7% -    4%) 0.083
           HighTermTitleBDVSort        8.90      (4.0%)        8.76      (3.2%)   -1.6% (  -8% -    5%) 0.174
           MedTermDayTaxoFacets       16.02      (4.7%)       15.78      (3.2%)   -1.5% (  -9% -    6%) 0.225
                         Fuzzy1       58.33      (3.4%)       57.52      (4.5%)   -1.4% (  -8% -    6%) 0.274
                        Prefix3      495.32      (2.3%)      489.23      (1.9%)   -1.2% (  -5% -    3%) 0.063
                MedSloppyPhrase       36.19      (4.0%)       35.89      (2.5%)   -0.8% (  -7% -    5%) 0.428
          HighTermDayOfYearSort       60.62      (4.4%)       61.15      (4.2%)    0.9% (  -7% -    9%) 0.520
                    MedSpanNear       16.26      (4.8%)       16.55      (6.5%)    1.7% (  -9% -   13%) 0.341
            LowIntervalsOrdered       57.41      (5.4%)       58.49      (4.2%)    1.9% (  -7% -   12%) 0.224
                      OrHighMed      130.42      (5.2%)      133.04      (5.5%)    2.0% (  -8% -   13%) 0.238
                         IntNRQ       70.41      (3.8%)       71.85      (5.2%)    2.0% (  -6% -   11%) 0.153
              HighTermMonthSort     1101.95      (3.8%)     1129.04      (3.2%)    2.5% (  -4% -    9%) 0.025
                   OrHighNotMed       56.97      (3.3%)       58.71      (3.1%)    3.1% (  -3% -    9%) 0.003
                         Fuzzy2       49.17      (3.5%)       51.36      (4.3%)    4.5% (  -3% -   12%) 0.000
                      OrHighLow      290.33      (3.0%)      303.84      (2.3%)    4.7% (   0% -   10%) 0.000
                  OrHighNotHigh       88.66      (3.7%)       93.58      (2.9%)    5.6% (   0% -   12%) 0.000
                        LowTerm      209.03      (2.4%)      228.72      (2.9%)    9.4% (   4% -   15%) 0.000
                        MedTerm       73.56      (3.1%)       86.92      (3.7%)   18.2% (  10% -   25%) 0.000
                   OrNotHighMed      218.55      (2.1%)      267.79      (3.2%)   22.5% (  16% -   28%) 0.000
                  OrNotHighHigh       45.70      (2.9%)       57.84      (4.7%)   26.6% (  18% -   35%) 0.000
           BrowseDateSSDVFacets        0.60     (13.1%)        0.77     (17.4%)   29.2% (  -1% -   68%) 0.000
                   OrHighNotLow      163.54      (2.5%)      214.57      (2.9%)   31.2% (  25% -   37%) 0.000
                     TermDTSort       29.64      (4.0%)       39.09      (6.2%)   31.9% (  20% -   43%) 0.000
    BrowseRandomLabelTaxoFacets        1.92      (2.5%)        2.88     (46.6%)   49.7% (   0% -  101%) 0.000
           BrowseDateTaxoFacets        2.39      (6.1%)        3.79     (63.8%)   59.0% ( -10% -  137%) 0.000
      BrowseDayOfYearTaxoFacets        2.37      (5.4%)        3.85     (67.2%)   62.4% (  -9% -  142%) 0.000
          BrowseMonthTaxoFacets        2.49      (3.7%)        7.08     (93.6%)  183.8% (  83% -  291%) 0.000

Jul 02 '24 03:07 shubhamvishu

I'm seeing some crazy speedups for some tasks in the benchmarks (including PKLookup; a few got a little slower) when using the new expression. Looking for your thoughts on this: are my luceneutil changes right, and could we expect this PR to affect tasks other than PKLookup, as above? The reported gains are so high that I'm a bit suspicious of them. I'd need an extra pair of eyes here.

@mikemccand @jpountz Any thoughts?
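
One quick sanity check on the tables above is to recompute the Pct diff column from the two QPS columns. The sketch below assumes Pct diff is the relative change of the mean QPS, which is consistent with the rows shown; the class and method names are hypothetical and not part of luceneutil. It does not explain why the gains are so large. Note also that the baseline QPS for the same task differs between the three tables in this section (TermDTSort: 22.89, 12.60 and 29.64), so the tables are not directly comparable with each other.

```java
final class PctDiffCheck {
    // Relative change of the modified version's mean QPS versus the baseline's, in percent.
    static double pctDiff(double baselineQps, double modifiedQps) {
        return (modifiedQps / baselineQps - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // (baseline QPS, modified QPS) pairs copied from the Run # 2 table.
        System.out.printf("TermDTSort %.1f%%%n", pctDiff(12.60, 68.61));
        System.out.printf("MedTerm    %.1f%%%n", pctDiff(92.99, 144.38));
        System.out.printf("PKLookup   %.1f%%%n", pctDiff(108.78, 119.09));
    }
}
```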

Do you see any gains vs main with the bug fixed (added parens)?

When not using the new expression, i.e. using "()" as we would normally expect (code changes below), I still see high speedups for some tasks in the benchmarks (see the earlier comment), but this time PKLookup regresses ~4-5% with this change.

@@ -151,8 +151,8 @@ public class FuzzySet implements Accountable {
   public ContainsResult contains(BytesRef value) {
     long[] hash = StringHelper.murmurhash3_x64_128(value);
 
-    int msb = ((int) hash[0] >>> Integer.SIZE) >>> 1 + ((int) hash[1] >>> Integer.SIZE) >>> 1;
-    int lsb = ((int) hash[0]) >>> 1 + ((int) hash[1]) >>> 1;
+    int msb = (((int) hash[0] >>> Integer.SIZE) >>> 1) + (((int) hash[1] >>> Integer.SIZE) >>> 1);
+    int lsb = (((int) hash[0]) >>> 1) + (((int) hash[1]) >>> 1);
     for (int i = 0; i < hashCount; i++) {
       int bloomPos = (lsb + i * msb);
       if (!mayContainValue(bloomPos)) {
@@ -219,8 +219,8 @@ public class FuzzySet implements Accountable {
    */
   public void addValue(BytesRef value) {
     long[] hash = StringHelper.murmurhash3_x64_128(value);
-    int msb = ((int) hash[0] >>> Integer.SIZE) >>> 1 + ((int) hash[1] >>> Integer.SIZE) >>> 1;
-    int lsb = ((int) hash[0]) >>> 1 + ((int) hash[1]) >>> 1;
+    int msb = (((int) hash[0] >>> Integer.SIZE) >>> 1) + (((int) hash[1] >>> Integer.SIZE) >>> 1);
+    int lsb = (((int) hash[0]) >>> 1) + (((int) hash[1]) >>> 1);
     for (int i = 0; i < hashCount; i++) {
       // Bitmasking using bloomSize is effectively a modulo operation.
       int bloomPos = (lsb + i * msb) & bloomSize;
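
The parentheses in the diff above change the result because Java's additive operators bind more tightly than its shift operators, so `a >>> 1 + b >>> 1` is parsed as `(a >>> (1 + b)) >>> 1`. A minimal, self-contained sketch of this follows; the class and method names are hypothetical and not part of the PR.

```java
final class ShiftPrecedenceDemo {
    // The expression shape from the removed lines: '+' is applied before '>>>'.
    static int withoutParens(int a, int b) {
        return a >>> 1 + b >>> 1;
    }

    // How the compiler groups the expression above.
    static int asParsed(int a, int b) {
        return (a >>> (1 + b)) >>> 1;
    }

    // The expression shape from the added lines: halve each operand, then add.
    static int withParens(int a, int b) {
        return (a >>> 1) + (b >>> 1);
    }

    // A cast binds tighter than a shift, and an int shift distance uses only
    // its low five bits, so shifting an int by Integer.SIZE (32) is a no-op.
    static int castThenShift(long h) {
        return (int) h >>> Integer.SIZE;
    }

    // Shifting the long first and casting afterwards yields the upper 32 bits.
    static int shiftThenCast(long h) {
        return (int) (h >>> Integer.SIZE);
    }

    public static void main(String[] args) {
        int a = 0x40000000;
        int b = 3;
        System.out.println(withoutParens(a, b) == asParsed(a, b));
        System.out.println(withoutParens(a, b) == withParens(a, b));

        long h = 0x1234567800000001L;
        System.out.println(Integer.toHexString(castThenShift(h)));
        System.out.println(Integer.toHexString(shiftThenCast(h)));
    }
}
```

The second pair of methods illustrates a related Java rule that may be worth double-checking against the `msb` expressions: `(int) hash[0] >>> Integer.SIZE` casts to `int` before shifting, which leaves the low 32 bits unchanged instead of extracting the high ones.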

Could we maybe add a random test to assert that your impl in BloomPostingsFormat matches StringHelper's murmur3?

@mikemccand Do you mean we should also change the expression in StringHelper#murmurhash3_x86_32 to the new one and assert that? Or do you mean we should add a test asserting that the new expression in FuzzySet is not altered later?

Jul 02 '24 03:07 shubhamvishu

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

Jul 17 '24 00:07 github-actions[bot]

@mikemccand @jpountz Bumping this up. I don't know if there are any major concerns here? If not, it'd be awesome to include this in 10.0 as well. Looking forward to your thoughts. Thanks!

Sep 10 '24 16:09 shubhamvishu

Thanks for the reminder @shubhamvishu!

I'm seeing some crazy speedups for some tasks in the benchmarks (including PKLookup; a few got a little slower) when using the new expression.

Hmm did you post the full results somewhere?

Sep 11 '24 17:09 mikemccand

I'm seeing some crazy speedups for some tasks in the benchmarks (including PKLookup; a few got a little slower) when using the new expression.

Hmm did you post the full results somewhere?

OK sorry I see them now. I have to click on those sophisticated arrows to expand the results, heh.

Sep 11 '24 17:09 mikemccand

@jpountz I have addressed your comments now and kept the bit mixing logic simple as proposed initially. Let me know if the change looks good now. Thanks!

Sep 22 '24 11:09 shubhamvishu

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

Oct 08 '24 00:10 github-actions[bot]

@jpountz I simplified the expression now. Let me know if the change looks good? Thanks!

Oct 14 '24 10:10 shubhamvishu

BTW, why is StringHelper an abstract class? Can we make it final?

Oct 15 '24 02:10 vsop-479

I think yes, we can make it final: it has no abstract methods, so there is no need for it to be abstract. Git blame says it was made abstract >20 years ago by Doug, so maybe it has just stayed like this since.

Oct 17 '24 22:10 shubhamvishu

Thanks @shubhamvishu, I opened https://github.com/apache/lucene/pull/13928.

Oct 18 '24 01:10 vsop-479
