lucene gh-13147: use dense bit-encoding for frequent terms

This is a proof-of-concept of encoding postings using dense bitmaps that encode both docs that have (with 1s) and don't have (with 0s) a term in them. It has various nocommits and it is still TBD whether this is worth pursuing. See #13147

Mar 02 '24 19:03 msokolov

Have you seen interesting performance numbers with this change?

Mar 13 '24 11:03 jpountz

right that last change seemed promising! But on luceneutil tasks it didn't show much impact, basically a regression to teh mean compared to the first revision I tested; possibly a slight negative trend overall. I'll paste the results here. I was considering creating a more targeted benchmark to exaggerate the impact using a field and query type where we would expect to see a benefit, if only as a debugging tool, but I haven't had a chance to do that yet. Another idea I had was to try to help branch prediction by using a specialized BlockDecoder or so rather than embedding if statements in every advance/nextDoc call. I'm also not sure whether writing full longs versus truncating the final long in the bitset and writing only as many bytes as needed (added in the recent revision) might be making some difference? Seems unlikely.

Mar 14 '24 14:03 msokolov

It seems to especially make phrase and span queries slower? Possibly the decoding of positions is not good? I could try restricting to fields without positions (which seem likely to be the ones that would benefit anyway)

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value                         [87/1978]
                      HighPhrase        4.62      (4.1%)        4.45      (3.9%)   -3.7% ( -11% -    4%) 0.004
                       MedPhrase       67.01      (1.9%)       65.37      (2.0%)   -2.4% (  -6% -    1%) 0.000
                          IntNRQ       15.92     (13.8%)       15.53     (13.2%)   -2.4% ( -25% -   28%) 0.569
                HighSloppyPhrase        3.60      (2.4%)        3.52      (2.8%)   -2.3% (  -7% -    2%) 0.005
                       LowPhrase        9.44      (2.5%)        9.23      (2.4%)   -2.2% (  -6% -    2%) 0.004
        AndHighHighDayTaxoFacets        2.92      (2.3%)        2.88      (3.5%)   -1.5% (  -7% -    4%) 0.108
                 LowSloppyPhrase       26.88      (2.5%)       26.48      (2.5%)   -1.5% (  -6% -    3%) 0.065                                               
                 MedSloppyPhrase       18.10      (3.2%)       17.85      (3.5%)   -1.4% (  -7% -    5%) 0.191                                               
           HighTermDayOfYearSort      156.73      (2.2%)      154.64      (2.8%)   -1.3% (  -6% -    3%) 0.096                                               
                      TermDTSort       69.58      (2.6%)       68.83      (2.9%)   -1.1% (  -6% -    4%) 0.216                                               
         AndHighMedDayTaxoFacets       15.06      (0.9%)       14.91      (1.6%)   -1.0% (  -3% -    1%) 0.015                                               
                     MedSpanNear       31.29      (1.5%)       30.97      (1.1%)   -1.0% (  -3% -    1%) 0.017                                               
            HighIntervalsOrdered        3.92      (4.2%)        3.88      (4.5%)   -1.0% (  -9% -    8%) 0.465                                               
             LowIntervalsOrdered        1.42      (3.0%)        1.41      (3.0%)   -0.9% (  -6% -    5%) 0.335   
                     LowSpanNear       14.23      (1.3%)       14.10      (0.8%)   -0.9% (  -2% -    1%) 0.009               
                    OrNotHighMed      207.12      (5.3%)      205.50      (5.8%)   -0.8% ( -11% -   10%) 0.655
                   OrNotHighHigh      126.12      (5.2%)      125.14      (5.8%)   -0.8% ( -11% -   10%) 0.655         
               HighTermTitleSort       79.33      (2.5%)       78.82      (2.0%)   -0.6% (  -5% -    4%) 0.376  
            BrowseDateTaxoFacets        2.73      (5.7%)        2.71      (6.9%)   -0.6% ( -12% -   12%) 0.752
     BrowseRandomLabelTaxoFacets        2.11     (11.5%)        2.10     (11.9%)   -0.6% ( -21% -   25%) 0.868
             MedIntervalsOrdered        3.17      (3.9%)        3.15      (4.4%)   -0.5% (  -8% -    8%) 0.694    
                    HighSpanNear        3.15      (3.1%)        3.14      (2.2%)   -0.5% (  -5% -    4%) 0.576
                       OrHighLow      247.95      (2.2%)      246.79      (1.9%)   -0.5% (  -4% -    3%) 0.467
                      AndHighLow      396.47      (2.2%)      394.68      (1.5%)   -0.4% (  -4% -    3%) 0.460
                          Fuzzy1       65.30      (1.5%)       65.01      (1.8%)   -0.4% (  -3% -    2%) 0.402
       BrowseDayOfYearTaxoFacets        2.77      (5.4%)        2.76      (6.2%)   -0.4% ( -11% -   11%) 0.817
                       OrHighMed      142.70      (2.1%)      142.13      (2.0%)   -0.4% (  -4% -    3%) 0.543
                    OrHighNotMed      279.04      (5.4%)      277.96      (6.2%)   -0.4% ( -11% -   11%) 0.832  
           BrowseMonthSSDVFacets        2.38      (2.7%)        2.37      (2.9%)   -0.4% (  -5% -    5%) 0.663    
                          Fuzzy2       50.49      (1.3%)       50.32      (1.4%)   -0.3% (  -3% -    2%) 0.420                                               
                      OrHighHigh       21.69      (2.4%)       21.62      (3.6%)   -0.3% (  -6% -    5%) 0.746
                         Respell       32.27      (2.4%)       32.19      (2.2%)   -0.2% (  -4% -    4%) 0.761
           BrowseMonthTaxoFacets        2.87      (1.4%)        2.86      (1.2%)   -0.2% (  -2% -    2%) 0.654                                               
                        PKLookup      142.74      (2.4%)      142.50      (1.9%)   -0.2% (  -4% -    4%) 0.809         
                      AndHighMed       53.54      (2.3%)       53.45      (2.2%)   -0.2% (  -4% -    4%) 0.829         
       BrowseDayOfYearSSDVFacets        2.33      (2.4%)        2.32      (2.7%)   -0.1% (  -5% -    5%) 0.872
            MedTermDayTaxoFacets        6.54      (4.4%)        6.53      (3.9%)   -0.1% (  -8% -    8%) 0.925
                    OrNotHighLow      248.01      (1.8%)      247.71      (1.7%)   -0.1% (  -3% -    3%) 0.825
     BrowseRandomLabelSSDVFacets        1.61      (2.9%)        1.61      (3.4%)   -0.1% (  -6% -    6%) 0.941
          OrHighMedDayTaxoFacets        1.26      (5.9%)        1.26      (6.7%)   -0.1% ( -11% -   13%) 0.978     
                        Wildcard      362.00      (2.1%)      362.14      (2.1%)    0.0% (  -4% -    4%) 0.953
            HighTermTitleBDVSort        3.22      (6.1%)        3.22      (6.6%)    0.1% ( -11% -   13%) 0.957
                   OrHighNotHigh      137.13      (6.5%)      137.32      (7.0%)    0.1% ( -12% -   14%) 0.948
                     AndHighHigh       13.35      (4.2%)       13.37      (4.5%)    0.2% (  -8% -    9%) 0.903
                    OrHighNotLow      198.64      (7.3%)      199.07      (7.7%)    0.2% ( -13% -   16%) 0.926
                         LowTerm      269.15      (2.5%)      270.97      (4.5%)    0.7% (  -6% -    7%) 0.558
                         Prefix3      231.85      (2.7%)      234.29      (1.8%)    1.1% (  -3% -    5%) 0.143
               HighTermMonthSort     1592.67      (3.3%)     1611.63      (3.0%)    1.2% (  -4% -    7%) 0.233
                         MedTerm      277.73      (5.0%)      282.57      (8.5%)    1.7% ( -11% -   15%) 0.428
                        HighTerm      192.36      (5.2%)      196.16      (9.1%)    2.0% ( -11% -   17%) 0.400
            BrowseDateSSDVFacets        0.68      (7.3%)        0.70      (5.6%)    2.2% (  -9% -   16%) 0.284

Mar 14 '24 15:03 msokolov

after disabling this for fields with positions, luceneutil perf looks pretty flat. I think it simply doesn't have any test cases that would exercise this. I wrote a small benchmark that indexes a lot of dense terms of different frequencies and then runs top 100 searches using randomly-generated boolean queries on them designed that exploit those fields. This does seem to show a significant difference: index size reduction of 25% and 30% improvement in QPS. This test is artificial; there's a fair amount of variance (although I used a fixed random seed and multiple runs), and the index is pretty small. But I think it shows promise, so I'll try to figure out how to get a wikipedia-based test in to luceneutil. Not sure what realistic data there is to support it in linedocs. Perhaps if we index Month and Year as docs-only fields we would see some impact on queries with those as filters?

Mar 18 '24 16:03 msokolov

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

Apr 02 '24 00:04 github-actions[bot]

Perhaps if we index Month and Year as docs-only fields we would see some impact on queries with those as filters?

+1 -- did that show any measurable performance change?

I'm very surprised it's hard to measure gains from a dense postings encoding! It ought to be super fast when it can apply: O(1) skipping time. Admittedly our postings APIs are designed for iteration so there is some API overhead due to that (you have to scan for next set bit, but, Java has fast primitives that get down to the silicon for this, right)?

May 10 '24 16:05 mikemccand

+1 -- did that show any measurable performance change?

Well, sort of -- I did index months to get some dense postings field and added "month tasks" by tacking on +month:Mar and so on to existing tasks. A few tasks created that way showed some improvement, but most showed no change or substantially worse performance. I concluded that it's hard to get the gains without getting in the way of "normal" postings advancement?

The only thing I've done that showed a clear improvement was a microbenchmark where I created an index composed entirely of these doc-only fields and queried it with random boolean queries involving terms in those fields.

May 10 '24 17:05 msokolov

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

May 26 '24 00:05 github-actions[bot]

lucene lucene copied to clipboard

gh-13147: use dense bit-encoding for frequent terms

lucene
lucene copied to clipboard