lucene Random access term dictionary

Description

Related issue https://github.com/apache/lucene/issues/12513

Opening this PR early to avoid massive diffs in one-shot

[x] Encode (term type, local ord) in FST
[x] Encode/Decode term states with bitpacking/unpacking.

TODO:

[x] Implement bit-packing and unpacking for each term type
[x] Implement the PostingsFormat

Oct 16 '23 22:10 Tony-X

I'll try to review this soon -- it sounds compelling @Tony-X! I like how it is inspired by Tantivy's term dictionary format (which holds all terms + their metadata in RAM).

Also, with the upcoming ability to cleanly limit how much RAM the FSTCompiler is allowed to use to reduce the size of the FST, this approach becomes more feasible. Without that change, the FST compilation might easily use excessive RAM during indexing when merging large segments.

Oct 19 '23 09:10 mikemccand

I'll also try to review! On the bit packing subject, I have some handy generic code (not in Lucene yet) to write and read variable size bits. Tell me if you are interested.

Oct 20 '23 11:10 bruno-roustant

Thanks @bruno-roustant ! If you're okay to share it feel free to share it here.

I'm in the process of baking my own specific implementation (which internally uses a single long as bit buffer), but I might absorb some interesting ideas from your impl.

Oct 20 '23 17:10 Tony-X

It seems like you have the low level encode/decode working? So all that remains is to hook that up with the Codec components that read/write the terms dict ... then you can test the Codec by passing -Dtests.codec= and Lucene will run all tests cases using your Codec.

Thanks for the tips! Yes, almost there. I'm working on the real compact bitpacker and unpacker. I still need to implement the PostingFormat afterwards. Do you think I need to implement a new Codec?

Nov 03 '23 18:11 Tony-X

Just realized that we have lucene99 Codec out! I'll update the code to reflect that as this posting format aims to work with the latest Codec.

Nov 06 '23 22:11 Tony-X

Thanks for the tips! Yes, almost there. I'm working on the real compact bitpacker and unpacker. I still need to implement the PostingFormat afterwards. Do you think I need to implement a new Codec?

You don't really need a new Codec -- you could test this easily by subclassing Lucene99Codec and overriding getPostingsFormatForField to use your new cool bit packing PostingsFormat.

Nov 13 '23 12:11 mikemccand

After some tweaking and tinkering I was finally able to get the bench running the way I wanted in luceneutil! @mikemccand

Unfortunately luceneutil out of the box doesn't work for my case...

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value                                    
                        Wildcard       56.20      (2.0%)        7.30      (0.2%)  -87.0% ( -87% -  -86%) 0.000
                         Respell       44.00      (1.7%)       14.46      (0.6%)  -67.1% ( -68% -  -65%) 0.000
                          Fuzzy1       54.11      (1.2%)       20.86      (0.8%)  -61.4% ( -62% -  -60%) 0.000
                         Prefix3      103.61      (0.8%)       41.37      (0.5%)  -60.1% ( -60% -  -59%) 0.000
                          Fuzzy2       42.43      (1.2%)       20.26      (0.9%)  -52.3% ( -53% -  -50%) 0.000
               HighTermTitleSort      127.60      (1.7%)      114.65      (1.6%)  -10.1% ( -13% -   -6%) 0.000
               HighTermMonthSort     2549.22      (2.8%)     2435.74      (3.8%)   -4.5% ( -10% -    2%) 0.037
                      AndHighLow      708.99      (2.3%)      678.15      (2.1%)   -4.4% (  -8% -    0%) 0.001
                         LowTerm      369.16      (5.5%)      358.15      (3.1%)   -3.0% ( -10% -    5%) 0.287
                    OrNotHighLow      258.51      (1.8%)      252.11      (2.2%)   -2.5% (  -6% -    1%) 0.054
                       OrHighLow      348.58      (1.1%)      340.11      (2.5%)   -2.4% (  -5% -    1%) 0.046
                   OrNotHighHigh      141.31      (6.7%)      138.54      (3.2%)   -2.0% ( -11% -    8%) 0.554
                          IntNRQ       17.93      (1.8%)       17.62      (3.7%)   -1.7% (  -7% -    3%) 0.349
                 MedSloppyPhrase       26.73      (0.8%)       26.28      (1.3%)   -1.7% (  -3% -    0%) 0.017
            HighIntervalsOrdered        3.88      (2.6%)        3.82      (2.3%)   -1.6% (  -6% -    3%) 0.314
                        HighTerm      290.77      (7.3%)      286.20      (6.9%)   -1.6% ( -14% -   13%) 0.727
                   OrHighNotHigh      296.51      (6.9%)      291.90      (4.9%)   -1.6% ( -12% -   11%) 0.682
             LowIntervalsOrdered       15.48      (1.1%)       15.26      (1.3%)   -1.4% (  -3% -    0%) 0.057
                         MedTerm      405.54      (6.3%)      399.77      (5.9%)   -1.4% ( -12% -   11%) 0.713
                 LowSloppyPhrase       44.64      (0.5%)       44.04      (2.1%)   -1.3% (  -3% -    1%) 0.171
           HighTermDayOfYearSort      177.36      (1.6%)      175.08      (1.8%)   -1.3% (  -4% -    2%) 0.234
                       OrHighMed       79.44      (3.7%)       78.48      (3.0%)   -1.2% (  -7% -    5%) 0.572
                    OrNotHighMed      268.82      (4.3%)      265.74      (3.1%)   -1.1% (  -8% -    6%) 0.632
           BrowseMonthTaxoFacets        3.89      (0.4%)        3.85      (1.2%)   -0.9% (  -2% -    0%) 0.134
         AndHighMedDayTaxoFacets       44.33      (0.7%)       43.96      (1.1%)   -0.8% (  -2% -    0%) 0.145
                     MedSpanNear       30.67      (1.1%)       30.42      (2.3%)   -0.8% (  -4% -    2%) 0.481
                     LowSpanNear        4.56      (0.8%)        4.53      (2.1%)   -0.6% (  -3% -    2%) 0.576
                    HighSpanNear        8.52      (1.4%)        8.47      (2.1%)   -0.5% (  -3% -    2%) 0.636
                    OrHighNotLow      236.32      (6.6%)      235.28      (4.4%)   -0.4% ( -10% -   11%) 0.901
        AndHighHighDayTaxoFacets        4.31      (0.6%)        4.29      (0.8%)   -0.4% (  -1% -    1%) 0.446
                       LowPhrase       69.90      (1.2%)       69.67      (2.7%)   -0.3% (  -4% -    3%) 0.807
                      AndHighMed       41.87      (0.6%)       41.75      (2.7%)   -0.3% (  -3% -    3%) 0.828
                      OrHighHigh       45.86      (7.4%)       45.77      (7.9%)   -0.2% ( -14% -   16%) 0.969
                      TermDTSort      101.90      (2.3%)      101.84      (2.0%)   -0.1% (  -4% -    4%) 0.966
                HighSloppyPhrase        0.48      (2.2%)        0.48      (2.6%)   -0.0% (  -4% -    4%) 0.995
            BrowseDateSSDVFacets        1.00      (3.7%)        1.00      (5.3%)   -0.0% (  -8% -    9%) 1.000
            HighTermTitleBDVSort        4.48      (5.4%)        4.49      (4.4%)    0.0% (  -9% -   10%) 0.990
       BrowseDayOfYearSSDVFacets        3.51     (13.4%)        3.52     (13.3%)    0.1% ( -23% -   30%) 0.990
           BrowseMonthSSDVFacets        3.38      (0.7%)        3.39      (0.6%)    0.1% (  -1% -    1%) 0.740
       BrowseDayOfYearTaxoFacets        3.85      (0.3%)        3.85      (0.5%)    0.2% (   0% -    0%) 0.469
             MedIntervalsOrdered        1.65      (1.5%)        1.65      (1.5%)    0.2% (  -2% -    3%) 0.802
            BrowseDateTaxoFacets        3.82      (0.3%)        3.83      (0.4%)    0.3% (   0% -    0%) 0.264
     BrowseRandomLabelSSDVFacets        2.34      (1.0%)        2.35      (0.5%)    0.3% (  -1% -    1%) 0.494
                    OrHighNotMed      296.34      (7.6%)      297.38      (4.6%)    0.4% ( -11% -   13%) 0.930
     BrowseRandomLabelTaxoFacets        3.29      (0.6%)        3.31      (0.7%)    0.5% (   0% -    1%) 0.275
                       MedPhrase        9.84      (2.1%)        9.90      (4.1%)    0.6% (  -5% -    6%) 0.787
                     AndHighHigh       33.50      (1.2%)       33.73      (3.4%)    0.7% (  -3% -    5%) 0.669
                      HighPhrase       15.15      (2.3%)       15.28      (4.1%)    0.9% (  -5% -    7%) 0.679
            MedTermDayTaxoFacets       13.36      (1.8%)       13.55      (2.0%)    1.4% (  -2% -    5%) 0.240
          OrHighMedDayTaxoFacets        3.73      (1.4%)        3.83      (2.2%)    2.6% (   0% -    6%) 0.026
                        PKLookup      147.84      (1.6%)      157.37      (1.5%)    6.4% (   3% -    9%) 0.000

Observations:

PKLookup has improvement

This is reasonable as the terms index (FST) holds all the terms.

Fuzzy/Wildcard/Prefix queries got much slower

This is also expected because currently I used the default implementation provided by TermsEnum which does not take advantage of the FST. With an optimized implementation I expect it to at least be on-par and slightly better because the FST holds information about all terms, whereas the current BlockTreeTerms only holds prefixes.

`HighTermTitleSort` and `HighTermMonthSort` got about 4.5% ~ 10% less throughput

I don't quite understand why term lookup could affect sorting on a DV field

`AndHighLow` got slower

Am i missing some optimization opportunity for low freq terms?

Nov 24 '23 07:11 Tony-X

This is reasonable as the terms index (FST) holds all the terms.

+1, nice!

Fuzzy/Wildcard/Prefix queries got much slower

This is also expected because currently I used the default implementation provided by TermsEnum which does not take advantage of the FST. With an optimized implementation I expect it to at least be on-par and slightly better because the FST holds information about all terms, whereas the current BlockTreeTerms only holds prefixes.

OK this makes sense, and it is a (sad) measure of how slow the emulated (on top of seekCeil) .intersect TermsEnum is. Once you have an optimized version it should likely be faster than block tree since it can intersect all suffixes instead of scanning byte[] suffixes in the term block and re-testing each.

HighTermTitleSort and HighTermMonthSort got about 4.5% ~ 10% less throughput

I don't quite understand why term lookup could affect sorting on a DV field

This is odd. Though, the HighTermMonthSort QPS is so crazy high as to not really be trustworthy -- likely BMW is kicking in and saving tons of work.

AndHighLow got slower

Am i missing some optimization opportunity for low freq terms?

Hmm maybe pulsing? Are we still inlining single-occurrence terms directly into the terms dict with your new terms dict?

Nov 28 '23 16:11 mikemccand

Since the first working version, I iterated with a list of profiling-guided allocation optimizations, as they stood out quite obviously from the merged JFR reports (thanks to luceneutil !).

Some of them comes from my code that implements the term dictionary data lookup, and a few of them are at more general Lucene level. I want to highlight the general issue I see from this work and maybe we can have separate issues to improve them!

Here is the first heap profile comparison (search-only, no indexing).

Candidate Heap
17.50%        24440M        java.lang.Long#valueOf()
10.09%        14096M        jdk.internal.misc.Unsafe#allocateUninitializedArray()
6.87%         9594M         org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
4.40%         6140M         org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegmentNHLD()
...

main
13.65%        11898M        org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
9.26%         8071M         org.apache.lucene.util.FixedBitSet#<init>()
6.70%         5836M         org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegmentNHLD()
6.60%         5751M         org.apache.lucene.util.ArrayUtil#growExact()
5.21%         4541M         org.apache.lucene.facet.FacetsConfig#stringToPath()
4.69%         4090M         org.apache.lucene.util.DocIdSetBuilder$Buffer#<init>()

FST doesn't play nicely with primitive types (I know, this is more or less a java issue)

24440M java.lang.Long#valueOf() huge amount of allocations... This is obvious. The FST<T> implementation is generic over its output type and in my case T is Long. So for trivial long add and subtract, the implementation would allocate an object. Not only it is wasteful but from a perf perspective it'd be less than 1 CPU cycle v.s. calling allocator which is easily tens if not hundreds of cycles.

For this work, I forked the FST<T> class and manually templated it with long just to see how much difference it makes. Here is a diff in heap profile and bench results before and after.

Before 

PERCENT       HEAP SAMPLES  STACK                                                                                                                              
25.97%        32791M        java.lang.Long#valueOf()
7.58%         9571M         org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
5.13%         6482M         org.apache.lucene.util.FixedBitSet#<init>()
4.90%     
....

After
PERCENT       HEAP SAMPLES  STACK
8.44%         7988M         org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
7.17%         6788M         org.apache.lucene.util.FixedBitSet#<init>()
6.22%         5886M         org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegmentNHLD()
5.89%         5577M         org.apache.lucene.util.ArrayUtil#growExact()

Bench diff

Before

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                        Wildcard       11.61      (2.7%)        2.40      (0.6%)  -79.4% ( -80% -  -78%) 0.000
                          Fuzzy1       78.17      (0.7%)       27.16      (0.9%)  -65.3% ( -66% -  -64%) 0.000
                         Respell       29.09      (0.6%)       10.91      (0.8%)  -62.5% ( -63% -  -61%) 0.000
                          Fuzzy2       47.80      (0.6%)       18.50      (1.2%)  -61.3% ( -62% -  -59%) 0.000
                         Prefix3      765.08      (3.1%)      463.94      (0.9%)  -39.4% ( -42% -  -36%) 0.000
               HighTermTitleSort       98.48      (2.0%)       90.62      (2.2%)   -8.0% ( -11% -   -3%) 0.000
           BrowseMonthTaxoFacets        3.89     (29.2%)        3.62      (0.9%)   -6.9% ( -28% -   32%) 0.293
                 LowSloppyPhrase       22.73      (6.5%)       22.35      (6.9%)   -1.7% ( -14% -   12%) 0.432
                         LowTerm      365.47      (3.6%)      359.62      (2.9%)   -1.6% (  -7% -    5%) 0.121
                        HighTerm      398.57      (5.1%)      393.16      (4.7%)   -1.4% ( -10% -    8%) 0.380
                 MedSloppyPhrase       10.63      (3.6%)       10.51      (3.7%)   -1.1% (  -8% -    6%) 0.339
                         MedTerm      422.73      (4.2%)      418.60      (4.0%)   -1.0% (  -8% -    7%) 0.451
            MedTermDayTaxoFacets       14.84      (2.6%)       14.71      (2.5%)   -0.8% (  -5% -    4%) 0.296
                HighSloppyPhrase       12.41      (3.1%)       12.33      (3.1%)   -0.7% (  -6% -    5%) 0.487
            HighTermTitleBDVSort        6.88      (3.3%)        6.84      (3.5%)   -0.6% (  -7% -    6%) 0.599
                       LowPhrase       58.15      (2.9%)       57.85      (2.8%)   -0.5% (  -6% -    5%) 0.567
       BrowseDayOfYearSSDVFacets        3.24      (0.4%)        3.23      (0.5%)   -0.3% (  -1% -    0%) 0.042
                       MedPhrase       26.19      (3.1%)       26.11      (3.2%)   -0.3% (  -6% -    6%) 0.775
                    OrNotHighMed      185.23      (3.9%)      184.73      (3.3%)   -0.3% (  -7% -    7%) 0.813
          OrHighMedDayTaxoFacets        3.82      (3.3%)        3.81      (3.2%)   -0.3% (  -6% -    6%) 0.796
                    OrHighNotLow      194.98      (5.1%)      194.51      (4.6%)   -0.2% (  -9% -   10%) 0.875
                    OrHighNotMed      337.15      (4.4%)      336.53      (3.8%)   -0.2% (  -7% -    8%) 0.888
                          IntNRQ       67.60      (0.9%)       67.55      (1.0%)   -0.1% (  -1% -    1%) 0.783
                     MedSpanNear        9.85      (1.4%)        9.84      (2.1%)   -0.1% (  -3% -    3%) 0.906
                   OrNotHighHigh      205.12      (4.1%)      205.01      (3.9%)   -0.1% (  -7% -    8%) 0.967
        AndHighHighDayTaxoFacets        6.35      (1.5%)        6.34      (1.7%)   -0.0% (  -3% -    3%) 0.932
           BrowseMonthSSDVFacets        3.29      (0.8%)        3.29      (0.7%)   -0.0% (  -1% -    1%) 0.887
     BrowseRandomLabelSSDVFacets        2.30      (0.8%)        2.30      (1.0%)    0.0% (  -1% -    1%) 0.919
                     LowSpanNear       16.41      (2.6%)       16.42      (2.7%)    0.1% (  -5% -    5%) 0.931
                      HighPhrase       77.12      (3.0%)       77.20      (3.6%)    0.1% (  -6% -    6%) 0.923
         AndHighMedDayTaxoFacets       39.64      (1.2%)       39.68      (1.0%)    0.1% (  -2% -    2%) 0.742
     BrowseRandomLabelTaxoFacets        3.19      (1.6%)        3.19      (1.1%)    0.1% (  -2% -    2%) 0.728
            BrowseDateTaxoFacets        3.73      (0.7%)        3.74      (0.5%)    0.3% (   0% -    1%) 0.157
                     AndHighHigh       27.08      (1.3%)       27.15      (3.0%)    0.3% (  -4% -    4%) 0.718
       BrowseDayOfYearTaxoFacets        3.76      (0.6%)        3.77      (0.5%)    0.3% (   0% -    1%) 0.072
           HighTermDayOfYearSort      224.01      (2.1%)      224.81      (2.1%)    0.4% (  -3% -    4%) 0.592
                    HighSpanNear        6.09      (2.7%)        6.11      (3.1%)    0.4% (  -5% -    6%) 0.683
            HighIntervalsOrdered        8.08      (3.3%)        8.11      (3.4%)    0.4% (  -6% -    7%) 0.705
                      TermDTSort      103.29      (4.4%)      103.83      (3.1%)    0.5% (  -6% -    8%) 0.669
             MedIntervalsOrdered       33.12      (4.4%)       33.29      (4.6%)    0.5% (  -8% -    9%) 0.702
             LowIntervalsOrdered       10.06      (3.9%)       10.12      (3.6%)    0.6% (  -6% -    8%) 0.609
                      AndHighMed       73.71      (2.2%)       74.18      (2.5%)    0.6% (  -3% -    5%) 0.394
                       OrHighMed       71.44      (2.7%)       71.98      (3.3%)    0.7% (  -5% -    6%) 0.429
            BrowseDateSSDVFacets        0.96      (4.8%)        0.97      (5.7%)    0.9% (  -9% -   11%) 0.601
                   OrHighNotHigh      308.82      (4.0%)      311.53      (3.7%)    0.9% (  -6% -    8%) 0.470
                       OrHighLow      404.69      (3.0%)      408.63      (3.5%)    1.0% (  -5% -    7%) 0.348
                      OrHighHigh       20.44      (4.7%)       20.73      (7.2%)    1.4% ( -10% -   13%) 0.469
                    OrNotHighLow      381.28      (1.8%)      388.18      (2.1%)    1.8% (  -2% -    5%) 0.004
               HighTermMonthSort     2500.04      (2.2%)     2554.91      (4.3%)    2.2% (  -4% -    8%) 0.042
                      AndHighLow      668.12      (3.1%)      692.04      (3.9%)    3.6% (  -3% -   10%) 0.001
                        PKLookup      140.25      (2.0%)      168.53      (1.9%)   20.2% (  15% -   24%) 0.000

After

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                        Wildcard       54.96      (2.6%)       10.43      (0.5%)  -81.0% ( -82% -  -80%) 0.000
                         Respell       45.54      (1.0%)       16.74      (0.7%)  -63.2% ( -64% -  -62%) 0.000
                          Fuzzy1       46.41      (1.2%)       17.26      (1.0%)  -62.8% ( -64% -  -61%) 0.000
                         Prefix3      121.65      (2.4%)       55.57      (0.9%)  -54.3% ( -56% -  -52%) 0.000
                          Fuzzy2       32.33      (1.2%)       15.79      (1.1%)  -51.2% ( -52% -  -49%) 0.000
               HighTermTitleSort       95.24      (2.1%)       87.04      (1.9%)   -8.6% ( -12% -   -4%) 0.000
     BrowseRandomLabelSSDVFacets        2.37      (7.1%)        2.33      (4.8%)   -1.7% ( -12% -   10%) 0.374
           BrowseMonthSSDVFacets        3.34      (7.3%)        3.29      (0.6%)   -1.5% (  -8% -    6%) 0.362
                      TermDTSort      120.57      (2.3%)      119.02      (3.4%)   -1.3% (  -6% -    4%) 0.163
                      OrHighHigh       19.13      (5.6%)       18.92      (2.7%)   -1.1% (  -8% -    7%) 0.430
                     AndHighHigh       22.04      (5.1%)       21.87      (3.0%)   -0.8% (  -8% -    7%) 0.555
                      AndHighMed       55.06      (3.0%)       54.79      (2.1%)   -0.5% (  -5% -    4%) 0.546
                    HighSpanNear        3.29      (1.6%)        3.28      (1.7%)   -0.5% (  -3% -    2%) 0.346
            HighIntervalsOrdered        0.65      (1.8%)        0.65      (2.0%)   -0.5% (  -4% -    3%) 0.433
           HighTermDayOfYearSort      282.86      (2.0%)      281.57      (2.6%)   -0.5% (  -4% -    4%) 0.533
             MedIntervalsOrdered       16.36      (1.5%)       16.29      (1.5%)   -0.4% (  -3% -    2%) 0.369
                       OrHighMed       68.27      (2.9%)       67.99      (1.8%)   -0.4% (  -5% -    4%) 0.598
                     MedSpanNear        3.22      (1.0%)        3.21      (1.4%)   -0.4% (  -2% -    2%) 0.317
                HighSloppyPhrase        9.59      (2.5%)        9.57      (2.6%)   -0.3% (  -5% -    4%) 0.733
           BrowseMonthTaxoFacets        3.64      (2.4%)        3.63      (1.8%)   -0.2% (  -4% -    4%) 0.756
             LowIntervalsOrdered       14.66      (0.9%)       14.63      (1.5%)   -0.2% (  -2% -    2%) 0.633
            MedTermDayTaxoFacets       15.56      (2.7%)       15.54      (3.9%)   -0.2% (  -6% -    6%) 0.879
         AndHighMedDayTaxoFacets       18.70      (1.4%)       18.67      (3.7%)   -0.2% (  -5% -    5%) 0.864
                     LowSpanNear        4.39      (1.1%)        4.38      (1.4%)   -0.1% (  -2% -    2%) 0.728
          OrHighMedDayTaxoFacets        5.38      (3.5%)        5.38      (5.4%)   -0.1% (  -8% -    9%) 0.945
        AndHighHighDayTaxoFacets        7.06      (1.6%)        7.06      (3.0%)   -0.1% (  -4% -    4%) 0.924
                 LowSloppyPhrase        7.16      (1.4%)        7.15      (1.6%)   -0.1% (  -2% -    2%) 0.891
                 MedSloppyPhrase      128.54      (1.9%)      128.56      (2.2%)    0.0% (  -4% -    4%) 0.979
                         LowTerm      417.80      (3.3%)      418.01      (2.7%)    0.1% (  -5% -    6%) 0.958
                       LowPhrase      125.59      (4.0%)      125.77      (3.1%)    0.1% (  -6% -    7%) 0.900
                       OrHighLow      313.22      (2.1%)      313.72      (2.2%)    0.2% (  -4% -    4%) 0.817
            BrowseDateTaxoFacets        3.73      (0.7%)        3.74      (0.7%)    0.2% (  -1% -    1%) 0.470
       BrowseDayOfYearTaxoFacets        3.76      (0.7%)        3.76      (0.7%)    0.2% (  -1% -    1%) 0.457
                         MedTerm      384.57      (4.6%)      385.44      (3.6%)    0.2% (  -7% -    8%) 0.863
                   OrHighNotHigh      255.07      (4.3%)      256.05      (4.3%)    0.4% (  -7% -    9%) 0.778
                       MedPhrase       11.17      (3.0%)       11.21      (2.6%)    0.4% (  -5% -    6%) 0.658
                        HighTerm      361.26      (5.1%)      362.86      (4.2%)    0.4% (  -8% -   10%) 0.764
     BrowseRandomLabelTaxoFacets        3.19      (1.5%)        3.20      (0.6%)    0.5% (  -1% -    2%) 0.203
                   OrNotHighHigh      205.38      (4.0%)      206.35      (4.0%)    0.5% (  -7% -    8%) 0.712
                    OrNotHighLow      317.96      (1.7%)      319.48      (2.1%)    0.5% (  -3% -    4%) 0.428
                      HighPhrase       47.91      (3.8%)       48.15      (3.3%)    0.5% (  -6% -    7%) 0.661
            BrowseDateSSDVFacets        0.97      (6.9%)        0.98      (6.7%)    0.5% ( -12% -   15%) 0.801
                    OrHighNotLow      185.96      (4.9%)      187.04      (5.0%)    0.6% (  -8% -   11%) 0.710
       BrowseDayOfYearSSDVFacets        3.21      (2.1%)        3.23      (0.9%)    0.6% (  -2% -    3%) 0.225
            HighTermTitleBDVSort        5.83      (3.7%)        5.87      (4.0%)    0.7% (  -6% -    8%) 0.584
                    OrNotHighMed      516.84      (2.5%)      520.76      (2.5%)    0.8% (  -4% -    5%) 0.334
                          IntNRQ       29.24      (3.0%)       29.50      (4.1%)    0.9% (  -6% -    8%) 0.425
                    OrHighNotMed      268.45      (4.4%)      270.92      (4.2%)    0.9% (  -7% -    9%) 0.501
               HighTermMonthSort     2498.46      (4.8%)     2590.43      (3.7%)    3.7% (  -4% -   12%) 0.007
                      AndHighLow      747.94      (2.1%)      775.60      (4.0%)    3.7% (  -2% -   10%) 0.000
                        PKLookup      141.68      (2.0%)      177.85      (1.5%)   25.5% (  21% -   29%) 0.000

Dec 15 '23 06:12 Tony-X

Non-trivial amount of allocations for? .... building IndexInput slice descriptions !?

jdk.internal.misc.Unsafe#allocateUninitializedArray(). This was not trivial to find out why. But again with the raw JFR report, we can analyze the call tree. It turn out that in the buildSlice() implementation of MemorySegmentIndexInput we call IndexInput#getFullSliceDescription() which creates new String. And allocateUninitializedArray is called to allocate the bytes for the String.

AFAIK, the description is only used for debugging and tracking purposes. I didn't expect it'd cause that much of allocation. So I made a change to pass null when building the description so those allocations are gone.

Before 

PERCENT       HEAP SAMPLES  STACK
10.39%        12103M        java.lang.Long#valueOf()
9.91%         11543M        org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
8.91%         10383M        jdk.internal.misc.Unsafe#allocateUninitializedArray()

After
PERCENT       HEAP SAMPLES  STACK                                                                                                                              [37/1812]
25.97%        32791M        java.lang.Long#valueOf()
7.58%         9571M         org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
5.13%         6482M         org.apache.lucene.util.FixedBitSet#<init>()

Dec 15 '23 07:12 Tony-X

Here is the even more interesting stuff. After all those allocation optimizations. I also implemented the on-paper more "efficient" algorithm to intersect FST and FSA for Terms.intersect(), which leverages the sorted nature of the FST arcs and FSA transitions from a given state (so at least we can binary search to advance with some skipping). FST in some cases have direct addressing which is exploited, too. As a side note -- it also uncovered a bug for the NFA impl which is tracked here https://github.com/apache/lucene/issues/12906.

But that change is not moving the needle at all for WildCard and Prefix3 search tasks.

Before 
                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                        Wildcard       62.85      (1.8%)       12.66      (0.6%)  -79.9% ( -80% -  -78%) 0.000
                          Fuzzy2       55.12      (1.0%)       18.92      (0.8%)  -65.7% ( -66% -  -64%) 0.000
                          Fuzzy1       61.20      (0.8%)       22.55      (0.8%)  -63.2% ( -64% -  -62%) 0.000
                         Respell       31.11      (0.8%)       11.95      (0.6%)  -61.6% ( -62% -  -60%) 0.000
                         Prefix3      135.69      (2.0%)       65.06      (0.7%)  -52.1% ( -53% -  -50%) 0.000
               HighTermTitleSort      119.58      (0.9%)      111.03      (1.7%)   -7.2% (  -9% -   -4%) 0.000
                          IntNRQ       22.25      (1.1%)       21.87      (1.5%)   -1.7% (  -4% -    0%) 0.000
                      HighPhrase       25.82      (3.6%)       25.55      (3.2%)   -1.1% (  -7% -    5%) 0.318
                       MedPhrase        7.41      (2.4%)        7.35      (2.2%)   -0.8% (  -5% -    3%) 0.259
                     LowSpanNear        8.81      (1.9%)        8.74      (2.1%)   -0.8% (  -4% -    3%) 0.202
          OrHighMedDayTaxoFacets        3.86      (5.8%)        3.83      (4.8%)   -0.8% ( -10% -   10%) 0.636
                      TermDTSort      100.75      (2.9%)       99.98      (2.0%)   -0.8% (  -5% -    4%) 0.336
            HighIntervalsOrdered        6.07      (2.1%)        6.03      (2.4%)   -0.7% (  -5% -    3%) 0.342
             MedIntervalsOrdered       45.89      (2.0%)       45.61      (2.4%)   -0.6% (  -4% -    3%) 0.389
                    HighSpanNear       10.73      (1.0%)       10.66      (1.5%)   -0.6% (  -3% -    2%) 0.165
           HighTermDayOfYearSort      206.09      (1.8%)      204.93      (1.9%)   -0.6% (  -4% -    3%) 0.338
             LowIntervalsOrdered        8.39      (2.3%)        8.37      (2.5%)   -0.3% (  -5% -    4%) 0.654
                     MedSpanNear       66.00      (1.3%)       65.81      (1.9%)   -0.3% (  -3% -    2%) 0.574
                         MedTerm      322.61      (4.7%)      321.89      (4.5%)   -0.2% (  -9% -    9%) 0.878
         AndHighMedDayTaxoFacets       22.62      (1.0%)       22.58      (1.2%)   -0.2% (  -2% -    2%) 0.617
                       LowPhrase       48.52      (1.3%)       48.46      (1.4%)   -0.1% (  -2% -    2%) 0.745
                         LowTerm      403.54      (2.9%)      403.22      (2.4%)   -0.1% (  -5% -    5%) 0.923
     BrowseRandomLabelTaxoFacets        3.20      (0.7%)        3.20      (0.9%)   -0.0% (  -1% -    1%) 0.905
        AndHighHighDayTaxoFacets        8.06      (1.3%)        8.06      (1.6%)    0.0% (  -2% -    2%) 0.962
       BrowseDayOfYearTaxoFacets        3.76      (0.6%)        3.76      (0.6%)    0.0% (  -1% -    1%) 0.859
           BrowseMonthTaxoFacets        3.62      (1.0%)        3.62      (1.0%)    0.1% (  -1% -    2%) 0.866
                   OrHighNotHigh      156.16      (6.0%)      156.26      (5.9%)    0.1% ( -11% -   12%) 0.972
            BrowseDateTaxoFacets        3.73      (0.6%)        3.73      (0.6%)    0.1% (  -1% -    1%) 0.722
                   OrNotHighHigh      144.55      (5.1%)      144.68      (4.7%)    0.1% (  -9% -   10%) 0.957
            MedTermDayTaxoFacets       17.57      (2.8%)       17.59      (2.8%)    0.2% (  -5% -    5%) 0.863
       BrowseDayOfYearSSDVFacets        3.22      (0.9%)        3.23      (0.7%)    0.2% (  -1% -    1%) 0.424
                        HighTerm      401.13      (5.4%)      402.30      (5.7%)    0.3% ( -10% -   12%) 0.868
           BrowseMonthSSDVFacets        3.27      (0.7%)        3.28      (1.0%)    0.3% (  -1% -    2%) 0.202
                     AndHighHigh       20.87      (2.8%)       20.94      (2.5%)    0.4% (  -4% -    5%) 0.670
                HighSloppyPhrase        6.58      (3.3%)        6.61      (3.4%)    0.4% (  -6% -    7%) 0.727
                      AndHighMed       89.34      (1.8%)       89.76      (1.4%)    0.5% (  -2% -    3%) 0.355
            BrowseDateSSDVFacets        0.95      (2.8%)        0.95      (4.0%)    0.5% (  -6% -    7%) 0.656
                    OrNotHighLow      420.17      (2.2%)      422.34      (2.2%)    0.5% (  -3% -    4%) 0.452
                 LowSloppyPhrase        2.89      (2.4%)        2.91      (1.9%)    0.6% (  -3% -    4%) 0.369
                    OrNotHighMed      219.50      (4.5%)      221.02      (4.1%)    0.7% (  -7% -    9%) 0.611
                 MedSloppyPhrase       10.44      (2.2%)       10.52      (1.4%)    0.7% (  -2% -    4%) 0.222
                    OrHighNotLow      288.48      (5.4%)      290.70      (5.7%)    0.8% (  -9% -   12%) 0.663
                       OrHighMed       53.25      (3.6%)       53.66      (3.5%)    0.8% (  -6% -    8%) 0.488
            HighTermTitleBDVSort        2.77      (7.2%)        2.79      (7.0%)    0.9% ( -12% -   16%) 0.699
                    OrHighNotMed      270.38      (5.7%)      272.88      (5.4%)    0.9% (  -9% -   12%) 0.601
                      OrHighHigh       20.86      (5.1%)       21.08      (5.2%)    1.1% (  -8% -   12%) 0.519
                       OrHighLow      220.40      (4.2%)      223.52      (5.6%)    1.4% (  -8% -   11%) 0.367
     BrowseRandomLabelSSDVFacets        2.32      (4.3%)        2.38      (7.6%)    2.3% (  -9% -   14%) 0.240
                      AndHighLow      395.82      (2.9%)      405.36      (2.8%)    2.4% (  -3% -    8%) 0.008
               HighTermMonthSort     2375.71      (3.7%)     2555.72      (5.0%)    7.6% (  -1% -   16%) 0.000
                        PKLookup      140.60      (1.8%)      178.02      (1.3%)   26.6% (  23% -   30%) 0.000


After
                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                        Wildcard       37.77      (2.7%)        5.70      (1.3%)  -84.9% ( -86% -  -83%) 0.000
                         Prefix3       52.27      (2.7%)       22.71      (1.8%)  -56.6% ( -59% -  -53%) 0.000
                          Fuzzy1       59.96      (1.6%)       54.80      (2.3%)   -8.6% ( -12% -   -4%) 0.000
               HighTermTitleSort      106.17      (2.1%)      101.30      (1.5%)   -4.6% (  -7% -   -1%) 0.000
                          Fuzzy2       33.40      (1.3%)       32.14      (1.6%)   -3.8% (  -6% -    0%) 0.000
                         MedTerm      273.84      (5.4%)      265.92      (9.7%)   -2.9% ( -17% -   12%) 0.245
                        HighTerm      349.66      (5.2%)      341.63      (8.9%)   -2.3% ( -15% -   12%) 0.320
                         LowTerm      356.23      (3.1%)      350.27      (4.3%)   -1.7% (  -8% -    5%) 0.156
                 LowSloppyPhrase        4.47      (2.1%)        4.44      (4.6%)   -0.8% (  -7% -    6%) 0.492
                    HighSpanNear        8.12      (2.1%)        8.06      (2.5%)   -0.7% (  -5% -    4%) 0.331
                 MedSloppyPhrase       31.05      (3.6%)       30.83      (4.1%)   -0.7% (  -8% -    7%) 0.559
             MedIntervalsOrdered        3.88      (3.3%)        3.86      (3.3%)   -0.7% (  -7% -    6%) 0.519
                     MedSpanNear        8.94      (1.2%)        8.88      (1.6%)   -0.7% (  -3% -    2%) 0.126
             LowIntervalsOrdered        7.40      (3.3%)        7.35      (3.4%)   -0.7% (  -7% -    6%) 0.537
                     LowSpanNear       29.33      (2.0%)       29.15      (2.2%)   -0.6% (  -4% -    3%) 0.374
                HighSloppyPhrase        6.68      (3.6%)        6.64      (3.5%)   -0.5% (  -7% -    6%) 0.624
            MedTermDayTaxoFacets        9.14      (3.0%)        9.11      (8.2%)   -0.3% ( -11% -   11%) 0.861
                      HighPhrase      115.62      (3.9%)      115.24      (4.0%)   -0.3% (  -7% -    7%) 0.798                                                [162/1927]
                     AndHighHigh       13.95      (4.2%)       13.92      (4.5%)   -0.3% (  -8% -    8%) 0.847
           BrowseMonthSSDVFacets        3.30      (0.8%)        3.29      (1.0%)   -0.3% (  -2% -    1%) 0.377
                      AndHighMed       85.40      (2.0%)       85.18      (2.1%)   -0.2% (  -4% -    3%) 0.695
                          IntNRQ       16.65      (4.1%)       16.63      (3.8%)   -0.1% (  -7% -    8%) 0.914
     BrowseRandomLabelSSDVFacets        2.30      (0.9%)        2.30      (1.1%)   -0.1% (  -2% -    1%) 0.754
                      OrHighHigh       24.99      (6.1%)       24.97      (5.3%)   -0.1% ( -10% -   12%) 0.957
        AndHighHighDayTaxoFacets        2.29      (2.8%)        2.29      (2.4%)   -0.0% (  -5% -    5%) 0.977
         AndHighMedDayTaxoFacets       40.17      (1.4%)       40.20      (1.4%)    0.1% (  -2% -    2%) 0.872
          OrHighMedDayTaxoFacets        3.15      (3.9%)        3.15      (3.4%)    0.1% (  -6% -    7%) 0.946
                       LowPhrase       30.23      (2.5%)       30.26      (2.4%)    0.1% (  -4% -    5%) 0.911
                    OrNotHighLow      201.78      (2.9%)      202.03      (3.0%)    0.1% (  -5% -    6%) 0.896
     BrowseRandomLabelTaxoFacets        3.20      (3.2%)        3.21      (4.1%)    0.1% (  -6% -    7%) 0.899
            HighIntervalsOrdered        0.42      (1.9%)        0.42      (1.6%)    0.2% (  -3% -    3%) 0.699
                    OrNotHighMed      235.49      (5.5%)      236.01      (4.7%)    0.2% (  -9% -   11%) 0.892
           BrowseMonthTaxoFacets        3.62      (1.0%)        3.63      (1.0%)    0.2% (  -1% -    2%) 0.477
                   OrNotHighHigh      329.77      (4.9%)      330.79      (5.5%)    0.3% (  -9% -   11%) 0.851
                       MedPhrase       35.79      (3.4%)       35.90      (3.4%)    0.3% (  -6% -    7%) 0.771
                      TermDTSort      112.10      (3.2%)      112.45      (3.4%)    0.3% (  -6% -    7%) 0.763
            BrowseDateSSDVFacets        0.97      (7.1%)        0.98     (10.0%)    0.4% ( -15% -   18%) 0.897
       BrowseDayOfYearSSDVFacets        3.21      (2.2%)        3.22      (1.6%)    0.4% (  -3% -    4%) 0.525
           HighTermDayOfYearSort      235.24      (2.1%)      236.16      (1.6%)    0.4% (  -3% -    4%) 0.512
                       OrHighMed       70.60      (3.3%)       70.99      (2.7%)    0.5% (  -5% -    6%) 0.571
                      AndHighLow      370.31      (3.2%)      372.60      (3.4%)    0.6% (  -5% -    7%) 0.559
            HighTermTitleBDVSort        5.53      (4.1%)        5.56      (4.5%)    0.6% (  -7% -    9%) 0.648
                    OrHighNotLow      263.18      (5.6%)      264.95      (6.2%)    0.7% ( -10% -   13%) 0.717
                    OrHighNotMed      222.41      (5.8%)      224.06      (5.8%)    0.7% ( -10% -   13%) 0.688
                   OrHighNotHigh      233.04      (5.5%)      234.89      (5.8%)    0.8% (  -9% -   12%) 0.657
                       OrHighLow      463.17      (3.0%)      466.91      (3.1%)    0.8% (  -5% -    7%) 0.403
       BrowseDayOfYearTaxoFacets        3.77      (0.6%)        3.84      (8.9%)    1.9% (  -7% -   11%) 0.342
            BrowseDateTaxoFacets        3.74      (0.6%)        3.81      (8.8%)    1.9% (  -7% -   11%) 0.332
               HighTermMonthSort     2350.73      (4.0%)     2477.32      (4.5%)    5.4% (  -2% -   14%) 0.000
                         Respell       30.81      (1.5%)       34.87      (1.7%)   13.2% (   9% -   16%) 0.000
                        PKLookup      141.03      (1.9%)      177.54      (2.0%)   25.9% (  21% -   30%) 0.000

I tried to modify the bench task file and only run WildCard to understand where the time is spent.

My version

mainline

So we can see that the most time is spent in actually reading out the FST arcs and FSA transitions... My intuitive explanation for why this is slower than the blocktree is that it has worse locality in its data access pattern. (@mikemccand maybe you can shed some light) Here are some relevant factors:

The FST is larger as it contains all terms. So there are more Arcs to visit. Blocktree (main) use the FST to index prefixes.
When binary-searching or directly address Arc/Transitions the target is somewhat random.
The FST bytes are read backwards. (probably less of an issue if we read sequentially on modern HW)
Blocktree at a given node reads bytes sequentially and terms are sorted, too.

Just out of curiosity I altered my code to load the FST on-heap to compare with the default off-heap option. It did not help much with Wildcard but PKLookup got substantially faster!

The PKLookup task is a great proxy to FST performance, as both versions of the code visits the exact same number of Arcs.


Off heap
                        Wildcard       47.56      (1.7%)       10.13      (0.4%)  -78.7% ( -79% -  -77%) 0.000
                        PKLookup      136.03      (2.4%)      147.93      (2.3%)    8.8% (   3% -   13%) 0.000

on heap
                        Wildcard       37.11      (1.5%)        8.35      (0.3%)  -77.5% ( -78% -  -76%) 0.000
                        PKLookup      136.04      (3.3%)      269.60      (9.0%)   98.2% (  83% -  114%) 0.000

Dec 15 '23 07:12 Tony-X

Theres are very interesting results @Tony-X! I'll try to give deeper response soon, but one idea that jumped out about Wildcard is that BlockTree somewhere takes advantage of commonSuffixBytes or so? This is a BytesRef that is non-empty when all strings matched by the Automaton share some common suffix, as would happen with a wildcard query like ab*cd (cd would be the common suffix).

Maybe block tree is able to use this opto more effectively than the "all terms in FST" approach? But I think you could implement such an opto too, maybe: just find the one node (or, maybe more than one) whose suffix is the common suffix, and fast-check somehow?

Dec 18 '23 14:12 mikemccand

The PKLookup gains are astounding!

Especially interesting is the off -> on heap gains for that task. We are somehow paying a high price for going through Lucene's IO APIs instead of byte[] backed ByteBuffer?

Dec 18 '23 14:12 mikemccand

Thanks @mikemccand for taking a look! I see the getCommonSuffixBytesRef method from Automaton.

I wonder if it is really applicable to the FST... i.e. for the FST is it guaranteed that there exists one and only one state where all valid outputs path that share the same suffix go through? Or put it in other words, how many sub-graphs of the FST are there that represents the same suffix?

Dec 19 '23 06:12 Tony-X

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!

Jan 13 '24 00:01 github-actions[bot]

Random access term dictionary

Description

Observations:

PKLookup has improvement

Fuzzy/Wildcard/Prefix queries got much slower

HighTermTitleSort and HighTermMonthSort got about 4.5% ~ 10% less throughput

AndHighLow got slower

Fuzzy/Wildcard/Prefix queries got much slower

HighTermTitleSort and HighTermMonthSort got about 4.5% ~ 10% less throughput

AndHighLow got slower

FST doesn't play nicely with primitive types (I know, this is more or less a java issue)

Non-trivial amount of allocations for? .... building IndexInput slice descriptions !?

My version

mainline

`HighTermTitleSort` and `HighTermMonthSort` got about 4.5% ~ 10% less throughput

`AndHighLow` got slower

`HighTermTitleSort` and `HighTermMonthSort` got about 4.5% ~ 10% less throughput

`AndHighLow` got slower