Random access term dictionary
Description
Related issue https://github.com/apache/lucene/issues/12513
Opening this PR early to avoid massive diffs in one-shot
- [x] Encode (term type, local ord) in FST
- [x] Encode/Decode term states with bitpacking/unpacking.
TODO:
- [x] Implement bit-packing and unpacking for each term type
- [x] Implement the PostingsFormat
I'll try to review this soon -- it sounds compelling @Tony-X! I like how it is inspired by Tantivy's term dictionary format (which holds all terms + their metadata in RAM).
Also, with the upcoming ability to cleanly limit how much RAM the FSTCompiler is allowed to use to reduce the size of the FST, this approach becomes more feasible. Without that change, the FST compilation might easily use excessive RAM during indexing when merging large segments.
I'll also try to review! On the bit packing subject, I have some handy generic code (not in Lucene yet) to write and read variable size bits. Tell me if you are interested.
Thanks @bruno-roustant ! If you're okay to share it feel free to share it here.
I'm in the process of baking my own specific implementation (which internally uses a single long as bit buffer), but I might absorb some interesting ideas from your impl.
It seems like you have the low level encode/decode working? So all that remains is to hook that up with the Codec components that read/write the terms dict ... then you can test the Codec by passing -Dtests.codec=
and Lucene will run all tests cases using your Codec.
Thanks for the tips! Yes, almost there. I'm working on the real compact bitpacker and unpacker. I still need to implement the PostingFormat afterwards. Do you think I need to implement a new Codec?
Just realized that we have lucene99 Codec out! I'll update the code to reflect that as this posting format aims to work with the latest Codec.
Thanks for the tips! Yes, almost there. I'm working on the real compact bitpacker and unpacker. I still need to implement the PostingFormat afterwards. Do you think I need to implement a new Codec?
You don't really need a new Codec -- you could test this easily by subclassing Lucene99Codec and overriding getPostingsFormatForField to use your new cool bit packing PostingsFormat.
After some tweaking and tinkering I was finally able to get the bench running the way I wanted in luceneutil! @mikemccand
Unfortunately luceneutil out of the box doesn't work for my case...
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Wildcard 56.20 (2.0%) 7.30 (0.2%) -87.0% ( -87% - -86%) 0.000
Respell 44.00 (1.7%) 14.46 (0.6%) -67.1% ( -68% - -65%) 0.000
Fuzzy1 54.11 (1.2%) 20.86 (0.8%) -61.4% ( -62% - -60%) 0.000
Prefix3 103.61 (0.8%) 41.37 (0.5%) -60.1% ( -60% - -59%) 0.000
Fuzzy2 42.43 (1.2%) 20.26 (0.9%) -52.3% ( -53% - -50%) 0.000
HighTermTitleSort 127.60 (1.7%) 114.65 (1.6%) -10.1% ( -13% - -6%) 0.000
HighTermMonthSort 2549.22 (2.8%) 2435.74 (3.8%) -4.5% ( -10% - 2%) 0.037
AndHighLow 708.99 (2.3%) 678.15 (2.1%) -4.4% ( -8% - 0%) 0.001
LowTerm 369.16 (5.5%) 358.15 (3.1%) -3.0% ( -10% - 5%) 0.287
OrNotHighLow 258.51 (1.8%) 252.11 (2.2%) -2.5% ( -6% - 1%) 0.054
OrHighLow 348.58 (1.1%) 340.11 (2.5%) -2.4% ( -5% - 1%) 0.046
OrNotHighHigh 141.31 (6.7%) 138.54 (3.2%) -2.0% ( -11% - 8%) 0.554
IntNRQ 17.93 (1.8%) 17.62 (3.7%) -1.7% ( -7% - 3%) 0.349
MedSloppyPhrase 26.73 (0.8%) 26.28 (1.3%) -1.7% ( -3% - 0%) 0.017
HighIntervalsOrdered 3.88 (2.6%) 3.82 (2.3%) -1.6% ( -6% - 3%) 0.314
HighTerm 290.77 (7.3%) 286.20 (6.9%) -1.6% ( -14% - 13%) 0.727
OrHighNotHigh 296.51 (6.9%) 291.90 (4.9%) -1.6% ( -12% - 11%) 0.682
LowIntervalsOrdered 15.48 (1.1%) 15.26 (1.3%) -1.4% ( -3% - 0%) 0.057
MedTerm 405.54 (6.3%) 399.77 (5.9%) -1.4% ( -12% - 11%) 0.713
LowSloppyPhrase 44.64 (0.5%) 44.04 (2.1%) -1.3% ( -3% - 1%) 0.171
HighTermDayOfYearSort 177.36 (1.6%) 175.08 (1.8%) -1.3% ( -4% - 2%) 0.234
OrHighMed 79.44 (3.7%) 78.48 (3.0%) -1.2% ( -7% - 5%) 0.572
OrNotHighMed 268.82 (4.3%) 265.74 (3.1%) -1.1% ( -8% - 6%) 0.632
BrowseMonthTaxoFacets 3.89 (0.4%) 3.85 (1.2%) -0.9% ( -2% - 0%) 0.134
AndHighMedDayTaxoFacets 44.33 (0.7%) 43.96 (1.1%) -0.8% ( -2% - 0%) 0.145
MedSpanNear 30.67 (1.1%) 30.42 (2.3%) -0.8% ( -4% - 2%) 0.481
LowSpanNear 4.56 (0.8%) 4.53 (2.1%) -0.6% ( -3% - 2%) 0.576
HighSpanNear 8.52 (1.4%) 8.47 (2.1%) -0.5% ( -3% - 2%) 0.636
OrHighNotLow 236.32 (6.6%) 235.28 (4.4%) -0.4% ( -10% - 11%) 0.901
AndHighHighDayTaxoFacets 4.31 (0.6%) 4.29 (0.8%) -0.4% ( -1% - 1%) 0.446
LowPhrase 69.90 (1.2%) 69.67 (2.7%) -0.3% ( -4% - 3%) 0.807
AndHighMed 41.87 (0.6%) 41.75 (2.7%) -0.3% ( -3% - 3%) 0.828
OrHighHigh 45.86 (7.4%) 45.77 (7.9%) -0.2% ( -14% - 16%) 0.969
TermDTSort 101.90 (2.3%) 101.84 (2.0%) -0.1% ( -4% - 4%) 0.966
HighSloppyPhrase 0.48 (2.2%) 0.48 (2.6%) -0.0% ( -4% - 4%) 0.995
BrowseDateSSDVFacets 1.00 (3.7%) 1.00 (5.3%) -0.0% ( -8% - 9%) 1.000
HighTermTitleBDVSort 4.48 (5.4%) 4.49 (4.4%) 0.0% ( -9% - 10%) 0.990
BrowseDayOfYearSSDVFacets 3.51 (13.4%) 3.52 (13.3%) 0.1% ( -23% - 30%) 0.990
BrowseMonthSSDVFacets 3.38 (0.7%) 3.39 (0.6%) 0.1% ( -1% - 1%) 0.740
BrowseDayOfYearTaxoFacets 3.85 (0.3%) 3.85 (0.5%) 0.2% ( 0% - 0%) 0.469
MedIntervalsOrdered 1.65 (1.5%) 1.65 (1.5%) 0.2% ( -2% - 3%) 0.802
BrowseDateTaxoFacets 3.82 (0.3%) 3.83 (0.4%) 0.3% ( 0% - 0%) 0.264
BrowseRandomLabelSSDVFacets 2.34 (1.0%) 2.35 (0.5%) 0.3% ( -1% - 1%) 0.494
OrHighNotMed 296.34 (7.6%) 297.38 (4.6%) 0.4% ( -11% - 13%) 0.930
BrowseRandomLabelTaxoFacets 3.29 (0.6%) 3.31 (0.7%) 0.5% ( 0% - 1%) 0.275
MedPhrase 9.84 (2.1%) 9.90 (4.1%) 0.6% ( -5% - 6%) 0.787
AndHighHigh 33.50 (1.2%) 33.73 (3.4%) 0.7% ( -3% - 5%) 0.669
HighPhrase 15.15 (2.3%) 15.28 (4.1%) 0.9% ( -5% - 7%) 0.679
MedTermDayTaxoFacets 13.36 (1.8%) 13.55 (2.0%) 1.4% ( -2% - 5%) 0.240
OrHighMedDayTaxoFacets 3.73 (1.4%) 3.83 (2.2%) 2.6% ( 0% - 6%) 0.026
PKLookup 147.84 (1.6%) 157.37 (1.5%) 6.4% ( 3% - 9%) 0.000
Observations:
PKLookup has improvement
This is reasonable as the terms index (FST) holds all the terms.
Fuzzy/Wildcard/Prefix queries got much slower
This is also expected because currently I used the default implementation provided by TermsEnum which does not take advantage of the FST. With an optimized implementation I expect it to at least be on-par and slightly better because the FST holds information about all terms, whereas the current BlockTreeTerms only holds prefixes.
HighTermTitleSort and HighTermMonthSort got about 4.5% ~ 10% less throughput
I don't quite understand why term lookup could affect sorting on a DV field
AndHighLow got slower
Am i missing some optimization opportunity for low freq terms?
This is reasonable as the terms index (FST) holds all the terms.
+1, nice!
Fuzzy/Wildcard/Prefix queries got much slower
This is also expected because currently I used the default implementation provided by
TermsEnumwhich does not take advantage of the FST. With an optimized implementation I expect it to at least be on-par and slightly better because the FST holds information about all terms, whereas the current BlockTreeTerms only holds prefixes.
OK this makes sense, and it is a (sad) measure of how slow the emulated (on top of seekCeil) .intersect TermsEnum is. Once you have an optimized version it should likely be faster than block tree since it can intersect all suffixes instead of scanning byte[] suffixes in the term block and re-testing each.
HighTermTitleSortandHighTermMonthSortgot about 4.5% ~ 10% less throughputI don't quite understand why term lookup could affect sorting on a DV field
This is odd. Though, the HighTermMonthSort QPS is so crazy high as to not really be trustworthy -- likely BMW is kicking in and saving tons of work.
AndHighLowgot slowerAm i missing some optimization opportunity for low freq terms?
Hmm maybe pulsing? Are we still inlining single-occurrence terms directly into the terms dict with your new terms dict?
Since the first working version, I iterated with a list of profiling-guided allocation optimizations, as they stood out quite obviously from the merged JFR reports (thanks to luceneutil !).
Some of them comes from my code that implements the term dictionary data lookup, and a few of them are at more general Lucene level. I want to highlight the general issue I see from this work and maybe we can have separate issues to improve them!
Here is the first heap profile comparison (search-only, no indexing).
Candidate Heap
17.50% 24440M java.lang.Long#valueOf()
10.09% 14096M jdk.internal.misc.Unsafe#allocateUninitializedArray()
6.87% 9594M org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
4.40% 6140M org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegmentNHLD()
...
main
13.65% 11898M org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
9.26% 8071M org.apache.lucene.util.FixedBitSet#<init>()
6.70% 5836M org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegmentNHLD()
6.60% 5751M org.apache.lucene.util.ArrayUtil#growExact()
5.21% 4541M org.apache.lucene.facet.FacetsConfig#stringToPath()
4.69% 4090M org.apache.lucene.util.DocIdSetBuilder$Buffer#<init>()
FST doesn't play nicely with primitive types (I know, this is more or less a java issue)
24440M java.lang.Long#valueOf() huge amount of allocations... This is obvious. The FST<T> implementation is generic over its output type and in my case T is Long. So for trivial long add and subtract, the implementation would allocate an object. Not only it is wasteful but from a perf perspective it'd be less than 1 CPU cycle v.s. calling allocator which is easily tens if not hundreds of cycles.
For this work, I forked the FST<T> class and manually templated it with long just to see how much difference it makes. Here is a diff in heap profile and bench results before and after.
Before
PERCENT HEAP SAMPLES STACK
25.97% 32791M java.lang.Long#valueOf()
7.58% 9571M org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
5.13% 6482M org.apache.lucene.util.FixedBitSet#<init>()
4.90%
....
After
PERCENT HEAP SAMPLES STACK
8.44% 7988M org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
7.17% 6788M org.apache.lucene.util.FixedBitSet#<init>()
6.22% 5886M org.apache.lucene.facet.sortedset.SortedSetDocValuesFacetCounts#countOneSegmentNHLD()
5.89% 5577M org.apache.lucene.util.ArrayUtil#growExact()
Bench diff
Before
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Wildcard 11.61 (2.7%) 2.40 (0.6%) -79.4% ( -80% - -78%) 0.000
Fuzzy1 78.17 (0.7%) 27.16 (0.9%) -65.3% ( -66% - -64%) 0.000
Respell 29.09 (0.6%) 10.91 (0.8%) -62.5% ( -63% - -61%) 0.000
Fuzzy2 47.80 (0.6%) 18.50 (1.2%) -61.3% ( -62% - -59%) 0.000
Prefix3 765.08 (3.1%) 463.94 (0.9%) -39.4% ( -42% - -36%) 0.000
HighTermTitleSort 98.48 (2.0%) 90.62 (2.2%) -8.0% ( -11% - -3%) 0.000
BrowseMonthTaxoFacets 3.89 (29.2%) 3.62 (0.9%) -6.9% ( -28% - 32%) 0.293
LowSloppyPhrase 22.73 (6.5%) 22.35 (6.9%) -1.7% ( -14% - 12%) 0.432
LowTerm 365.47 (3.6%) 359.62 (2.9%) -1.6% ( -7% - 5%) 0.121
HighTerm 398.57 (5.1%) 393.16 (4.7%) -1.4% ( -10% - 8%) 0.380
MedSloppyPhrase 10.63 (3.6%) 10.51 (3.7%) -1.1% ( -8% - 6%) 0.339
MedTerm 422.73 (4.2%) 418.60 (4.0%) -1.0% ( -8% - 7%) 0.451
MedTermDayTaxoFacets 14.84 (2.6%) 14.71 (2.5%) -0.8% ( -5% - 4%) 0.296
HighSloppyPhrase 12.41 (3.1%) 12.33 (3.1%) -0.7% ( -6% - 5%) 0.487
HighTermTitleBDVSort 6.88 (3.3%) 6.84 (3.5%) -0.6% ( -7% - 6%) 0.599
LowPhrase 58.15 (2.9%) 57.85 (2.8%) -0.5% ( -6% - 5%) 0.567
BrowseDayOfYearSSDVFacets 3.24 (0.4%) 3.23 (0.5%) -0.3% ( -1% - 0%) 0.042
MedPhrase 26.19 (3.1%) 26.11 (3.2%) -0.3% ( -6% - 6%) 0.775
OrNotHighMed 185.23 (3.9%) 184.73 (3.3%) -0.3% ( -7% - 7%) 0.813
OrHighMedDayTaxoFacets 3.82 (3.3%) 3.81 (3.2%) -0.3% ( -6% - 6%) 0.796
OrHighNotLow 194.98 (5.1%) 194.51 (4.6%) -0.2% ( -9% - 10%) 0.875
OrHighNotMed 337.15 (4.4%) 336.53 (3.8%) -0.2% ( -7% - 8%) 0.888
IntNRQ 67.60 (0.9%) 67.55 (1.0%) -0.1% ( -1% - 1%) 0.783
MedSpanNear 9.85 (1.4%) 9.84 (2.1%) -0.1% ( -3% - 3%) 0.906
OrNotHighHigh 205.12 (4.1%) 205.01 (3.9%) -0.1% ( -7% - 8%) 0.967
AndHighHighDayTaxoFacets 6.35 (1.5%) 6.34 (1.7%) -0.0% ( -3% - 3%) 0.932
BrowseMonthSSDVFacets 3.29 (0.8%) 3.29 (0.7%) -0.0% ( -1% - 1%) 0.887
BrowseRandomLabelSSDVFacets 2.30 (0.8%) 2.30 (1.0%) 0.0% ( -1% - 1%) 0.919
LowSpanNear 16.41 (2.6%) 16.42 (2.7%) 0.1% ( -5% - 5%) 0.931
HighPhrase 77.12 (3.0%) 77.20 (3.6%) 0.1% ( -6% - 6%) 0.923
AndHighMedDayTaxoFacets 39.64 (1.2%) 39.68 (1.0%) 0.1% ( -2% - 2%) 0.742
BrowseRandomLabelTaxoFacets 3.19 (1.6%) 3.19 (1.1%) 0.1% ( -2% - 2%) 0.728
BrowseDateTaxoFacets 3.73 (0.7%) 3.74 (0.5%) 0.3% ( 0% - 1%) 0.157
AndHighHigh 27.08 (1.3%) 27.15 (3.0%) 0.3% ( -4% - 4%) 0.718
BrowseDayOfYearTaxoFacets 3.76 (0.6%) 3.77 (0.5%) 0.3% ( 0% - 1%) 0.072
HighTermDayOfYearSort 224.01 (2.1%) 224.81 (2.1%) 0.4% ( -3% - 4%) 0.592
HighSpanNear 6.09 (2.7%) 6.11 (3.1%) 0.4% ( -5% - 6%) 0.683
HighIntervalsOrdered 8.08 (3.3%) 8.11 (3.4%) 0.4% ( -6% - 7%) 0.705
TermDTSort 103.29 (4.4%) 103.83 (3.1%) 0.5% ( -6% - 8%) 0.669
MedIntervalsOrdered 33.12 (4.4%) 33.29 (4.6%) 0.5% ( -8% - 9%) 0.702
LowIntervalsOrdered 10.06 (3.9%) 10.12 (3.6%) 0.6% ( -6% - 8%) 0.609
AndHighMed 73.71 (2.2%) 74.18 (2.5%) 0.6% ( -3% - 5%) 0.394
OrHighMed 71.44 (2.7%) 71.98 (3.3%) 0.7% ( -5% - 6%) 0.429
BrowseDateSSDVFacets 0.96 (4.8%) 0.97 (5.7%) 0.9% ( -9% - 11%) 0.601
OrHighNotHigh 308.82 (4.0%) 311.53 (3.7%) 0.9% ( -6% - 8%) 0.470
OrHighLow 404.69 (3.0%) 408.63 (3.5%) 1.0% ( -5% - 7%) 0.348
OrHighHigh 20.44 (4.7%) 20.73 (7.2%) 1.4% ( -10% - 13%) 0.469
OrNotHighLow 381.28 (1.8%) 388.18 (2.1%) 1.8% ( -2% - 5%) 0.004
HighTermMonthSort 2500.04 (2.2%) 2554.91 (4.3%) 2.2% ( -4% - 8%) 0.042
AndHighLow 668.12 (3.1%) 692.04 (3.9%) 3.6% ( -3% - 10%) 0.001
PKLookup 140.25 (2.0%) 168.53 (1.9%) 20.2% ( 15% - 24%) 0.000
After
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Wildcard 54.96 (2.6%) 10.43 (0.5%) -81.0% ( -82% - -80%) 0.000
Respell 45.54 (1.0%) 16.74 (0.7%) -63.2% ( -64% - -62%) 0.000
Fuzzy1 46.41 (1.2%) 17.26 (1.0%) -62.8% ( -64% - -61%) 0.000
Prefix3 121.65 (2.4%) 55.57 (0.9%) -54.3% ( -56% - -52%) 0.000
Fuzzy2 32.33 (1.2%) 15.79 (1.1%) -51.2% ( -52% - -49%) 0.000
HighTermTitleSort 95.24 (2.1%) 87.04 (1.9%) -8.6% ( -12% - -4%) 0.000
BrowseRandomLabelSSDVFacets 2.37 (7.1%) 2.33 (4.8%) -1.7% ( -12% - 10%) 0.374
BrowseMonthSSDVFacets 3.34 (7.3%) 3.29 (0.6%) -1.5% ( -8% - 6%) 0.362
TermDTSort 120.57 (2.3%) 119.02 (3.4%) -1.3% ( -6% - 4%) 0.163
OrHighHigh 19.13 (5.6%) 18.92 (2.7%) -1.1% ( -8% - 7%) 0.430
AndHighHigh 22.04 (5.1%) 21.87 (3.0%) -0.8% ( -8% - 7%) 0.555
AndHighMed 55.06 (3.0%) 54.79 (2.1%) -0.5% ( -5% - 4%) 0.546
HighSpanNear 3.29 (1.6%) 3.28 (1.7%) -0.5% ( -3% - 2%) 0.346
HighIntervalsOrdered 0.65 (1.8%) 0.65 (2.0%) -0.5% ( -4% - 3%) 0.433
HighTermDayOfYearSort 282.86 (2.0%) 281.57 (2.6%) -0.5% ( -4% - 4%) 0.533
MedIntervalsOrdered 16.36 (1.5%) 16.29 (1.5%) -0.4% ( -3% - 2%) 0.369
OrHighMed 68.27 (2.9%) 67.99 (1.8%) -0.4% ( -5% - 4%) 0.598
MedSpanNear 3.22 (1.0%) 3.21 (1.4%) -0.4% ( -2% - 2%) 0.317
HighSloppyPhrase 9.59 (2.5%) 9.57 (2.6%) -0.3% ( -5% - 4%) 0.733
BrowseMonthTaxoFacets 3.64 (2.4%) 3.63 (1.8%) -0.2% ( -4% - 4%) 0.756
LowIntervalsOrdered 14.66 (0.9%) 14.63 (1.5%) -0.2% ( -2% - 2%) 0.633
MedTermDayTaxoFacets 15.56 (2.7%) 15.54 (3.9%) -0.2% ( -6% - 6%) 0.879
AndHighMedDayTaxoFacets 18.70 (1.4%) 18.67 (3.7%) -0.2% ( -5% - 5%) 0.864
LowSpanNear 4.39 (1.1%) 4.38 (1.4%) -0.1% ( -2% - 2%) 0.728
OrHighMedDayTaxoFacets 5.38 (3.5%) 5.38 (5.4%) -0.1% ( -8% - 9%) 0.945
AndHighHighDayTaxoFacets 7.06 (1.6%) 7.06 (3.0%) -0.1% ( -4% - 4%) 0.924
LowSloppyPhrase 7.16 (1.4%) 7.15 (1.6%) -0.1% ( -2% - 2%) 0.891
MedSloppyPhrase 128.54 (1.9%) 128.56 (2.2%) 0.0% ( -4% - 4%) 0.979
LowTerm 417.80 (3.3%) 418.01 (2.7%) 0.1% ( -5% - 6%) 0.958
LowPhrase 125.59 (4.0%) 125.77 (3.1%) 0.1% ( -6% - 7%) 0.900
OrHighLow 313.22 (2.1%) 313.72 (2.2%) 0.2% ( -4% - 4%) 0.817
BrowseDateTaxoFacets 3.73 (0.7%) 3.74 (0.7%) 0.2% ( -1% - 1%) 0.470
BrowseDayOfYearTaxoFacets 3.76 (0.7%) 3.76 (0.7%) 0.2% ( -1% - 1%) 0.457
MedTerm 384.57 (4.6%) 385.44 (3.6%) 0.2% ( -7% - 8%) 0.863
OrHighNotHigh 255.07 (4.3%) 256.05 (4.3%) 0.4% ( -7% - 9%) 0.778
MedPhrase 11.17 (3.0%) 11.21 (2.6%) 0.4% ( -5% - 6%) 0.658
HighTerm 361.26 (5.1%) 362.86 (4.2%) 0.4% ( -8% - 10%) 0.764
BrowseRandomLabelTaxoFacets 3.19 (1.5%) 3.20 (0.6%) 0.5% ( -1% - 2%) 0.203
OrNotHighHigh 205.38 (4.0%) 206.35 (4.0%) 0.5% ( -7% - 8%) 0.712
OrNotHighLow 317.96 (1.7%) 319.48 (2.1%) 0.5% ( -3% - 4%) 0.428
HighPhrase 47.91 (3.8%) 48.15 (3.3%) 0.5% ( -6% - 7%) 0.661
BrowseDateSSDVFacets 0.97 (6.9%) 0.98 (6.7%) 0.5% ( -12% - 15%) 0.801
OrHighNotLow 185.96 (4.9%) 187.04 (5.0%) 0.6% ( -8% - 11%) 0.710
BrowseDayOfYearSSDVFacets 3.21 (2.1%) 3.23 (0.9%) 0.6% ( -2% - 3%) 0.225
HighTermTitleBDVSort 5.83 (3.7%) 5.87 (4.0%) 0.7% ( -6% - 8%) 0.584
OrNotHighMed 516.84 (2.5%) 520.76 (2.5%) 0.8% ( -4% - 5%) 0.334
IntNRQ 29.24 (3.0%) 29.50 (4.1%) 0.9% ( -6% - 8%) 0.425
OrHighNotMed 268.45 (4.4%) 270.92 (4.2%) 0.9% ( -7% - 9%) 0.501
HighTermMonthSort 2498.46 (4.8%) 2590.43 (3.7%) 3.7% ( -4% - 12%) 0.007
AndHighLow 747.94 (2.1%) 775.60 (4.0%) 3.7% ( -2% - 10%) 0.000
PKLookup 141.68 (2.0%) 177.85 (1.5%) 25.5% ( 21% - 29%) 0.000
Non-trivial amount of allocations for? .... building IndexInput slice descriptions !?
jdk.internal.misc.Unsafe#allocateUninitializedArray(). This was not trivial to find out why. But again with the raw JFR report, we can analyze the call tree. It turn out that in the buildSlice() implementation of MemorySegmentIndexInput we call IndexInput#getFullSliceDescription() which creates new String. And allocateUninitializedArray is called to allocate the bytes for the String.
AFAIK, the description is only used for debugging and tracking purposes. I didn't expect it'd cause that much of allocation. So I made a change to pass null when building the description so those allocations are gone.
Before
PERCENT HEAP SAMPLES STACK
10.39% 12103M java.lang.Long#valueOf()
9.91% 11543M org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
8.91% 10383M jdk.internal.misc.Unsafe#allocateUninitializedArray()
After
PERCENT HEAP SAMPLES STACK [37/1812]
25.97% 32791M java.lang.Long#valueOf()
7.58% 9571M org.apache.lucene.facet.taxonomy.IntTaxonomyFacets#initializeValueCounters()
5.13% 6482M org.apache.lucene.util.FixedBitSet#<init>()
Here is the even more interesting stuff. After all those allocation optimizations. I also implemented the on-paper more "efficient" algorithm to intersect FST and FSA for Terms.intersect(), which leverages the sorted nature of the FST arcs and FSA transitions from a given state (so at least we can binary search to advance with some skipping). FST in some cases have direct addressing which is exploited, too. As a side note -- it also uncovered a bug for the NFA impl which is tracked here https://github.com/apache/lucene/issues/12906.
But that change is not moving the needle at all for WildCard and Prefix3 search tasks.
Before
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Wildcard 62.85 (1.8%) 12.66 (0.6%) -79.9% ( -80% - -78%) 0.000
Fuzzy2 55.12 (1.0%) 18.92 (0.8%) -65.7% ( -66% - -64%) 0.000
Fuzzy1 61.20 (0.8%) 22.55 (0.8%) -63.2% ( -64% - -62%) 0.000
Respell 31.11 (0.8%) 11.95 (0.6%) -61.6% ( -62% - -60%) 0.000
Prefix3 135.69 (2.0%) 65.06 (0.7%) -52.1% ( -53% - -50%) 0.000
HighTermTitleSort 119.58 (0.9%) 111.03 (1.7%) -7.2% ( -9% - -4%) 0.000
IntNRQ 22.25 (1.1%) 21.87 (1.5%) -1.7% ( -4% - 0%) 0.000
HighPhrase 25.82 (3.6%) 25.55 (3.2%) -1.1% ( -7% - 5%) 0.318
MedPhrase 7.41 (2.4%) 7.35 (2.2%) -0.8% ( -5% - 3%) 0.259
LowSpanNear 8.81 (1.9%) 8.74 (2.1%) -0.8% ( -4% - 3%) 0.202
OrHighMedDayTaxoFacets 3.86 (5.8%) 3.83 (4.8%) -0.8% ( -10% - 10%) 0.636
TermDTSort 100.75 (2.9%) 99.98 (2.0%) -0.8% ( -5% - 4%) 0.336
HighIntervalsOrdered 6.07 (2.1%) 6.03 (2.4%) -0.7% ( -5% - 3%) 0.342
MedIntervalsOrdered 45.89 (2.0%) 45.61 (2.4%) -0.6% ( -4% - 3%) 0.389
HighSpanNear 10.73 (1.0%) 10.66 (1.5%) -0.6% ( -3% - 2%) 0.165
HighTermDayOfYearSort 206.09 (1.8%) 204.93 (1.9%) -0.6% ( -4% - 3%) 0.338
LowIntervalsOrdered 8.39 (2.3%) 8.37 (2.5%) -0.3% ( -5% - 4%) 0.654
MedSpanNear 66.00 (1.3%) 65.81 (1.9%) -0.3% ( -3% - 2%) 0.574
MedTerm 322.61 (4.7%) 321.89 (4.5%) -0.2% ( -9% - 9%) 0.878
AndHighMedDayTaxoFacets 22.62 (1.0%) 22.58 (1.2%) -0.2% ( -2% - 2%) 0.617
LowPhrase 48.52 (1.3%) 48.46 (1.4%) -0.1% ( -2% - 2%) 0.745
LowTerm 403.54 (2.9%) 403.22 (2.4%) -0.1% ( -5% - 5%) 0.923
BrowseRandomLabelTaxoFacets 3.20 (0.7%) 3.20 (0.9%) -0.0% ( -1% - 1%) 0.905
AndHighHighDayTaxoFacets 8.06 (1.3%) 8.06 (1.6%) 0.0% ( -2% - 2%) 0.962
BrowseDayOfYearTaxoFacets 3.76 (0.6%) 3.76 (0.6%) 0.0% ( -1% - 1%) 0.859
BrowseMonthTaxoFacets 3.62 (1.0%) 3.62 (1.0%) 0.1% ( -1% - 2%) 0.866
OrHighNotHigh 156.16 (6.0%) 156.26 (5.9%) 0.1% ( -11% - 12%) 0.972
BrowseDateTaxoFacets 3.73 (0.6%) 3.73 (0.6%) 0.1% ( -1% - 1%) 0.722
OrNotHighHigh 144.55 (5.1%) 144.68 (4.7%) 0.1% ( -9% - 10%) 0.957
MedTermDayTaxoFacets 17.57 (2.8%) 17.59 (2.8%) 0.2% ( -5% - 5%) 0.863
BrowseDayOfYearSSDVFacets 3.22 (0.9%) 3.23 (0.7%) 0.2% ( -1% - 1%) 0.424
HighTerm 401.13 (5.4%) 402.30 (5.7%) 0.3% ( -10% - 12%) 0.868
BrowseMonthSSDVFacets 3.27 (0.7%) 3.28 (1.0%) 0.3% ( -1% - 2%) 0.202
AndHighHigh 20.87 (2.8%) 20.94 (2.5%) 0.4% ( -4% - 5%) 0.670
HighSloppyPhrase 6.58 (3.3%) 6.61 (3.4%) 0.4% ( -6% - 7%) 0.727
AndHighMed 89.34 (1.8%) 89.76 (1.4%) 0.5% ( -2% - 3%) 0.355
BrowseDateSSDVFacets 0.95 (2.8%) 0.95 (4.0%) 0.5% ( -6% - 7%) 0.656
OrNotHighLow 420.17 (2.2%) 422.34 (2.2%) 0.5% ( -3% - 4%) 0.452
LowSloppyPhrase 2.89 (2.4%) 2.91 (1.9%) 0.6% ( -3% - 4%) 0.369
OrNotHighMed 219.50 (4.5%) 221.02 (4.1%) 0.7% ( -7% - 9%) 0.611
MedSloppyPhrase 10.44 (2.2%) 10.52 (1.4%) 0.7% ( -2% - 4%) 0.222
OrHighNotLow 288.48 (5.4%) 290.70 (5.7%) 0.8% ( -9% - 12%) 0.663
OrHighMed 53.25 (3.6%) 53.66 (3.5%) 0.8% ( -6% - 8%) 0.488
HighTermTitleBDVSort 2.77 (7.2%) 2.79 (7.0%) 0.9% ( -12% - 16%) 0.699
OrHighNotMed 270.38 (5.7%) 272.88 (5.4%) 0.9% ( -9% - 12%) 0.601
OrHighHigh 20.86 (5.1%) 21.08 (5.2%) 1.1% ( -8% - 12%) 0.519
OrHighLow 220.40 (4.2%) 223.52 (5.6%) 1.4% ( -8% - 11%) 0.367
BrowseRandomLabelSSDVFacets 2.32 (4.3%) 2.38 (7.6%) 2.3% ( -9% - 14%) 0.240
AndHighLow 395.82 (2.9%) 405.36 (2.8%) 2.4% ( -3% - 8%) 0.008
HighTermMonthSort 2375.71 (3.7%) 2555.72 (5.0%) 7.6% ( -1% - 16%) 0.000
PKLookup 140.60 (1.8%) 178.02 (1.3%) 26.6% ( 23% - 30%) 0.000
After
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
Wildcard 37.77 (2.7%) 5.70 (1.3%) -84.9% ( -86% - -83%) 0.000
Prefix3 52.27 (2.7%) 22.71 (1.8%) -56.6% ( -59% - -53%) 0.000
Fuzzy1 59.96 (1.6%) 54.80 (2.3%) -8.6% ( -12% - -4%) 0.000
HighTermTitleSort 106.17 (2.1%) 101.30 (1.5%) -4.6% ( -7% - -1%) 0.000
Fuzzy2 33.40 (1.3%) 32.14 (1.6%) -3.8% ( -6% - 0%) 0.000
MedTerm 273.84 (5.4%) 265.92 (9.7%) -2.9% ( -17% - 12%) 0.245
HighTerm 349.66 (5.2%) 341.63 (8.9%) -2.3% ( -15% - 12%) 0.320
LowTerm 356.23 (3.1%) 350.27 (4.3%) -1.7% ( -8% - 5%) 0.156
LowSloppyPhrase 4.47 (2.1%) 4.44 (4.6%) -0.8% ( -7% - 6%) 0.492
HighSpanNear 8.12 (2.1%) 8.06 (2.5%) -0.7% ( -5% - 4%) 0.331
MedSloppyPhrase 31.05 (3.6%) 30.83 (4.1%) -0.7% ( -8% - 7%) 0.559
MedIntervalsOrdered 3.88 (3.3%) 3.86 (3.3%) -0.7% ( -7% - 6%) 0.519
MedSpanNear 8.94 (1.2%) 8.88 (1.6%) -0.7% ( -3% - 2%) 0.126
LowIntervalsOrdered 7.40 (3.3%) 7.35 (3.4%) -0.7% ( -7% - 6%) 0.537
LowSpanNear 29.33 (2.0%) 29.15 (2.2%) -0.6% ( -4% - 3%) 0.374
HighSloppyPhrase 6.68 (3.6%) 6.64 (3.5%) -0.5% ( -7% - 6%) 0.624
MedTermDayTaxoFacets 9.14 (3.0%) 9.11 (8.2%) -0.3% ( -11% - 11%) 0.861
HighPhrase 115.62 (3.9%) 115.24 (4.0%) -0.3% ( -7% - 7%) 0.798 [162/1927]
AndHighHigh 13.95 (4.2%) 13.92 (4.5%) -0.3% ( -8% - 8%) 0.847
BrowseMonthSSDVFacets 3.30 (0.8%) 3.29 (1.0%) -0.3% ( -2% - 1%) 0.377
AndHighMed 85.40 (2.0%) 85.18 (2.1%) -0.2% ( -4% - 3%) 0.695
IntNRQ 16.65 (4.1%) 16.63 (3.8%) -0.1% ( -7% - 8%) 0.914
BrowseRandomLabelSSDVFacets 2.30 (0.9%) 2.30 (1.1%) -0.1% ( -2% - 1%) 0.754
OrHighHigh 24.99 (6.1%) 24.97 (5.3%) -0.1% ( -10% - 12%) 0.957
AndHighHighDayTaxoFacets 2.29 (2.8%) 2.29 (2.4%) -0.0% ( -5% - 5%) 0.977
AndHighMedDayTaxoFacets 40.17 (1.4%) 40.20 (1.4%) 0.1% ( -2% - 2%) 0.872
OrHighMedDayTaxoFacets 3.15 (3.9%) 3.15 (3.4%) 0.1% ( -6% - 7%) 0.946
LowPhrase 30.23 (2.5%) 30.26 (2.4%) 0.1% ( -4% - 5%) 0.911
OrNotHighLow 201.78 (2.9%) 202.03 (3.0%) 0.1% ( -5% - 6%) 0.896
BrowseRandomLabelTaxoFacets 3.20 (3.2%) 3.21 (4.1%) 0.1% ( -6% - 7%) 0.899
HighIntervalsOrdered 0.42 (1.9%) 0.42 (1.6%) 0.2% ( -3% - 3%) 0.699
OrNotHighMed 235.49 (5.5%) 236.01 (4.7%) 0.2% ( -9% - 11%) 0.892
BrowseMonthTaxoFacets 3.62 (1.0%) 3.63 (1.0%) 0.2% ( -1% - 2%) 0.477
OrNotHighHigh 329.77 (4.9%) 330.79 (5.5%) 0.3% ( -9% - 11%) 0.851
MedPhrase 35.79 (3.4%) 35.90 (3.4%) 0.3% ( -6% - 7%) 0.771
TermDTSort 112.10 (3.2%) 112.45 (3.4%) 0.3% ( -6% - 7%) 0.763
BrowseDateSSDVFacets 0.97 (7.1%) 0.98 (10.0%) 0.4% ( -15% - 18%) 0.897
BrowseDayOfYearSSDVFacets 3.21 (2.2%) 3.22 (1.6%) 0.4% ( -3% - 4%) 0.525
HighTermDayOfYearSort 235.24 (2.1%) 236.16 (1.6%) 0.4% ( -3% - 4%) 0.512
OrHighMed 70.60 (3.3%) 70.99 (2.7%) 0.5% ( -5% - 6%) 0.571
AndHighLow 370.31 (3.2%) 372.60 (3.4%) 0.6% ( -5% - 7%) 0.559
HighTermTitleBDVSort 5.53 (4.1%) 5.56 (4.5%) 0.6% ( -7% - 9%) 0.648
OrHighNotLow 263.18 (5.6%) 264.95 (6.2%) 0.7% ( -10% - 13%) 0.717
OrHighNotMed 222.41 (5.8%) 224.06 (5.8%) 0.7% ( -10% - 13%) 0.688
OrHighNotHigh 233.04 (5.5%) 234.89 (5.8%) 0.8% ( -9% - 12%) 0.657
OrHighLow 463.17 (3.0%) 466.91 (3.1%) 0.8% ( -5% - 7%) 0.403
BrowseDayOfYearTaxoFacets 3.77 (0.6%) 3.84 (8.9%) 1.9% ( -7% - 11%) 0.342
BrowseDateTaxoFacets 3.74 (0.6%) 3.81 (8.8%) 1.9% ( -7% - 11%) 0.332
HighTermMonthSort 2350.73 (4.0%) 2477.32 (4.5%) 5.4% ( -2% - 14%) 0.000
Respell 30.81 (1.5%) 34.87 (1.7%) 13.2% ( 9% - 16%) 0.000
PKLookup 141.03 (1.9%) 177.54 (2.0%) 25.9% ( 21% - 30%) 0.000
I tried to modify the bench task file and only run WildCard to understand where the time is spent.
My version
mainline
So we can see that the most time is spent in actually reading out the FST arcs and FSA transitions... My intuitive explanation for why this is slower than the blocktree is that it has worse locality in its data access pattern. (@mikemccand maybe you can shed some light) Here are some relevant factors:
- The FST is larger as it contains all terms. So there are more Arcs to visit. Blocktree (main) use the FST to index prefixes.
- When binary-searching or directly address Arc/Transitions the target is somewhat random.
- The FST bytes are read backwards. (probably less of an issue if we read sequentially on modern HW)
- Blocktree at a given node reads bytes sequentially and terms are sorted, too.
Just out of curiosity I altered my code to load the FST on-heap to compare with the default off-heap option. It did not help much with Wildcard but PKLookup got substantially faster!
The PKLookup task is a great proxy to FST performance, as both versions of the code visits the exact same number of Arcs.
Off heap
Wildcard 47.56 (1.7%) 10.13 (0.4%) -78.7% ( -79% - -77%) 0.000
PKLookup 136.03 (2.4%) 147.93 (2.3%) 8.8% ( 3% - 13%) 0.000
on heap
Wildcard 37.11 (1.5%) 8.35 (0.3%) -77.5% ( -78% - -76%) 0.000
PKLookup 136.04 (3.3%) 269.60 (9.0%) 98.2% ( 83% - 114%) 0.000
Theres are very interesting results @Tony-X! I'll try to give deeper response soon, but one idea that jumped out about Wildcard is that BlockTree somewhere takes advantage of commonSuffixBytes or so? This is a BytesRef that is non-empty when all strings matched by the Automaton share some common suffix, as would happen with a wildcard query like ab*cd (cd would be the common suffix).
Maybe block tree is able to use this opto more effectively than the "all terms in FST" approach? But I think you could implement such an opto too, maybe: just find the one node (or, maybe more than one) whose suffix is the common suffix, and fast-check somehow?
The PKLookup gains are astounding!
Especially interesting is the off -> on heap gains for that task. We are somehow paying a high price for going through Lucene's IO APIs instead of byte[] backed ByteBuffer?
Thanks @mikemccand for taking a look! I see the getCommonSuffixBytesRef method from Automaton.
I wonder if it is really applicable to the FST... i.e. for the FST is it guaranteed that there exists one and only one state where all valid outputs path that share the same suffix go through? Or put it in other words, how many sub-graphs of the FST are there that represents the same suffix?
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!