lucene
lucene copied to clipboard
gh-13147: use dense bit-encoding for frequent terms
This is a proof-of-concept of encoding postings using dense bitmaps that encode both docs that have (with 1s) and don't have (with 0s) a term in them. It has various nocommits and it is still TBD whether this is worth pursuing. See #13147
Have you seen interesting performance numbers with this change?
right that last change seemed promising! But on luceneutil tasks it didn't show much impact, basically a regression to teh mean compared to the first revision I tested; possibly a slight negative trend overall. I'll paste the results here. I was considering creating a more targeted benchmark to exaggerate the impact using a field and query type where we would expect to see a benefit, if only as a debugging tool, but I haven't had a chance to do that yet. Another idea I had was to try to help branch prediction by using a specialized BlockDecoder or so rather than embedding if statements in every advance/nextDoc call. I'm also not sure whether writing full longs versus truncating the final long in the bitset and writing only as many bytes as needed (added in the recent revision) might be making some difference? Seems unlikely.
It seems to especially make phrase and span queries slower? Possibly the decoding of positions is not good? I could try restricting to fields without positions (which seem likely to be the ones that would benefit anyway)
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value [87/1978]
HighPhrase 4.62 (4.1%) 4.45 (3.9%) -3.7% ( -11% - 4%) 0.004
MedPhrase 67.01 (1.9%) 65.37 (2.0%) -2.4% ( -6% - 1%) 0.000
IntNRQ 15.92 (13.8%) 15.53 (13.2%) -2.4% ( -25% - 28%) 0.569
HighSloppyPhrase 3.60 (2.4%) 3.52 (2.8%) -2.3% ( -7% - 2%) 0.005
LowPhrase 9.44 (2.5%) 9.23 (2.4%) -2.2% ( -6% - 2%) 0.004
AndHighHighDayTaxoFacets 2.92 (2.3%) 2.88 (3.5%) -1.5% ( -7% - 4%) 0.108
LowSloppyPhrase 26.88 (2.5%) 26.48 (2.5%) -1.5% ( -6% - 3%) 0.065
MedSloppyPhrase 18.10 (3.2%) 17.85 (3.5%) -1.4% ( -7% - 5%) 0.191
HighTermDayOfYearSort 156.73 (2.2%) 154.64 (2.8%) -1.3% ( -6% - 3%) 0.096
TermDTSort 69.58 (2.6%) 68.83 (2.9%) -1.1% ( -6% - 4%) 0.216
AndHighMedDayTaxoFacets 15.06 (0.9%) 14.91 (1.6%) -1.0% ( -3% - 1%) 0.015
MedSpanNear 31.29 (1.5%) 30.97 (1.1%) -1.0% ( -3% - 1%) 0.017
HighIntervalsOrdered 3.92 (4.2%) 3.88 (4.5%) -1.0% ( -9% - 8%) 0.465
LowIntervalsOrdered 1.42 (3.0%) 1.41 (3.0%) -0.9% ( -6% - 5%) 0.335
LowSpanNear 14.23 (1.3%) 14.10 (0.8%) -0.9% ( -2% - 1%) 0.009
OrNotHighMed 207.12 (5.3%) 205.50 (5.8%) -0.8% ( -11% - 10%) 0.655
OrNotHighHigh 126.12 (5.2%) 125.14 (5.8%) -0.8% ( -11% - 10%) 0.655
HighTermTitleSort 79.33 (2.5%) 78.82 (2.0%) -0.6% ( -5% - 4%) 0.376
BrowseDateTaxoFacets 2.73 (5.7%) 2.71 (6.9%) -0.6% ( -12% - 12%) 0.752
BrowseRandomLabelTaxoFacets 2.11 (11.5%) 2.10 (11.9%) -0.6% ( -21% - 25%) 0.868
MedIntervalsOrdered 3.17 (3.9%) 3.15 (4.4%) -0.5% ( -8% - 8%) 0.694
HighSpanNear 3.15 (3.1%) 3.14 (2.2%) -0.5% ( -5% - 4%) 0.576
OrHighLow 247.95 (2.2%) 246.79 (1.9%) -0.5% ( -4% - 3%) 0.467
AndHighLow 396.47 (2.2%) 394.68 (1.5%) -0.4% ( -4% - 3%) 0.460
Fuzzy1 65.30 (1.5%) 65.01 (1.8%) -0.4% ( -3% - 2%) 0.402
BrowseDayOfYearTaxoFacets 2.77 (5.4%) 2.76 (6.2%) -0.4% ( -11% - 11%) 0.817
OrHighMed 142.70 (2.1%) 142.13 (2.0%) -0.4% ( -4% - 3%) 0.543
OrHighNotMed 279.04 (5.4%) 277.96 (6.2%) -0.4% ( -11% - 11%) 0.832
BrowseMonthSSDVFacets 2.38 (2.7%) 2.37 (2.9%) -0.4% ( -5% - 5%) 0.663
Fuzzy2 50.49 (1.3%) 50.32 (1.4%) -0.3% ( -3% - 2%) 0.420
OrHighHigh 21.69 (2.4%) 21.62 (3.6%) -0.3% ( -6% - 5%) 0.746
Respell 32.27 (2.4%) 32.19 (2.2%) -0.2% ( -4% - 4%) 0.761
BrowseMonthTaxoFacets 2.87 (1.4%) 2.86 (1.2%) -0.2% ( -2% - 2%) 0.654
PKLookup 142.74 (2.4%) 142.50 (1.9%) -0.2% ( -4% - 4%) 0.809
AndHighMed 53.54 (2.3%) 53.45 (2.2%) -0.2% ( -4% - 4%) 0.829
BrowseDayOfYearSSDVFacets 2.33 (2.4%) 2.32 (2.7%) -0.1% ( -5% - 5%) 0.872
MedTermDayTaxoFacets 6.54 (4.4%) 6.53 (3.9%) -0.1% ( -8% - 8%) 0.925
OrNotHighLow 248.01 (1.8%) 247.71 (1.7%) -0.1% ( -3% - 3%) 0.825
BrowseRandomLabelSSDVFacets 1.61 (2.9%) 1.61 (3.4%) -0.1% ( -6% - 6%) 0.941
OrHighMedDayTaxoFacets 1.26 (5.9%) 1.26 (6.7%) -0.1% ( -11% - 13%) 0.978
Wildcard 362.00 (2.1%) 362.14 (2.1%) 0.0% ( -4% - 4%) 0.953
HighTermTitleBDVSort 3.22 (6.1%) 3.22 (6.6%) 0.1% ( -11% - 13%) 0.957
OrHighNotHigh 137.13 (6.5%) 137.32 (7.0%) 0.1% ( -12% - 14%) 0.948
AndHighHigh 13.35 (4.2%) 13.37 (4.5%) 0.2% ( -8% - 9%) 0.903
OrHighNotLow 198.64 (7.3%) 199.07 (7.7%) 0.2% ( -13% - 16%) 0.926
LowTerm 269.15 (2.5%) 270.97 (4.5%) 0.7% ( -6% - 7%) 0.558
Prefix3 231.85 (2.7%) 234.29 (1.8%) 1.1% ( -3% - 5%) 0.143
HighTermMonthSort 1592.67 (3.3%) 1611.63 (3.0%) 1.2% ( -4% - 7%) 0.233
MedTerm 277.73 (5.0%) 282.57 (8.5%) 1.7% ( -11% - 15%) 0.428
HighTerm 192.36 (5.2%) 196.16 (9.1%) 2.0% ( -11% - 17%) 0.400
BrowseDateSSDVFacets 0.68 (7.3%) 0.70 (5.6%) 2.2% ( -9% - 16%) 0.284
after disabling this for fields with positions, luceneutil perf looks pretty flat. I think it simply doesn't have any test cases that would exercise this. I wrote a small benchmark that indexes a lot of dense terms of different frequencies and then runs top 100 searches using randomly-generated boolean queries on them designed that exploit those fields. This does seem to show a significant difference: index size reduction of 25% and 30% improvement in QPS. This test is artificial; there's a fair amount of variance (although I used a fixed random seed and multiple runs), and the index is pretty small. But I think it shows promise, so I'll try to figure out how to get a wikipedia-based test in to luceneutil. Not sure what realistic data there is to support it in linedocs. Perhaps if we index Month and Year as docs-only fields we would see some impact on queries with those as filters?
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!
Perhaps if we index Month and Year as docs-only fields we would see some impact on queries with those as filters?
+1 -- did that show any measurable performance change?
I'm very surprised it's hard to measure gains from a dense postings encoding! It ought to be super fast when it can apply: O(1) skipping time. Admittedly our postings APIs are designed for iteration so there is some API overhead due to that (you have to scan for next set bit, but, Java has fast primitives that get down to the silicon for this, right)?
+1 -- did that show any measurable performance change?
Well, sort of -- I did index months to get some dense postings field and added "month tasks" by tacking on +month:Mar
and so on to existing tasks. A few tasks created that way showed some improvement, but most showed no change or substantially worse performance. I concluded that it's hard to get the gains without getting in the way of "normal" postings advancement?
The only thing I've done that showed a clear improvement was a microbenchmark where I created an index composed entirely of these doc-only fields and queried it with random boolean queries involving terms in those fields.
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the [email protected] list. Thank you for your contribution!