Try encoding very frequent terms using a dense bitmap
Description
In case we have very dense terms we may be able to do a better job encoding postings if we use a dense bitmap encoding, where every doc is represented with a 1 (if it has the term) or a 0 (if it does not). For example, there can be "boolean" fields used to indicate whether a doc is a parent or child doc (for use with ToParentBlockJoin).
In general, if a term has frequency P (docFreq/maxDoc), a dense encoding will require 1/P bits per posting (i.e. per doc+term). In contrast, our current PackedBits encoding's requirement varies with the distribution of the term among docs (since it encodes deltas). If the term's docs are uniformly distributed (one every 1/P docs), the maximum gap is minimized and the bitrate will be log2(1/P) bits per posting; if the distribution is less uniform it will require more. Modeling with a uniform random distribution we find that dense encoding can be more efficient for high P. I observed empirically that for P=1/2 PackedBits uses 3 bits per doc, for P=1/4, 4 bits/doc, and for P=1/8, 6 bits/doc. So there is a break-even at P=1/4, where the dense encoding and the packed-bits delta encoding have the same bitrate.
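The arithmetic above can be sketched as follows. This is just an illustration of the cost model, not code from the patch; the idealized delta formula assumes perfectly uniform spacing, which is why the empirically observed PackedBits rates above are higher:

```java
public class EncodingBreakEven {
    // Dense bitmap: one bit per doc in the range, so bits per posting = maxDoc/docFreq = 1/P.
    static double denseBitsPerPosting(double p) {
        return 1.0 / p;
    }

    // Idealized delta encoding of uniformly spaced postings: each gap is 1/P docs,
    // needing about log2(1/P) bits. Real PackedBits blocks are sized by the largest
    // gap in the block, so the observed rate is higher for non-uniform data.
    static double uniformDeltaBitsPerPosting(double p) {
        return Math.log(1.0 / p) / Math.log(2.0);
    }

    public static void main(String[] args) {
        for (double p : new double[] {0.5, 0.25, 0.125}) {
            System.out.printf("P=%.3f dense=%.2f delta(uniform)=%.2f%n",
                p, denseBitsPerPosting(p), uniformDeltaBitsPerPosting(p));
        }
    }
}
```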
A dense encoding can have other benefits, since it is very simple to read and supports random access within a block. I think we should try making a block-by-block decision on whether to use this encoding, and see whether it reduces query time and possibly improves compression as well.
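The random-access property can be seen in a small sketch (hypothetical classes, not Lucene's actual ones): checking whether a doc in the block's range has the term is a single mask-and-shift, with no need to scan deltas from the block start.

```java
// Hypothetical sketch: a dense block stores one bit per doc in the block's
// doc-id range, so membership checks are O(1).
public class DenseBlock {
    private final long[] bits;   // bitmap covering the block's doc range
    private final int baseDocId; // first doc id covered by this block

    DenseBlock(long[] bits, int baseDocId) {
        this.bits = bits;
        this.baseDocId = baseDocId;
    }

    // Random access within the block: no delta decoding required.
    boolean contains(int docId) {
        int offset = docId - baseDocId;
        return (bits[offset >>> 6] & (1L << (offset & 63))) != 0;
    }
}
```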
One question I have is how to indicate the dense encoding of a block. I see our blocks start with a single byte indicating the number of packed bits per doc: 0 means the block is totally dense, the lower 5 bits indicate the bits-per-doc, and the upper 3 are an exception count. Are we still using the exception count? Oh, I see: our utilities still support it, but it was removed from Lucene99PostingsFormat. Perhaps, at least for experimentation purposes, I can safely use one of those bits.
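One way the header byte could be repurposed, assuming the exception bits really are free; the field layout and names here are illustrative, not Lucene's actual code:

```java
// Sketch of the one-byte block header described above: the low 5 bits carry
// bits-per-doc (0 = fully dense run), and the high 3 bits historically carried
// an exception count. If those are unused, one of them could flag the new
// dense-bitmap encoding.
public class BlockHeader {
    // lowest of the 3 high bits, hypothetically repurposed as a dense flag
    static final int DENSE_FLAG = 0x20;

    static int write(int bitsPerValue, boolean dense) {
        assert bitsPerValue < 32;
        return (dense ? DENSE_FLAG : 0) | bitsPerValue;
    }

    static int bitsPerValue(int header) { return header & 0x1F; }
    static boolean isDense(int header)  { return (header & DENSE_FLAG) != 0; }
}
```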
I have some initial implementation working in BlockDocsEnum, but one thing I'm unsure about is whether to provide it in all of the PostingsEnum/ImpactsEnum specializations. I feel like this is only likely to be used and helpful for docs-only fields (maybe?), but is it also possible that such fields can be iterated over by these other enums? I see that postings() returns a BlockDocsEnum when there are no positions, yet in tests I see CheckIndex get a handle on an EverythingEnum (and other enums) over a test field indexed with no positions and no freqs.
I ran luceneutil over wikimediumall. The index size was slightly reduced:
65200 ../indices/baseline/facets
18923720 ../indices/baseline/index
18988924 ../indices/baseline
65204 ../indices/candidate/facets
18774956 ../indices/candidate/index
18840164 ../indices/candidate
In a microbenchmark where I indexed random docs-only postings, I saw a ~28% index size reduction.
Query performance does seem to have registered some actual change:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
OrHighNotLow 124.19 (6.0%) 111.98 (6.7%) -9.8% ( -21% - 3%) 0.000
LowSpanNear 1.50 (1.1%) 1.42 (1.2%) -4.8% ( -7% - -2%) 0.000
HighTermTitleSort 86.63 (3.0%) 82.70 (2.2%) -4.5% ( -9% - 0%) 0.000
MedIntervalsOrdered 3.25 (4.3%) 3.11 (4.3%) -4.1% ( -12% - 4%) 0.003
OrHighHigh 23.47 (6.7%) 22.61 (3.3%) -3.7% ( -12% - 6%) 0.029
LowIntervalsOrdered 4.20 (4.1%) 4.05 (4.1%) -3.5% ( -11% - 4%) 0.007
AndHighHigh 25.46 (8.5%) 24.57 (4.9%) -3.5% ( -15% - 10%) 0.114
BrowseRandomLabelTaxoFacets 2.05 (14.8%) 1.98 (11.0%) -3.4% ( -25% - 26%) 0.405
HighIntervalsOrdered 2.09 (5.3%) 2.02 (5.4%) -3.1% ( -13% - 7%) 0.063
HighSpanNear 4.25 (1.9%) 4.13 (2.0%) -2.8% ( -6% - 1%) 0.000
OrHighMed 43.34 (3.1%) 42.18 (2.1%) -2.7% ( -7% - 2%) 0.001
BrowseDateTaxoFacets 2.78 (7.6%) 2.70 (6.6%) -2.7% ( -15% - 12%) 0.234
BrowseDayOfYearTaxoFacets 2.81 (7.2%) 2.74 (6.2%) -2.5% ( -14% - 11%) 0.236
Prefix3 126.88 (2.3%) 123.78 (3.5%) -2.4% ( -8% - 3%) 0.009
MedSpanNear 11.93 (0.9%) 11.65 (1.1%) -2.3% ( -4% - 0%) 0.000
OrHighNotMed 141.45 (5.1%) 138.33 (7.0%) -2.2% ( -13% - 10%) 0.254
AndHighMed 36.62 (5.6%) 35.82 (3.1%) -2.2% ( -10% - 6%) 0.124
MedPhrase 67.69 (2.9%) 66.22 (2.6%) -2.2% ( -7% - 3%) 0.013
HighSloppyPhrase 10.38 (1.6%) 10.20 (1.5%) -1.8% ( -4% - 1%) 0.000
IntNRQ 8.57 (14.4%) 8.42 (16.1%) -1.8% ( -28% - 33%) 0.713
HighTerm 271.19 (4.0%) 266.87 (5.1%) -1.6% ( -10% - 7%) 0.271
MedSloppyPhrase 8.12 (1.9%) 8.00 (2.5%) -1.6% ( -5% - 2%) 0.028
HighPhrase 39.43 (3.8%) 38.94 (3.1%) -1.2% ( -7% - 5%) 0.257
MedTerm 235.50 (3.4%) 232.58 (4.7%) -1.2% ( -9% - 7%) 0.339
LowPhrase 46.81 (2.8%) 46.27 (2.3%) -1.2% ( -6% - 4%) 0.157
OrHighNotHigh 147.42 (4.7%) 145.78 (6.2%) -1.1% ( -11% - 10%) 0.525
TermDTSort 88.33 (2.8%) 87.38 (1.8%) -1.1% ( -5% - 3%) 0.151
HighTermDayOfYearSort 152.37 (2.1%) 150.79 (1.8%) -1.0% ( -4% - 2%) 0.093
LowTerm 254.01 (1.9%) 251.72 (2.6%) -0.9% ( -5% - 3%) 0.207
LowSloppyPhrase 24.52 (0.9%) 24.32 (1.4%) -0.8% ( -3% - 1%) 0.029
OrNotHighHigh 199.37 (3.8%) 197.74 (4.9%) -0.8% ( -9% - 8%) 0.557
HighTermMonthSort 1581.75 (2.6%) 1569.14 (2.1%) -0.8% ( -5% - 4%) 0.292
OrNotHighMed 134.43 (2.7%) 133.51 (3.3%) -0.7% ( -6% - 5%) 0.471
OrHighLow 279.41 (2.1%) 277.84 (2.2%) -0.6% ( -4% - 3%) 0.412
Fuzzy1 64.73 (1.5%) 64.48 (0.7%) -0.4% ( -2% - 1%) 0.302
OrHighMedDayTaxoFacets 3.84 (6.3%) 3.83 (5.4%) -0.4% ( -11% - 12%) 0.845
AndHighMedDayTaxoFacets 31.84 (1.2%) 31.74 (1.5%) -0.3% ( -2% - 2%) 0.444
Fuzzy2 36.90 (1.3%) 36.80 (0.8%) -0.3% ( -2% - 1%) 0.383
BrowseRandomLabelSSDVFacets 1.57 (5.5%) 1.57 (3.8%) -0.2% ( -9% - 9%) 0.906
PKLookup 140.43 (1.7%) 140.30 (2.1%) -0.1% ( -3% - 3%) 0.876
AndHighLow 279.44 (2.2%) 279.34 (2.3%) -0.0% ( -4% - 4%) 0.958
OrNotHighLow 345.34 (1.7%) 345.21 (1.9%) -0.0% ( -3% - 3%) 0.948
Respell 33.36 (1.5%) 33.38 (1.3%) 0.1% ( -2% - 2%) 0.881
MedTermDayTaxoFacets 10.12 (2.4%) 10.13 (2.4%) 0.1% ( -4% - 4%) 0.912
BrowseDayOfYearSSDVFacets 2.32 (5.4%) 2.33 (3.3%) 0.1% ( -8% - 9%) 0.953
HighTermTitleBDVSort 4.74 (3.3%) 4.74 (4.0%) 0.1% ( -6% - 7%) 0.902
Wildcard 136.61 (2.5%) 136.82 (2.2%) 0.2% ( -4% - 4%) 0.831
BrowseDateSSDVFacets 0.68 (13.1%) 0.68 (13.0%) 0.4% ( -22% - 30%) 0.928
BrowseMonthTaxoFacets 2.84 (3.8%) 2.87 (1.3%) 1.1% ( -3% - 6%) 0.207
BrowseMonthSSDVFacets 2.38 (5.1%) 2.41 (4.0%) 1.3% ( -7% - 11%) 0.362
AndHighHighDayTaxoFacets 3.23 (3.6%) 3.34 (2.9%) 3.5% ( -2% - 10%) 0.001
So this looks positive. I can try tuning the decision parameter that controls which encoding to use, to see what impact that has. I do wonder whether the added complexity is worth chasing this, but I'm pretty encouraged that the overhead of the conditionals isn't overwhelming the "within-block skipping" this affords.
Also -- the slowdown for AndHighHighDayTaxoFacets counters the overall trend. I wonder what's going on there.

The results are kind of noisy -- on re-running, AndHighHighDayTaxoFacets didn't show much change, but there was a ~8.4% regression for AndHighMedDayTaxoFacets, and the cast of characters on the plus side also looks somewhat different.
I tried increasing the usage of dense encoding by enabling it whenever it would consume up to 3/2 as many bits as the packed-bits encoding, rather than only when it would use at most the same amount. The resulting wikipedia index is larger than the first candidate, but still smaller than the baseline. Query performance does seem somewhat improved. I'm going to stop posting walls of numbers here until I have a chance to try some further improvements:
- writing only the bytes needed rather than full-width longs
- possibly we can save some time in advance() by combining bit-counting with finding the next set bit
- maybe there is a way to save the overhead of the conditionals that determine which block encoding to decode, e.g. by introducing a block reader, although then we might trade the conditional for a function-call overhead
- I want to test with other Amazon indexes as well
- It would be nice to have a theory about why the faceting test cases seem to see worse perf. I guess they are different in that they do not use top-N collection, so no score-based skipping?
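The second bullet -- combining bit-counting with finding the next set bit -- might look something like this. A hypothetical sketch assuming the dense block is a long[] bitmap; a single pass over the words both counts preceding postings (which freq/impact bookkeeping would need) and locates the target bit:

```java
// Sketch of advance() over a dense block: walk the words once, accumulating the
// posting index (bit count) while locating the first set bit at or after target.
public class DenseAdvance {
    // Returns the doc offset of the first set bit >= targetOffset, or -1 if none.
    // postingIndex[0] receives the number of set bits before the returned one.
    // Assumes targetOffset is within the block's range.
    static int advance(long[] bits, int targetOffset, int[] postingIndex) {
        int count = 0;
        int word = targetOffset >>> 6;
        // count set bits in the words fully before the target word
        for (int w = 0; w < word; w++) {
            count += Long.bitCount(bits[w]);
        }
        long mask = -1L << (targetOffset & 63);
        long current = bits[word] & mask;          // bits at or after the target
        count += Long.bitCount(bits[word] & ~mask); // bits before the target
        while (true) {
            if (current != 0) {
                postingIndex[0] = count;
                return (word << 6) + Long.numberOfTrailingZeros(current);
            }
            if (++word == bits.length) {
                return -1;
            }
            current = bits[word];
        }
    }
}
```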
> I am seeing CheckIndex gets a handle on an EverythingEnum (and other enums) over a test field indexed with no positions and no freqs.

Hmm, does CheckIndex pull all the various enums (docs only, docs+freqs, docs+freqs+positions, etc.) when testing? That might be a checking gap if not.
> also -- the slowdown for AndHighHighDayTaxoFacets counters the overall trend. I wonder what's going on there.

Wait -- this task got faster, right? And some others got slower, e.g. OrHighNotLow?
Oh dear, you are correct -- I was reading this table backwards! OK, so perhaps this "optimization" is not really helping very much at query time. Still, it might be possible to squeeze some more juice from it.
I really love this idea! And it's wild that it's net/net reducing enwiki index size even with a higher-than-expected cutover criterion to the dense encoding!