zstd
[WIP] Minor compression ratio improvements for high compression modes on small data
This PR bundles 3 modifications for high compression modes, targeting higher compression ratio on "small" (one-block) data. The benefits may extend beyond the first block, but they vanish quickly (and so does the cost).
- Replace "default" initial statistics by a first pass using `greedy` algorithms to collect initial statistics in `btultra`. This step is globally positive, though benefits vary widely depending on the file. It comes at a speed cost though, of around -10% on one-block data (the speed cost rapidly becomes insignificant for large inputs, like `silesia.tar`). Therefore, it was decided not to use this new method for `btopt`.
- Change the way statistics are transferred from the initial run. This modification impacts `btultra2`, which uses `btultra` as an initializer. The impact is small, but consistently positive, and there is no significant cost.
- Chain `greedy -> btultra -> btultra2` initialization. The impact is globally positive, but not always. There are wide variations depending on file type and input size. For this reason, the proposed heuristic is to enable the new method for inputs >= 8 KB, as the impact is rather positive for larger blocks, and rather negative for small ones. The speed cost, while not null, is not significant.
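The three modifications combine into a simple selection heuristic. Below is a hypothetical Python sketch of that decision logic (the function name and return labels are illustrative, not actual zstd identifiers); the 8 KB cutoff applies to the chained `btultra2` path as described above, and `btultra` is shown unconditionally using the greedy pass, which is an assumption of this sketch.

```python
# Hypothetical sketch of the proposed initialization selection.
# Names and labels are illustrative only, not zstd identifiers.

GREEDY_INIT_MIN_SRC_SIZE = 8 * 1024  # proposed 8 KB cutoff

def choose_init(strategy: str, src_size: int) -> str:
    """Pick how optimal-parser statistics are seeded for the first block."""
    if strategy == "btultra":
        # greedy first pass collects initial statistics (not used for btopt)
        return "greedy"
    if strategy == "btultra2":
        # btultra2 seeds from a btultra pass; that pass is itself seeded
        # by greedy only for inputs >= 8 KB, where the impact is positive
        if src_size >= GREEDY_INIT_MIN_SRC_SIZE:
            return "greedy -> btultra"
        return "default -> btultra"
    # btopt and other strategies keep the "blind" default statistics
    return "default"
```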
## 1. Using `greedy` to collect initial statistics for `btultra`

This step is globally beneficial, by non-negligible amounts. But there is also a sizable speed cost to it, so the trade-off is debatable. I'm rather in favor of it, since speed is less of a concern at `btultra` levels, but this cost also explains why the technique is not extended to the faster variant `btopt`.
For benchmarking, I'm using multiple block sizes, in order to observe the impact of initialization. This is compared to previous PR #2771, which updated the initialization of `btultra` with slightly altered starting values, producing better compression ratios on small data.
### 2 KB blocks, btultra, level 14
file | dev size | ultra_init size | size diff | CR diff (%) | dev speed | ultra_init speed | speed diff (%) |
---|---|---|---|---|---|---|---|
14#mr | 4185078 | 4209573 | 24495 | -0.59% | 19.1 | 17 | -10.99% |
14#webster | 18304250 | 18261816 | -42434 | 0.23% | 10.8 | 10.19 | -5.65% |
14#nci | 5543695 | 5529530 | -14165 | 0.26% | 5.77 | 5.96 | 3.29% |
14#reymont | 2721845 | 2716265 | -5580 | 0.21% | 10.5 | 9.66 | -8.00% |
14#osdb | 7766458 | 7760004 | -6454 | 0.08% | 19.2 | 16.4 | -14.58% |
14#x-ray | 6505968 | 6463068 | -42900 | 0.66% | 22 | 19 | -13.64% |
14#mozilla | 22692372 | 22690769 | -1603 | 0.01% | 11 | 10.5 | -4.55% |
14#dickens | 5322380 | 5314059 | -8321 | 0.16% | 14.1 | 12.3 | -12.77% |
14#xml | 1565521 | 1565338 | -183 | 0.01% | 8.89 | 8.75 | -1.57% |
14#sao | 6073600 | 6070141 | -3459 | 0.06% | 19.8 | 16.5 | -16.67% |
14#ooffice | 3712546 | 3712520 | -26 | 0.00% | 13.9 | 12.5 | -10.07% |
14#samba | 8169234 | 8165127 | -4107 | 0.05% | 11.7 | 11.3 | -3.42% |
total | 92562947 | 92458210 | -104737 | 0.11% | | | -8.22% |
At 2 KB block size, the impact is small but globally positive, saving ~100 KB. But it comes at a cost, with speed down by -8.2% on average, due to the heavier initialization process using `greedy` instead of "blind" default values. This initialization cost varies very widely depending on the exact file. I'm unable to explain some of the most positive changes, such as a speed improvement for `nci` (+3.3%), but it was tested multiple times to confirm the timing, and all measurements are consistently accurate within ~0.2%, so it's not a fluke.
The best impact is achieved on `x-ray` (ratio +0.66%), but there is one negative impact, on `mr`, which is pretty large (ratio -0.59%). I haven't yet been able to find a clear explanation for this case.
Follow-up investigation: the large negative impact on `mr` seems strongly correlated with the difference in initialization of literals. Essentially, with the "default" literals initialization "from source", the resulting literals statistics are a lot flatter than when using the outcome of `greedy` (which seems to leave a lot of literals). For some reason, flatter literals statistics seem better for `mr`. It is unclear whether this is a general conclusion, or really a specificity of `mr`.
One possible explanation could be that `greedy` leaves too many literals, thus producing more squeezed statistics favoring the more frequent literals, making them cheaper and hence preferable. These frequent literals might be over-represented, possibly as a consequence of `greedy` being unable to find 3-byte matches and thus eliminate the corresponding literals. This impact is of course very variable depending on the file, and it's also possible that in some other cases, the literals probabilities produced by `greedy` are accurate enough, maybe because there aren't many 3-byte matches worth selecting, and trying to artificially distort these statistics, typically by flattening them, ends up hurting compression ratio (ex: `nci`).
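The "squeezed vs. flat" effect can be illustrated with a toy entropy model (this is not zstd code; the functions and weights are illustrative assumptions): an optimal parser prices each literal at roughly -log2(probability) bits, so squeezed statistics make frequent literals cheaper and rare ones more expensive, while flattening evens those costs out.

```python
import math

def literal_costs(freqs):
    """Per-symbol cost in bits under a simple entropy model:
    cost(s) = -log2(freq(s) / total)."""
    total = sum(freqs.values())
    return {s: -math.log2(f / total) for s, f in freqs.items()}

def flatten(freqs, weight=0.5):
    """Mix observed frequencies with a uniform distribution.
    weight=0.5 is an arbitrary illustrative choice."""
    total = sum(freqs.values())
    uniform = total / len(freqs)
    return {s: (1 - weight) * f + weight * uniform for s, f in freqs.items()}

# Squeezed statistics (greedy-like, dominated by a few frequent literals)
peaked = {"a": 90, "b": 5, "c": 5}
# Flatter statistics (default-like initialization)
flat = flatten(peaked)

# Flattening makes the frequent literal pricier and the rare ones
# cheaper, which can steer the parser toward different paths.
```

Which direction helps depends on the file, matching the `mr` vs. `nci` observations above.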
### 16 KB blocks, btultra, level 14
file | dev size | ultra_init size | size diff | CR diff (%) | dev speed | ultra_init speed | speed diff (%) |
---|---|---|---|---|---|---|---|
14#mr | 3701738 | 3694493 | -7245 | 0.20% | 14.4 | 12.9 | -10.42% |
14#webster | 13689710 | 13681871 | -7839 | 0.06% | 7.56 | 7.29 | -3.57% |
14#nci | 3340424 | 3326544 | -13880 | 0.42% | 5.24 | 5.41 | 3.24% |
14#reymont | 2070974 | 2072650 | 1676 | -0.08% | 7.26 | 6.82 | -6.06% |
14#osdb | 5477403 | 5482835 | 5432 | -0.10% | 12.2 | 11.4 | -6.56% |
14#x-ray | 5952779 | 5914123 | -38656 | 0.65% | 19.7 | 16.7 | -15.23% |
14#mozilla | 18802426 | 18756948 | -45478 | 0.24% | 9.15 | 8.87 | -3.06% |
14#dickens | 4301141 | 4302009 | 868 | -0.02% | 9.55 | 8.64 | -9.53% |
14#xml | 849959 | 849524 | -435 | 0.05% | 7.44 | 7.4 | -0.54% |
14#sao | 5401683 | 5396945 | -4738 | 0.09% | 15.8 | 13.4 | -15.19% |
14#ooffice | 3132794 | 3123988 | -8806 | 0.28% | 11.5 | 10.6 | -7.83% |
14#samba | 6013161 | 6002210 | -10951 | 0.18% | 9.51 | 9.35 | -1.68% |
total | 72734192 | 72604140 | -130052 | 0.18% | | | -6.37% |
At 16 KB block size, there are a few improvements: compression benefits increase to 130 KB, now representing a 0.18% ratio improvement, and the speed cost is slightly lower, at -6.4%.
The best case remains `x-ray`, with a ratio gain of +0.65%, but there are many changes among the other files. `mozilla`, which used to be neutral at 2K, is now a big positive contributor at 16K. `mr`, which was negative at 2K, is now positive at 16K. A few files which were slightly positive at 2K become slightly negative at 16K. So it's all over the place. It's difficult to draw conclusions from these details, other than that it's "globally more positive" at 16K.
`nci` continues to defy expectations by running faster despite a heavier initialization process.
### 128 KB blocks, btultra, level 17
file | dev size | ultra_init size | size diff | CR diff (%) | dev speed | ultra_init speed | speed diff (%) |
---|---|---|---|---|---|---|---|
17#mr | 3488777 | 3458279 | -30498 | 0.87% | 9.46 | 7.02 | -25.79% |
17#webster | 11451515 | 11451879 | 364 | 0.00% | 5.24 | 4.64 | -11.45% |
17#nci | 2381888 | 2375040 | -6848 | 0.29% | 3.08 | 3.27 | 6.17% |
17#reymont | 1704086 | 1708679 | 4593 | -0.27% | 4.8 | 3.83 | -20.21% |
17#osdb | 3717307 | 3725549 | 8242 | -0.22% | 7.02 | 6.46 | -7.98% |
17#x-ray | 5604611 | 5562016 | -42595 | 0.76% | 12.9 | 10.6 | -17.83% |
17#mozilla | 17205502 | 17098659 | -106843 | 0.62% | 6.41 | 6.13 | -4.37% |
17#dickens | 3644846 | 3644167 | -679 | 0.02% | 5.95 | 4.74 | -20.34% |
17#xml | 616638 | 615911 | -727 | 0.12% | 3.71 | 3.81 | 2.70% |
17#sao | 5142328 | 5144684 | 2356 | -0.05% | 10.03 | 7.62 | -24.03% |
17#ooffice | 2884506 | 2874385 | -10121 | 0.35% | 8.05 | 7.17 | -10.93% |
17#samba | 4912365 | 4888594 | -23771 | 0.48% | 5.63 | 5.46 | -3.02% |
total | 62754369 | 62547842 | -206527 | 0.33% | | | -11.42% |
At 128 KB block size, the compression benefits increase even more, to > 200 KB, now representing a +0.33% ratio improvement, which is quite sizable for a mere difference in statistics initialization.
The majority of the gains come from `mozilla`, `mr` and `x-ray`. `mr` is the new champion of compression ratio gains, at +0.87%. It's all the more remarkable as it was previously the only file in negative territory at 2K blocks. There are a few files which lose compression ratio (compared to default initialization), though never by large amounts. Once again, it's difficult to pinpoint a single clearly reproducible reason.
The average speed cost increases to -11.4%, with large differences depending on the exact file. The worst speed costs are on `mr` and `sao`, with drops of almost -25%, while `nci` and `xml` end up running faster despite the much heavier initialization.
It might be possible to reduce the speed cost by making initialization faster, either by searching less or by compressing only a portion of the input, but this would likely impact compression ratio, so it would have to be carefully monitored.
Follow-up investigation: limiting statistics analysis to the first 64 KB of the first block reduces the speed cost appreciably (by more than half). However, it also loses part of the compression ratio gain in the process (between 1/4 and 1/3).
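That capped-analysis experiment amounts to scanning only a bounded prefix during the greedy pass. A minimal sketch (names are hypothetical, not zstd identifiers):

```python
# Hypothetical sketch of the follow-up experiment: run the greedy
# statistics pass over at most a 64 KB prefix of the first block.
STATS_ANALYSIS_CAP = 64 * 1024  # cap on bytes scanned for statistics

def analysis_span(block_size: int) -> int:
    """Number of bytes actually scanned by the greedy statistics pass."""
    return min(block_size, STATS_ANALYSIS_CAP)
```

Since the greedy pass cost is roughly proportional to the bytes scanned, a 64 KB cap on a 128 KB block halves the analyzed region, which is consistent with the observed cost reduction of more than half.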
## 2. `btultra2`: modify statistics collection
`btultra2` runs the first block with `btultra`, then throws away the result, but keeps the statistics, and restarts compression of the first block using these baseline statistics.
The second modification changes the statistics transmission logic, rebuilding them from the `seqStore`, instead of simply inheriting the ones which are naturally stored in the `optPtr` as a consequence of running `btultra`.
This results in a small, yet consistently positive, compression ratio gain for `btultra2`, at essentially no cost.
Note that, in this scenario, the first round of `btultra` still uses the default "blind" initialization (same as `dev`), not the `greedy` first pass mentioned in the earlier part.
The benefit though is pretty small, and barely worth mentioning (I'm not detailing benchmarks here), averaging a 0.01% ratio improvement. What makes it desirable is that it comes for free, and is a necessary ingredient for stage 3 to work properly.
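The rebuild-from-`seqStore` idea can be sketched as follows. This is an illustrative Python rendering only, not zstd's actual `seqStore_t` layout or field names: instead of inheriting the running counters accumulated while `btultra` executed, recount the frequency tables directly from the sequences it emitted, so the statistics describe exactly what was stored.

```python
from collections import Counter

def rebuild_stats(sequences, literals):
    """Recount statistics from emitted output (illustrative sketch).

    sequences: list of (lit_len, match_len, offset) tuples, a simplified
               stand-in for zstd's stored sequences.
    literals:  bytes object holding all raw literals emitted.
    """
    lit_freq = Counter(literals)                       # literal byte values
    ll_freq = Counter(ll for ll, _, _ in sequences)    # literal lengths
    ml_freq = Counter(ml for _, ml, _ in sequences)    # match lengths
    of_freq = Counter(of for _, _, of in sequences)    # offsets
    return lit_freq, ll_freq, ml_freq, of_freq
```

The recounted tables then seed the second `btultra2` pass, rather than the counters that accumulated incrementally during the first pass.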
## 3. `btultra2`: initialize statistics from `btultra`, itself initialized by `greedy`
So that's more or less the intended end game of this exercise. Let's see if, with a first round using `greedy`, the statistics produced by `btultra` get better, resulting in a better final compression ratio for `btultra2`.
### 2K blocks, btultra2, level 19
file | newInit | fromGreedy | diff | CR gain % |
---|---|---|---|---|
19#webster | 18258328 | 18249997 | -8331 | 0.05% |
19#mozilla | 22656227 | 22664242 | 8015 | -0.04% |
19#mr | 4162214 | 4184597 | 22383 | -0.54% |
19#dickens | 5307928 | 5309449 | 1521 | -0.03% |
19#x-ray | 6418504 | 6419605 | 1101 | -0.02% |
19#ooffice | 3705876 | 3706233 | 357 | -0.01% |
19#xml | 1563494 | 1564207 | 713 | -0.05% |
19#sao.zs | 4047944 | 4047920 | -24 | 0.00% |
19#samba | 8156414 | 8154485 | -1929 | 0.02% |
19#nci | 5506582 | 5502844 | -3738 | 0.07% |
19#sao | 6066564 | 6065637 | -927 | 0.02% |
19#reymont | 2713590 | 2712485 | -1105 | 0.04% |
19#osdb | 7757177 | 7755582 | -1595 | 0.02% |
total | 96320842 | 96337283 | 16441 | -0.02% |
### 16K blocks, btultra2, level 19
file | newInit | fromGreedy | diff | CR gain % |
---|---|---|---|---|
19#webster | 13651795 | 13659077 | 7282 | -0.05% |
19#mozilla | 18754305 | 18722807 | -31498 | 0.17% |
19#mr | 3666308 | 3666219 | -89 | 0.00% |
19#dickens | 4282713 | 4293499 | 10786 | -0.25% |
19#x-ray | 5895473 | 5891200 | -4273 | 0.07% |
19#ooffice | 3120827 | 3118902 | -1925 | 0.06% |
19#xml | 843970 | 844599 | 629 | -0.07% |
19#sao.zs | 4028682 | 4028694 | 12 | 0.00% |
19#samba | 6002451 | 5994430 | -8021 | 0.13% |
19#nci | 3264463 | 3259468 | -4995 | 0.15% |
19#sao | 5387448 | 5388448 | 1000 | -0.02% |
19#reymont | 2062941 | 2063663 | 722 | -0.03% |
19#osdb | 5473816 | 5474898 | 1082 | -0.02% |
total | 76435192 | 76405904 | -29288 | 0.04% |
### 128K blocks, btultra2, level 19
file | newInit | fromGreedy | diff | CR gain % |
---|---|---|---|---|
19#webster | 11419219 | 11425166 | 5947 | -0.05% |
19#mozilla | 17149792 | 17088589 | -61203 | 0.36% |
19#mr | 3452739 | 3445413 | -7326 | 0.21% |
19#dickens | 3632427 | 3637405 | 4978 | -0.14% |
19#x-ray | 5737386 | 5522651 | -214735 | 3.74% |
19#ooffice | 2877489 | 2874593 | -2896 | 0.10% |
19#xml | 614969 | 614973 | 4 | 0.00% |
19#sao.zs | 4025964 | 4027040 | 1076 | -0.03% |
19#samba | 4916583 | 4890852 | -25731 | 0.52% |
19#nci | 2365886 | 2360953 | -4933 | 0.21% |
19#sao | 5132210 | 5138098 | 5888 | -0.11% |
19#reymont | 1698111 | 1700230 | 2119 | -0.12% |
19#osdb | 3715581 | 3716201 | 620 | -0.02% |
total | 66738356 | 66442164 | -296192 | 0.44% |
So that's interesting. For some reason, this technique doesn't work well for small blocks (2 KB). It does however tend to improve compression ratio for larger blocks, by a measurable margin, reaching an almost 300 KB gain at 128 KB blocks. That's a 0.44% ratio advantage, which is quite a lot for high compression modes.
For this reason, the logic proposed in this PR has a cutoff point at 8 KB block size: below that threshold, start with default statistics; above it, initialize with `greedy`.
Let's be clear though that this is not a universal gain: it depends on the file. In this sample set, the largest gain is achieved by `x-ray` at 128 KB, a staggering +3.7% compression ratio gain, which is almost oversized for a simple alteration of initial statistics. It's followed by `samba` (+0.52%) and `mozilla` (+0.36%), which are pretty respectable.
On the flip side, `dickens` loses -0.14% and `reymont` -0.12%. So this is not universal.
Results at 16 KB are much more tame, with `x-ray` gains reduced to +0.07%, hence barely relevant, while `dickens` losses increase to -0.25%. The total of the 16 KB changes still achieves a compression ratio benefit, but only +0.04%, which is limited.
The oversized `x-ray` gains deserve an explanation. Sadly, I'm reduced to guesses right now. It likely receives a correct hint from the initial statistics built with `greedy`, making it capable of finding a better global path. A classical example would be a path which favors the usage of repcodes, over another path consisting of larger yet more costly matches. This cheaper global path would essentially be rejected due to wrong cost estimation when using the more standard (`dev`) initialization. It adds up over the full block.
However, for such explanation, I would have expected some good results at 16 KB block size too. Why it needs larger block size to really show up is unclear.
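The repcode-vs-far-match trade-off above can be made concrete with a toy cost model. This is illustrative only, not zstd's actual price function: assume offsets cost roughly log2(offset) bits while a repcode costs a small constant, so a shorter repcode match can beat a longer far-offset match on a per-byte basis once costs are estimated correctly.

```python
import math

def match_cost_bits(match_len, offset, is_repcode):
    """Toy price model (not zstd's): repcodes pay a small constant for
    the offset, regular matches pay ~log2(offset) bits."""
    offset_bits = 2.0 if is_repcode else math.log2(offset) + 1
    length_bits = math.log2(match_len) + 1
    return offset_bits + length_bits

rep_cost = match_cost_bits(8, 1, True)          # repcode, 8-byte match
far_cost = match_cost_bits(12, 1 << 17, False)  # 12-byte match, far offset

# Per covered byte, the repcode path is cheaper under this model; a parser
# seeded with accurate statistics can prefer it, while misleading initial
# statistics may cause the same path to be wrongly rejected.
```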
Follow-up investigation: it seems `x-ray` is "a medical picture of a child's hand stored in 12-bit gray scaled image"; maybe it needs a certain size for multiple image lines to be present, hence for repcodes to start making an impact (?). This is highly speculative though: as the image is "large" (8 MB) and compresses poorly, it's difficult to imagine +3.7% compression ratio gains just from repcodes targeting the previous line.
(Side note: any speed cost due to `greedy` initialization is pretty minimal and irrelevant at this compression level.)
### Is this PR ready for review now?

Yes, it can be reviewed. It's just not ready for "merge".