zstd
[WIP] Minor compression ratio improvements for high compression modes on small data
This PR bundles 3 modifications for high compression modes, targeting higher compression ratio on "small" (one-block) data. The benefits may extend beyond the first block, but they vanish quickly (and so does the cost).
- Replace "default" initial statistics by a first pass using `greedy` algorithms to collect initial statistics in `btultra`. This step is globally positive, though benefits vary widely depending on the file. It comes at a speed cost though, of around -10% on one-block data (the speed cost rapidly becomes insignificant for large inputs, like `silesia.tar`). Therefore, it was decided not to use this new method for `btopt`.
- Change the way statistics are transferred from the initial run. This modification impacts `btultra2`, which uses `btultra` as an initializer. The impact is small, but consistently positive, and there is no significant cost.
- Chain `greedy -> btultra -> btultra2` initialization. The impact is globally positive, but not always. There are wide variations depending on file type and input size. For this reason, the proposed heuristic is to enable the new method for inputs >= 8 KB, as the impact is rather positive for larger blocks, and rather negative for small ones. The speed cost, while not null, is not significant.
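The three modifications combine into a simple selection heuristic. Below is a hypothetical Python sketch of that decision logic (the function name and return labels are illustrative, not actual zstd identifiers); the 8 KB cutoff applies to the chained `btultra2` path as described above, and `btultra` is shown unconditionally using the greedy pass, which is an assumption of this sketch.

```python
# Hypothetical sketch of the proposed initialization selection.
# Names and labels are illustrative only, not zstd identifiers.

GREEDY_INIT_MIN_SRC_SIZE = 8 * 1024  # proposed 8 KB cutoff

def choose_init(strategy: str, src_size: int) -> str:
    """Pick how optimal-parser statistics are seeded for the first block."""
    if strategy == "btultra":
        # greedy first pass collects initial statistics (not used for btopt)
        return "greedy"
    if strategy == "btultra2":
        # btultra2 seeds from a btultra pass; that pass is itself seeded
        # by greedy only for inputs >= 8 KB, where the impact is positive
        if src_size >= GREEDY_INIT_MIN_SRC_SIZE:
            return "greedy -> btultra"
        return "default -> btultra"
    # btopt and other strategies keep the "blind" default statistics
    return "default"
```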
## 1. Using `greedy` to collect initial statistics for `btultra`

This step is globally beneficial, by non-negligible amounts. But there is also a sizable speed cost to it, so the trade-off is debatable. I'm rather in favor of it, since speed is less of a concern at `btultra` levels, but this cost also explains why the technique is not extended to the faster variant `btopt`.
For benchmarking, I'm using multiple block sizes, in order to observe the impact of initialization. This is compared to previous PR #2771, which updated the initialization of `btultra` with slightly altered starting values, producing better compression ratios on small data.
### 2 KB blocks, btultra, level 14
file | dev size | ultra_init size | size diff | CR diff (%) | dev speed | ultra_init speed | speed diff (%) |
---|---|---|---|---|---|---|---|
14#mr | 4185078 | 4209573 | 24495 | -0.59% | 19.1 | 17 | -10.99% |
14#webster | 18304250 | 18261816 | -42434 | 0.23% | 10.8 | 10.19 | -5.65% |
14#nci | 5543695 | 5529530 | -14165 | 0.26% | 5.77 | 5.96 | 3.29% |
14#reymont | 2721845 | 2716265 | -5580 | 0.21% | 10.5 | 9.66 | -8.00% |
14#osdb | 7766458 | 7760004 | -6454 | 0.08% | 19.2 | 16.4 | -14.58% |
14#x-ray | 6505968 | 6463068 | -42900 | 0.66% | 22 | 19 | -13.64% |
14#mozilla | 22692372 | 22690769 | -1603 | 0.01% | 11 | 10.5 | -4.55% |
14#dickens | 5322380 | 5314059 | -8321 | 0.16% | 14.1 | 12.3 | -12.77% |
14#xml | 1565521 | 1565338 | -183 | 0.01% | 8.89 | 8.75 | -1.57% |
14#sao | 6073600 | 6070141 | -3459 | 0.06% | 19.8 | 16.5 | -16.67% |
14#ooffice | 3712546 | 3712520 | -26 | 0.00% | 13.9 | 12.5 | -10.07% |
14#samba | 8169234 | 8165127 | -4107 | 0.05% | 11.7 | 11.3 | -3.42% |
total | 92562947 | 92458210 | -104737 | 0.11% | | | -8.22% |
At 2 KB block size, the impact is small but globally positive, saving ~100 KB. But it comes at a cost, with speed down by -8.2% on average, due to the heavier initialization process using `greedy` instead of "blind" default values. This initialization cost varies very widely depending on the exact file. I'm unable to explain some of the most positive changes, such as a speed improvement for `nci` (+3.3%), but it was tested multiple times to confirm the timing, and all measurements are consistently accurate within ~0.2%, so it's not a fluke.
The best impact is achieved on `x-ray` (ratio +0.66%), but there is one negative impact, on `mr`, which is pretty large (ratio -0.59%). I haven't yet been able to find a clear explanation for this case.
Follow-up investigation: the large negative impact on `mr` seems strongly correlated with the difference in initialization of literals. Essentially, with the "default" literals initialization "from source", the resulting literals statistics are a lot flatter than when using the outcome of `greedy` (which seems to leave a lot of literals). For some reason, flatter literals statistics seem better for `mr`. It is unclear whether this is a general conclusion, or really a specificity of `mr`.
One possible explanation could be that `greedy` leaves too many literals, thus producing more squeezed statistics favoring the more frequent literals, making them cheaper and hence preferable. These frequent literals might be over-represented, possibly as a consequence of `greedy` being unable to find 3-byte matches and thus eliminate the corresponding literals. This impact is of course very variable depending on the file, and it's also possible that in some other cases, the literals probabilities produced by `greedy` are accurate enough, maybe because there aren't many 3-byte matches worth selecting, and trying to artificially distort these statistics, typically by flattening them, ends up hurting compression ratio (ex: `nci`).
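The "squeezed vs. flat" effect can be illustrated with a toy entropy model (this is not zstd code; the functions and weights are illustrative assumptions): an optimal parser prices each literal at roughly -log2(probability) bits, so squeezed statistics make frequent literals cheaper and rare ones more expensive, while flattening evens those costs out.

```python
import math

def literal_costs(freqs):
    """Per-symbol cost in bits under a simple entropy model:
    cost(s) = -log2(freq(s) / total)."""
    total = sum(freqs.values())
    return {s: -math.log2(f / total) for s, f in freqs.items()}

def flatten(freqs, weight=0.5):
    """Mix observed frequencies with a uniform distribution.
    weight=0.5 is an arbitrary illustrative choice."""
    total = sum(freqs.values())
    uniform = total / len(freqs)
    return {s: (1 - weight) * f + weight * uniform for s, f in freqs.items()}

# Squeezed statistics (greedy-like, dominated by a few frequent literals)
peaked = {"a": 90, "b": 5, "c": 5}
# Flatter statistics (default-like initialization)
flat = flatten(peaked)

# Flattening makes the frequent literal pricier and the rare ones
# cheaper, which can steer the parser toward different paths.
```

Which direction helps depends on the file, matching the `mr` vs. `nci` observations above.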
### 16 KB blocks, btultra, level 14
file | dev size | ultra_init size | size diff | CR diff (%) | dev speed | ultra_init speed | speed diff (%) |
---|---|---|---|---|---|---|---|
14#mr | 3701738 | 3694493 | -7245 | 0.20% | 14.4 | 12.9 | -10.42% |
14#webster | 13689710 | 13681871 | -7839 | 0.06% | 7.56 | 7.29 | -3.57% |
14#nci | 3340424 | 3326544 | -13880 | 0.42% | 5.24 | 5.41 | 3.24% |
14#reymont | 2070974 | 2072650 | 1676 | -0.08% | 7.26 | 6.82 | -6.06% |
14#osdb | 5477403 | 5482835 | 5432 | -0.10% | 12.2 | 11.4 | -6.56% |
14#x-ray | 5952779 | 5914123 | -38656 | 0.65% | 19.7 | 16.7 | -15.23% |
14#mozilla | 18802426 | 18756948 | -45478 | 0.24% | 9.15 | 8.87 | -3.06% |
14#dickens | 4301141 | 4302009 | 868 | -0.02% | 9.55 | 8.64 | -9.53% |
14#xml | 849959 | 849524 | -435 | 0.05% | 7.44 | 7.4 | -0.54% |
14#sao | 5401683 | 5396945 | -4738 | 0.09% | 15.8 | 13.4 | -15.19% |
14#ooffice | 3132794 | 3123988 | -8806 | 0.28% | 11.5 | 10.6 | -7.83% |
14#samba | 6013161 | 6002210 | -10951 | 0.18% | 9.51 | 9.35 | -1.68% |
total | 72734192 | 72604140 | -130052 | 0.18% | | | -6.37% |
At 16 KB block size, there are a few improvements: compression benefits increase to 130 KB, now representing a 0.18% ratio improvement, and the speed cost is slightly lower, at -6.4%.
The best case remains `x-ray`, with a ratio gain of +0.65%, but there are many changes among the other files. `mozilla`, which used to be neutral at 2K, is now a big positive contributor at 16K. `mr`, which was negative at 2K, is now positive at 16K. A few files which were slightly positive at 2K become slightly negative at 16K. So it's all over the place. It's difficult to draw conclusions from these details, other than that it's "globally more positive" at 16K.
`nci` continues to defy expectations by running faster despite a heavier initialization process.
### 128 KB blocks, btultra, level 17
file | dev size | ultra_init size | size diff | CR diff (%) | dev speed | ultra_init speed | speed diff (%) |
---|---|---|---|---|---|---|---|
17#mr | 3488777 | 3458279 | -30498 | 0.87% | 9.46 | 7.02 | -25.79% |
17#webster | 11451515 | 11451879 | 364 | 0.00% | 5.24 | 4.64 | -11.45% |
17#nci | 2381888 | 2375040 | -6848 | 0.29% | 3.08 | 3.27 | 6.17% |
17#reymont | 1704086 | 1708679 | 4593 | -0.27% | 4.8 | 3.83 | -20.21% |
17#osdb | 3717307 | 3725549 | 8242 | -0.22% | 7.02 | 6.46 | -7.98% |
17#x-ray | 5604611 | 5562016 | -42595 | 0.76% | 12.9 | 10.6 | -17.83% |
17#mozilla | 17205502 | 17098659 | -106843 | 0.62% | 6.41 | 6.13 | -4.37% |
17#dickens | 3644846 | 3644167 | -679 | 0.02% | 5.95 | 4.74 | -20.34% |
17#xml | 616638 | 615911 | -727 | 0.12% | 3.71 | 3.81 | 2.70% |
17#sao | 5142328 | 5144684 | 2356 | -0.05% | 10.03 | 7.62 | -24.03% |
17#ooffice | 2884506 | 2874385 | -10121 | 0.35% | 8.05 | 7.17 | -10.93% |
17#samba | 4912365 | 4888594 | -23771 | 0.48% | 5.63 | 5.46 | -3.02% |
total | 62754369 | 62547842 | -206527 | 0.33% | | | -11.42% |
At 128 KB block size, the compression benefits increase even more, to > 200 KB, now representing a +0.33% ratio improvement, which is quite sizable for a mere difference in statistics initialization.
The majority of the gains come from `mozilla`, `mr` and `x-ray`. `mr` is the new champion of compression ratio gains, at +0.87%. It's all the more remarkable as it was previously the only file in negative territory at 2K blocks. There are a few files which lose compression ratio (compared to default initialization), though never by large amounts. Once again, it's difficult to pinpoint a single clearly reproducible reason.
The average speed cost increases to -11.4%, with large differences depending on the exact file. The worst speed costs are on `mr` and `sao`, with drops of almost -25%, while `nci` and `xml` end up running faster despite the much heavier initialization.
It might be possible to reduce the speed cost by making initialization faster, either by searching less or by compressing only a portion of the input, but this would likely impact compression ratio, so it would have to be carefully monitored.
Follow-up investigation: limiting statistics analysis to the first 64 KB of the first block reduces the speed cost appreciably (by more than half). However, it also loses part of the compression ratio gain in the process (between 1/4 and 1/3).
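That capped-analysis experiment amounts to scanning only a bounded prefix during the greedy pass. A minimal sketch (names are hypothetical, not zstd identifiers):

```python
# Hypothetical sketch of the follow-up experiment: run the greedy
# statistics pass over at most a 64 KB prefix of the first block.
STATS_ANALYSIS_CAP = 64 * 1024  # cap on bytes scanned for statistics

def analysis_span(block_size: int) -> int:
    """Number of bytes actually scanned by the greedy statistics pass."""
    return min(block_size, STATS_ANALYSIS_CAP)
```

Since the greedy pass cost is roughly proportional to the bytes scanned, a 64 KB cap on a 128 KB block halves the analyzed region, which is consistent with the observed cost reduction of more than half.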
## 2. `btultra2`: modify statistics collection
`btultra2` runs the first block with `btultra`, then throws away the result, but keeps the statistics, and restarts compression of the first block using these baseline statistics.
The second modification changes the statistics transmission logic, rebuilding them from the `seqStore`, instead of simply inheriting the ones which are naturally stored in the `optPtr` as a consequence of running `btultra`.
This results in a small, yet consistently positive, compression ratio gain for `btultra2`, at essentially no cost.
Note that, in this scenario, the first round of `btultra` still uses the default "blind" initialization (same as `dev`), not the `greedy` first pass mentioned in the earlier part.
The benefit though is pretty small, and barely worth mentioning (I'm not detailing benchmarks here), averaging a 0.01% ratio improvement. What makes it desirable is that it comes for free, and is a necessary ingredient for stage 3 to work properly.
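The rebuild-from-`seqStore` idea can be sketched as follows. This is an illustrative Python rendering only, not zstd's actual `seqStore_t` layout or field names: instead of inheriting the running counters accumulated while `btultra` executed, recount the frequency tables directly from the sequences it emitted, so the statistics describe exactly what was stored.

```python
from collections import Counter

def rebuild_stats(sequences, literals):
    """Recount statistics from emitted output (illustrative sketch).

    sequences: list of (lit_len, match_len, offset) tuples, a simplified
               stand-in for zstd's stored sequences.
    literals:  bytes object holding all raw literals emitted.
    """
    lit_freq = Counter(literals)                       # literal byte values
    ll_freq = Counter(ll for ll, _, _ in sequences)    # literal lengths
    ml_freq = Counter(ml for _, ml, _ in sequences)    # match lengths
    of_freq = Counter(of for _, _, of in sequences)    # offsets
    return lit_freq, ll_freq, ml_freq, of_freq
```

The recounted tables then seed the second `btultra2` pass, rather than the counters that accumulated incrementally during the first pass.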
## 3. `btultra2`: initialize statistics from `btultra`, itself initialized by `greedy`
So that's more or less the intended end game of this exercise. Let's see if, with a first round using `greedy`, the statistics produced by `btultra` get better, resulting in a better final compression ratio for `btultra2`.
### 2K blocks, btultra2, level 19
file | newInit | fromGreedy | diff | CR gain % |
---|---|---|---|---|
19#webster | 18258328 | 18249997 | -8331 | 0.05% |
19#mozilla | 22656227 | 22664242 | 8015 | -0.04% |
19#mr | 4162214 | 4184597 | 22383 | -0.54% |
19#dickens | 5307928 | 5309449 | 1521 | -0.03% |
19#x-ray | 6418504 | 6419605 | 1101 | -0.02% |
19#ooffice | 3705876 | 3706233 | 357 | -0.01% |
19#xml | 1563494 | 1564207 | 713 | -0.05% |
19#sao.zs | 4047944 | 4047920 | -24 | 0.00% |
19#samba | 8156414 | 8154485 | -1929 | 0.02% |
19#nci | 5506582 | 5502844 | -3738 | 0.07% |
19#sao | 6066564 | 6065637 | -927 | 0.02% |
19#reymont | 2713590 | 2712485 | -1105 | 0.04% |
19#osdb | 7757177 | 7755582 | -1595 | 0.02% |
total | 96320842 | 96337283 | 16441 | -0.02% |
### 16K blocks, btultra2, level 19
file | newInit | fromGreedy | diff | CR gain % |
---|---|---|---|---|
19#webster | 13651795 | 13659077 | 7282 | -0.05% |
19#mozilla | 18754305 | 18722807 | -31498 | 0.17% |
19#mr | 3666308 | 3666219 | -89 | 0.00% |
19#dickens | 4282713 | 4293499 | 10786 | -0.25% |
19#x-ray | 5895473 | 5891200 | -4273 | 0.07% |
19#ooffice | 3120827 | 3118902 | -1925 | 0.06% |
19#xml | 843970 | 844599 | 629 | -0.07% |
19#sao.zs | 4028682 | 4028694 | 12 | 0.00% |
19#samba | 6002451 | 5994430 | -8021 | 0.13% |
19#nci | 3264463 | 3259468 | -4995 | 0.15% |
19#sao | 5387448 | 5388448 | 1000 | -0.02% |
19#reymont | 2062941 | 2063663 | 722 | -0.03% |
19#osdb | 5473816 | 5474898 | 1082 | -0.02% |
total | 76435192 | 76405904 | -29288 | 0.04% |
### 128K blocks, btultra2, level 19
file | newInit | fromGreedy | diff | CR gain % |
---|---|---|---|---|
19#webster | 11419219 | 11425166 | 5947 | -0.05% |
19#mozilla | 17149792 | 17088589 | -61203 | 0.36% |
19#mr | 3452739 | 3445413 | -7326 | 0.21% |
19#dickens | 3632427 | 3637405 | 4978 | -0.14% |
19#x-ray | 5737386 | 5522651 | -214735 | 3.74% |
19#ooffice | 2877489 | 2874593 | -2896 | 0.10% |
19#xml | 614969 | 614973 | 4 | 0.00% |
19#sao.zs | 4025964 | 4027040 | 1076 | -0.03% |
19#samba | 4916583 | 4890852 | -25731 | 0.52% |
19#nci | 2365886 | 2360953 | -4933 | 0.21% |
19#sao | 5132210 | 5138098 | 5888 | -0.11% |
19#reymont | 1698111 | 1700230 | 2119 | -0.12% |
19#osdb | 3715581 | 3716201 | 620 | -0.02% |
total | 66738356 | 66442164 | -296192 | 0.44% |
So that's interesting. For some reason, this technique doesn't work well for small blocks (2 KB). It does however tend to improve compression ratio for larger blocks, by a measurable margin, reaching an almost 300 KB gain at 128 KB blocks. That's a 0.44% ratio advantage, which is quite a lot for high compression modes.
For this reason, the logic proposed in this PR has a cutoff point at 8 KB block size: below that threshold, start with default statistics; above it, initialize with `greedy`.
Let's be clear though that this is not a universal gain: it depends on the file. In this sample set, the largest gain is achieved by `x-ray` at 128 KB, a staggering +3.7% compression ratio gain, which is almost oversized for a simple alteration of initial statistics. It's followed by `samba` (+0.52%) and `mozilla` (+0.36%), which are pretty respectable.
On the flip side, `dickens` loses -0.14% and `reymont` -0.12%. So this is not universal.
Results at 16 KB are much more tame, with `x-ray` gains reduced to +0.07%, hence barely relevant, while `dickens` losses increase to -0.25%. The total of the 16 KB changes still achieves a compression ratio benefit, but only +0.04%, which is limited.
The oversized `x-ray` gains deserve an explanation. Sadly, I'm reduced to guesses right now. It likely receives a correct hint from the initial statistics built with `greedy`, making it capable of finding a better global path. A classical example would be a path which favors the usage of repcodes, over another path consisting of larger yet more costly matches. This cheaper global path would essentially be rejected due to wrong cost estimation when using the more standard (`dev`) initialization. It adds up over the full block.
However, for such explanation, I would have expected some good results at 16 KB block size too. Why it needs larger block size to really show up is unclear.
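The repcode-vs-far-match trade-off above can be made concrete with a toy cost model. This is illustrative only, not zstd's actual price function: assume offsets cost roughly log2(offset) bits while a repcode costs a small constant, so a shorter repcode match can beat a longer far-offset match on a per-byte basis once costs are estimated correctly.

```python
import math

def match_cost_bits(match_len, offset, is_repcode):
    """Toy price model (not zstd's): repcodes pay a small constant for
    the offset, regular matches pay ~log2(offset) bits."""
    offset_bits = 2.0 if is_repcode else math.log2(offset) + 1
    length_bits = math.log2(match_len) + 1
    return offset_bits + length_bits

rep_cost = match_cost_bits(8, 1, True)          # repcode, 8-byte match
far_cost = match_cost_bits(12, 1 << 17, False)  # 12-byte match, far offset

# Per covered byte, the repcode path is cheaper under this model; a parser
# seeded with accurate statistics can prefer it, while misleading initial
# statistics may cause the same path to be wrongly rejected.
```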
Follow-up investigation: it seems `x-ray` is "a medical picture of a child's hand stored in 12-bit gray scaled image"; maybe it needs a certain size for multiple image lines to be present, hence for repcodes to start making an impact (?). This is highly speculative though: as the image is "large" (8 MB) and compresses poorly, it's difficult to imagine +3.7% compression ratio gains just from repcodes targeting the previous line.
(Side note: any speed cost due to `greedy` initialization is pretty minimal and irrelevant at this compression level.)
### Is this PR ready for review now?

Yes, it can be reviewed. It's just not ready for "merge".