
I'm not 100% sure that block splitter is working properly


Describe the bug

This is not a correctness issue. It simply appears to leave a lot of compression ratio unattained when faced with real-world data.

Side notes:

  • zstd is zstd 1.5.2-3 from archlinux
  • python-zstandard is 0.18.0 using backend_c
  • Since the block splitter is an experimental parameter, I don't think I can enable it separately in the zstd CLI or in the python-zstandard compression params. I am using level 19 on the assumption that it is always on at that level (baseline invocation sketched below).
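
For reference, the python-zstandard numbers below come from plain one-shot compression at level 19, roughly along these lines (a minimal sketch; the exact invocation is in the gists/scripts linked below):

```python
import zstandard

# Baseline: one-shot compression at level 19, letting zstd make all
# block-splitting decisions internally.
with open("input_filename.bin", "rb") as f:
    data = f.read()

cctx = zstandard.ZstdCompressor(level=19)
print(len(cctx.compress(data)))
```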

From https://github.com/facebook/zstd/issues/2832, uncompressed.bin (https://gist.github.com/thatch/853de8089e17462f2e500a7f87c76736):
zstd -19: 179956 bytes
trivial subdivision in Python: 129343 bytes

From https://github.com/facebook/zstd/issues/3014, test.bin (https://gist.github.com/thatch/c2c2cc7304208423543edc38f2b7ed76):
zstd -19: 2688424 bytes
trivial subdivision in Python: 1021806 bytes

Expected behavior

Ratio as good as (or a little better than) my trivial code.

For trivial random input with blocks of known entropy, it does appear to work. I have some evidence that this affects ELF files, which isn't super surprising because they likely have 4k-aligned sections.
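
For example (my own synthetic construction, not one of the files above), on input like the following the splitter behaves as expected, and level 19 compresses it to roughly the size of its incompressible half:

```python
import os
import random
import zstandard

# Alternating 128 KiB blocks of very different entropy. A reasonable
# block splitter should handle each block separately.
random.seed(0)
low_entropy = bytes(random.choice(b"ab") for _ in range(128 * 1024))
high_entropy = os.urandom(128 * 1024)
data = (low_entropy + high_entropy) * 4

cctx = zstandard.ZstdCompressor(level=19)
print(len(cctx.compress(data)))  # close to len(high_entropy) * 4
```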

— thatch, Jul 18 '22, 19:07

Thanks for the report! We know our block splitter is pretty trivial right now, and we hope to find some time to work on it in the future. This case in particular is probably an interaction between block splitting and the optimal parser, which makes decisions based on its estimates of the costs of particular matches. Block splitting changes those costs, but we don't re-parse based on the newly split blocks.
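
As a back-of-the-envelope illustration of how splitting changes estimated costs (order-0 literal entropy only; zstd's real cost model also prices matches, offsets, and sequence headers):

```python
import collections
import math

def entropy_bits(block: bytes) -> float:
    """Order-0 Shannon entropy estimate for a block, in total bits."""
    counts = collections.Counter(block)
    n = len(block)
    return -sum(c * math.log2(c / n) for c in counts.values())

# Uniform bytes followed by a run of zeros: costing it as one block
# vs. two blocks yields different totals, so parser decisions based
# on those estimates will differ too.
data = bytes(range(256)) * 64 + b"\x00" * 16384
whole = entropy_bits(data)
split = entropy_bits(data[: len(data) // 2]) + entropy_bits(data[len(data) // 2 :])
print(whole, split)
```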

> trivial subdivision in python: 1021806 bytes

How do you sub-divide in Python? I want to make sure that we can exactly reproduce both the good and bad cases, so we can investigate what is causing this huge regression.

— terrelln, Jul 25 '22, 18:07

> How do you sub-divide in Python?

Not very easily, unfortunately. See https://gist.github.com/thatch/c2e8aa3d34a7a548687f43813f9cb788

Usage: python compress_subdivide.py -19 input_filename.bin

After reading 128KB, it tries that as a single block, as well as 64+64, 32+32+32+32, etc. (uniformly sized pieces within a 128KB block). In initial testing there appeared to be a single minimum (compressed size looked concave up as a function of piece count), hence the "break", which avoids the worst of the performance penalty.
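
A condensed sketch of that search (my paraphrase of the idea, not the linked gist itself; unlike the gist it resets context per piece and ignores cross-piece backreferences, and the 1KB floor is an arbitrary cutoff):

```python
import zstandard

BLOCK = 128 * 1024

def best_uniform_split(block: bytes, level: int = 19) -> int:
    """Try 1x128K, 2x64K, 4x32K, ... pieces within one 128 KiB block
    and return the smallest total compressed size. Stops at the first
    size increase, assuming a single minimum (concave-up curve)."""
    cctx = zstandard.ZstdCompressor(level=level)
    best = None
    pieces = 1
    while BLOCK // pieces >= 1024:
        size = BLOCK // pieces
        total = sum(
            len(cctx.compress(block[i : i + size]))
            for i in range(0, len(block), size)
        )
        if best is not None and total >= best:
            break  # past the minimum; avoid the worst of the CPU cost
        best = total
        pieces *= 2
    return best
```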

Note that the script will never choose something like 32+64+32, or 32+128+... (altering the stride), which are other potential wins if the determination is cheap enough (I think you might do the former, but not the latter). Because I'm trying to keep accurate estimates of what each block would take to compress, including the use of backreferences, I have to prime the compressor with a fair amount of data, which must be recompressed after every context reset.

If there were an API to insert an already-compressed block/window worth of data, or to copy a context, scripts like this would be a lot more performant. But I think in your shoes I'd say "zstd should be able to at least match this internally" and keep the simple public API.
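
For illustration, one way to approximate that priming with the existing python-zstandard API is a raw-content dictionary built from the preceding window (a sketch, not the approach in the gist; it lets matches reach back into the prior data but doesn't carry over full entropy-table state, and the 128 KiB window is my own choice):

```python
import zstandard

def compress_with_history(prior: bytes, block: bytes, level: int = 19) -> bytes:
    """Compress `block` roughly as if `prior` preceded it in the same
    window, via a raw-content dictionary."""
    if not prior:
        return zstandard.ZstdCompressor(level=level).compress(block)
    hist = zstandard.ZstdCompressionDict(
        prior[-128 * 1024 :], dict_type=zstandard.DICT_TYPE_RAWCONTENT
    )
    cctx = zstandard.ZstdCompressor(level=level, dict_data=hist)
    return cctx.compress(block)
```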

— thatch, Jul 25 '22, 19:07

I'm currently revisiting this topic. After investigation, the main reason for the substantial compression ratio differences reported in this issue, for both https://github.com/facebook/zstd/issues/2832 and https://github.com/facebook/zstd/issues/3014, is not block splitting per se, but an indirect side effect.

The ratio difference comes from subtle reinforcement effects in the optimal parser's cost evaluation function. Minuscule differences in initial conditions can push the cost function in one direction or the other. Splitting blocks differently can likewise push the evaluation one way or another, which is why trying multiple size combinations increases the odds of landing in the better direction (at a CPU cost, obviously).

The situation was improved in v1.5.6, and the compression ratio is now more stable and closer to the best-case scenario. Using the samples mentioned above, uncompressed.bin now compresses to 131,410 bytes, and test.bin to 831,424 bytes, which is in line with or better than the results reported here using the Python multi-split-compare script. These results were achieved without modifying the current block splitting function.

That doesn't mean the block splitting function is perfect, but improvements to it are generally expected to be in the ~1% range, except in unusually catastrophic scenarios.

— Cyan4973, Jun 24 '24, 19:06