
How to achieve better compression on 1MB blocks by re-using compression context

Open scherepanov opened this issue 2 years ago • 2 comments

I am compressing a 1 GB JSON file with the zstd command and getting a 27.55 MB file (2.75% compression ratio - very impressive!!!!).

When I compress the same file, breaking the input into 1MB blocks and compressing each block individually into its own frame, I get a 41.7 MB file - around 50% worse compression than the zstd command-line tool.

Increasing the block size to 10MB brings the compression ratio closer to the original, but it is still far off.

I suspect the zstd utility is building a dictionary on a very large data sample, much larger than 10MB, and that gives it a much better compression ratio.

The obvious thing to try is reusing the compression context, in the hope that the dictionary is carried over between invocations of ZSTD_compress.

But the code (v1.5.2) explicitly says this is not going to work:

Note : re-using context is just a speed / resource optimization. It doesn't change the compression ratio, which remains identical.

How can I achieve the same compression ratio as on the large file, while still compressing individual 1MB blocks into their own frames?
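
The per-block scheme looks roughly like this (a simplified sketch, not my actual code - block size, compression level and I/O handling are illustrative):

```c
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

#define BLOCK_SIZE (1u << 20)   /* 1MB blocks, as described above */

/* Compress src into independent 1MB frames, reusing one ZSTD_CCtx. */
static size_t compress_blocks(const void* src, size_t srcSize, FILE* out, int level)
{
    ZSTD_CCtx* cctx = ZSTD_createCCtx();
    size_t const dstCap = ZSTD_compressBound(BLOCK_SIZE);
    void* dst = malloc(dstCap);
    size_t total = 0;

    for (size_t off = 0; off < srcSize; off += BLOCK_SIZE) {
        size_t const blockSize = (srcSize - off < BLOCK_SIZE) ? (srcSize - off) : BLOCK_SIZE;
        /* Each call produces a complete frame, decompressible on its own. */
        size_t const cSize = ZSTD_compressCCtx(cctx, dst, dstCap,
                                               (const char*)src + off, blockSize, level);
        if (ZSTD_isError(cSize)) { fprintf(stderr, "%s\n", ZSTD_getErrorName(cSize)); break; }
        fwrite(dst, 1, cSize, out);
        total += cSize;
    }

    free(dst);
    ZSTD_freeCCtx(cctx);
    return total;
}
```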

scherepanov avatar May 22 '22 19:05 scherepanov

How can I achieve the same compression ratio as on the large file, while still compressing individual 1MB blocks into their own frames?

Generally speaking, it's not possible. The problem is that, when starting a new frame, the history restarts from zero, so it takes time to build up efficiency.

I personally believe that a 4.17% ratio at 1 MB blocks is pretty good, and would call it a day.

Now if you want a compression ratio a bit closer to single-frame, you might be tempted by dictionaries. This is a way to start a frame from a "non-zero" history. Of course, to be useful, such history should be relevant. You could either use the embedded dictionary generator (zstd --train), or use the first block as a kind of "universal prefix" for all other blocks. Don't expect the compression ratio to completely "catch up" with single-frame compression, but it will certainly narrow the gap.
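
For illustration, the prefix idea looks roughly like this with the C API (a sketch only - names and error handling are placeholders; note that a prefix reference is consumed by each frame, so it has to be attached again before every compression):

```c
#include <zstd.h>

/* Compress one block into its own frame, using a shared prefix
 * (e.g. the first block of the file) as starting history.
 * The prefix reference is cleared after each frame, so it must be
 * re-attached before every compression call. */
static size_t compress_block_with_prefix(ZSTD_CCtx* cctx,
                                         const void* prefix, size_t prefixSize,
                                         const void* block, size_t blockSize,
                                         void* dst, size_t dstCapacity)
{
    size_t const r = ZSTD_CCtx_refPrefix(cctx, prefix, prefixSize);
    if (ZSTD_isError(r)) return r;
    return ZSTD_compress2(cctx, dst, dstCapacity, block, blockSize);
}
```

Decompression needs the same prefix: reference it with ZSTD_DCtx_refPrefix() on the ZSTD_DCtx before decompressing each frame.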

Cyan4973 avatar May 23 '22 04:05 Cyan4973

Thank you very much for the explanation!

It is not clear from your explanation what "it takes time to build up efficiency" means - is the issue the size of the dictionary (a dictionary has to be stored in each frame and takes space), or that a dictionary is not optimal when built on a 1MB block?

If dictionary size is the reason for the weaker compression, you really cannot do anything - the dictionary has to be saved in every frame and takes space. The exception is to train a dictionary and save it in a very first, special block (a very reasonable option, thank you for suggesting it).

But if an inefficient dictionary is the problem, I can train a new dictionary on the whole file before compressing. I still want each frame to be independent - it should be possible to decompress an individual frame without any other info.
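
Something like the following is what I have in mind (just a sketch assuming the zdict API; sample layout, sizes and error handling are placeholders):

```c
#include <zstd.h>
#include <zdict.h>

/* Train a dictionary from sample blocks of the whole file
 * (samples concatenated back to back, with their sizes listed). */
static size_t train_dictionary(void* dictBuf, size_t dictCap,
                               const void* samples, const size_t* sampleSizes,
                               unsigned nbSamples)
{
    size_t const dictSize = ZDICT_trainFromBuffer(dictBuf, dictCap,
                                                  samples, sampleSizes, nbSamples);
    return dictSize;   /* check with ZDICT_isError() */
}

/* Compress one 1MB block into an independent frame using that dictionary.
 * Any frame can later be decompressed on its own with
 * ZSTD_decompress_usingDict() and the same dictionary. */
static size_t compress_block(ZSTD_CCtx* cctx, const void* dict, size_t dictSize,
                             const void* block, size_t blockSize,
                             void* dst, size_t dstCap, int level)
{
    return ZSTD_compress_usingDict(cctx, dst, dstCap, block, blockSize,
                                   dict, dictSize, level);
}
```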

A 4.17% compression ratio is an extremely good deal for real-time data. But then I need to archive the data, and there size matters - and I do have unlimited time and CPU to compress the data better.

FYI, I have a proprietary compression format based on lz4 and zstd framing. I insert a transport frame between data frames with metadata (uncompressed offset, compressed/uncompressed frame size, frame number, compression type of the data). My library allows fast search by offset in the compressed file without decompressing. That goes somewhat in the direction of the seekable format you have, though it lacks extensibility. I also have a bunch of other functionality targeting our data - that custom extension makes it extremely useful. Unfortunately, I cannot publish the code due to company policy.

Zstd/LZ4 rocks!!!

scherepanov avatar May 23 '22 12:05 scherepanov