Deterministic compression

Open BenMcLean opened this issue 1 year ago • 8 comments
See this article here: https://dramsch.net/today-i-learned/gzip/today-i-learned-about-deterministic-gzip-compression/

I'd like to do the same thing but it appears EasyCompressor doesn't expose the necessary options to make GZip deterministic.

After looking into this a bit more, it appears that none of the EasyCompressor formats expose the necessary options to make them deterministic. We should be able to set the timestamp to 0 like in that article and get the exact same compressed output for the same decompressed input. If there's randomness, we should be able to control the seed.
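For reference, here is a minimal sketch of what pinning the timestamp looks like in Python's standard library. It is an illustration of the idea, not an EasyCompressor API. The gzip format (RFC 1952) reserves four header bytes for a modification time, and `gzip.compress` accepts an `mtime` argument from Python 3.8 onward. The helper name `gzip_deterministic` and the sample payload are made up for this example.

```python
import gzip
import hashlib


def gzip_deterministic(data: bytes) -> bytes:
    """Compress data with the gzip header timestamp fixed to 0.

    Hypothetical helper for illustration; without mtime=0, Python
    writes the current time into the header on every call.
    """
    return gzip.compress(data, mtime=0)


payload = b"same input, same output? " * 100

first = gzip_deterministic(payload)
second = gzip_deterministic(payload)

# Bytes 4..7 of a gzip stream hold MTIME (little-endian seconds since epoch).
print("mtime bytes:", first[4:8].hex())
print("identical output:", first == second)
print("sha256:", hashlib.sha256(first).hexdigest())
```

This only shows repeatability within one runtime and one compression library build; it makes no promise about byte-identical output across platforms.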

BenMcLean avatar Nov 08 '24 13:11 BenMcLean

That's not relevant to this library. The -n parameter of the gzip command tells it not to store the timestamp of the original file, so it only matters when compressing files. This library compresses and decompresses data (such as a byte[] or a stream), not files, and that data has no timestamp.

mjebrahimi avatar Nov 11 '24 08:11 mjebrahimi

OK well, I found in practice that EasyCompressor output is non-deterministic. Can any of it be made deterministic?

BenMcLean avatar Nov 11 '24 14:11 BenMcLean

Actually, it is deterministic out of the box. This library works with data (not files), so there is no timestamp to include and the output is always deterministic. For example, if you compress the same unchanged data many times, the compressed outputs (and their hashes) will be identical.

mjebrahimi avatar Nov 11 '24 14:11 mjebrahimi

Oh, I think I know what happened.

I ran one test on Blazor WASM and another on Windows and got different results.

Maybe it has something to do with the platform.

BenMcLean avatar Nov 11 '24 21:11 BenMcLean

I reproduced your example with different compressors and found a weird difference in the GZip compressed output between server-side .NET and client-side .NET (Blazor WASM).

Brotli is not supported on Blazor WASM, and the other algorithms (Deflate, LZ4, LZMA, Zstd, and Snappy) work fine, producing the same output on server and browser.

The GZip compressed output (and therefore its hash) differs between server and client. However, the decompressed data still equals the original data.
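The observation above suggests that a cross-platform test should compare decompressed data rather than compressed bytes. Here is a rough Python sketch of that idea. Two compression levels stand in for "two platforms whose encoders make different choices"; that substitution is an assumption for illustration, not a claim about what .NET does internally. The names `sha256_hex`, `stream_a`, and `stream_b` are hypothetical.

```python
import gzip
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


original = b"The quick brown fox jumps over the lazy dog. " * 50

# Two valid gzip streams of the same input. Different levels are used here
# only as a stand-in for encoders that behave differently per platform.
stream_a = gzip.compress(original, compresslevel=1, mtime=0)
stream_b = gzip.compress(original, compresslevel=9, mtime=0)

# Comparing compressed bytes is fragile: the streams will typically differ.
compressed_hashes_match = sha256_hex(stream_a) == sha256_hex(stream_b)

# Comparing round-tripped data is robust: both must decode to the original.
roundtrip_ok = (
    sha256_hex(gzip.decompress(stream_a))
    == sha256_hex(gzip.decompress(stream_b))
    == sha256_hex(original)
)

print("compressed hashes match:", compressed_hashes_match)
print("round trip matches original:", roundtrip_ok)
```

A unit test built on the round-trip check passes regardless of which runtime produced the stream, as long as the stream is valid gzip.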

I need to investigate further to find out whether it's an implementation mistake in this library or a bug in the .NET runtime.
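One way to narrow this down is to check whether two outputs differ in the fixed 10-byte gzip header or only in the deflate body. The field layout below follows RFC 1952 (magic, compression method, flags, MTIME, XFL, OS). The thread does not establish where the .NET outputs actually differ, so this Python sketch is only a diagnostic aid; `GzipHeader`, `parse_gzip_header`, and `header_differences` are hypothetical names.

```python
import gzip
import struct
from typing import Dict, NamedTuple, Tuple


class GzipHeader(NamedTuple):
    magic: bytes       # ID1, ID2: always 0x1f 0x8b
    method: int        # CM: 8 means deflate
    flags: int         # FLG: optional-field bits
    mtime: int         # MTIME: seconds since epoch, 0 if unset
    extra_flags: int   # XFL: encoder hint (e.g. fastest / best)
    os_id: int         # OS: identifier of the producing system


def parse_gzip_header(stream: bytes) -> GzipHeader:
    """Parse the fixed 10-byte header at the start of a gzip stream."""
    if len(stream) < 10:
        raise ValueError("a gzip stream starts with a 10-byte header")
    fields = struct.unpack("<2sBBIBB", stream[:10])
    if fields[0] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    return GzipHeader(*fields)


def header_differences(a: bytes, b: bytes) -> Dict[str, Tuple[object, object]]:
    """Return the header fields whose values differ between two streams."""
    ha, hb = parse_gzip_header(a), parse_gzip_header(b)
    return {
        name: (getattr(ha, name), getattr(hb, name))
        for name in GzipHeader._fields
        if getattr(ha, name) != getattr(hb, name)
    }


sample = gzip.compress(b"abc", mtime=0)
stamped = gzip.compress(b"abc", mtime=12345)
print(parse_gzip_header(sample))
print(header_differences(sample, stamped))
```

If `header_differences` comes back empty for the server and browser outputs, the difference lies in the compressed body or trailer rather than in header metadata.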

The Repo: https://github.com/mjebrahimi/BlazorWebAssembly-GZip-Difference

mjebrahimi avatar Nov 13 '24 14:11 mjebrahimi

Yeah sorry I didn't realize I'd actually done the two tests on different runtimes when I made the initial post. Not a big deal: it just explains why my unit test failed. Thanks. :)

BenMcLean avatar Nov 13 '24 15:11 BenMcLean

You're welcome. Anyway, it's an interesting problem you found, and I'll post an update in a few days after investigating further.

mjebrahimi avatar Nov 13 '24 15:11 mjebrahimi

It also seems to affect System.IO.Compression on .NET Standard 2.0, so apparently it isn't specific to EasyCompressor.

BenMcLean avatar Nov 13 '24 16:11 BenMcLean