
question: FSE doesn't always compress better than huf - expected?

Open adamdmoss opened this issue 5 years ago • 4 comments

I can't share the data, but in short, it's 9108363519 bytes (~9GB) of almost-uncompressible data (IIRC it's the 9GB tail of a larger already-compressed stream).

```
% ./fse -e ./almost-uncompressable
Compressed 9108363519 bytes into 8992064047 bytes ==> 98.72%
% ./fse -h ./almost-uncompressable
Compressed 9108363519 bytes into 8943423537 bytes ==> 98.19%
% ./fse -z ./almost-uncompressable
Compressed 9108363519 bytes into 8944678105 bytes ==> 98.20%
```

Granted, I don't know the intimate details of FSE and this is a near-pathological case, but I'd have expected the two Huffman implementations to fare rather worse than FSE on this almost-but-not-quite-uniform distribution of data.

Am I wrong?

Cheers.

adamdmoss avatar Jul 17 '20 21:07 adamdmoss

This can happen.

The difference cannot be "huge", as it's essentially concentrated in the header format: FSE needs more information than Huffman, so its header is larger.

After that, both implementations should provide similar compression ratios, except when one or more symbols feature a "dominant" probability (>10%), in which case FSE starts to produce some benefit, which grows as the most dominant symbol takes a larger share (though Huffman nullifies this advantage when a dominant symbol represents exactly 25% or 50%).

In this case, I see a file with a very poor compression ratio. It follows that none of the symbols (byte values) is dominant, in which case Huffman and FSE will produce similar compression ratios, with FSE slightly disadvantaged by its heavier header.
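The dominant-symbol argument can be checked numerically. The sketch below (illustrative only, not FSE code) compares the Shannon entropy of a distribution, which an FSE/arithmetic-style coder approaches, with the average code length of an optimal Huffman code, which is restricted to whole bits per symbol. The example distributions are made up for illustration: a near-uniform one (like the almost-incompressible file above) and one with a dominant symbol whose probability is not a power of two.

```python
import heapq
import itertools
import math

def shannon_entropy(probs):
    """Ideal bits/symbol: the limit an FSE/arithmetic-style coder approaches."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def huffman_avg_length(probs):
    """Average bits/symbol of an optimal Huffman (prefix) code.

    Uses the identity: the sum of all merged-node weights in the Huffman
    tree equals the expected code length.
    """
    tiebreak = itertools.count()  # unique ids so heapq never compares equal probs
    heap = [(p, next(tiebreak)) for p in probs]
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        p1, _ = heapq.heappop(heap)
        p2, _ = heapq.heappop(heap)
        total += p1 + p2  # each merge adds one bit to every symbol below it
        heapq.heappush(heap, (p1 + p2, next(tiebreak)))
    return total

# Near-uniform: Huffman needs 2.0000 bits/symbol, entropy is ~1.9994,
# so the gap is negligible -- FSE's only difference is its heavier header.
near_uniform = [0.26, 0.25, 0.25, 0.24]
# Dominant symbol at 40% (not a power of two): Huffman needs 2.2 bits/symbol
# against an entropy of ~2.12, so fractional-bit coding starts to pay off.
dominant = [0.40, 0.20, 0.20, 0.10, 0.10]

for name, p in [("near-uniform", near_uniform), ("dominant", dominant)]:
    print(f"{name}: entropy={shannon_entropy(p):.4f} "
          f"huffman={huffman_avg_length(p):.4f} bits/symbol")
```

Note how Huffman's loss disappears when probabilities are (near-)uniform or exact negative powers of two, which matches the 25%/50% remark above.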

Cyan4973 avatar Jul 17 '20 22:07 Cyan4973

Thanks again for the explanation - makes sense! Didn't realize the FSE header was nontrivial.

adamdmoss avatar Jul 17 '20 22:07 adamdmoss