zstd icon indicating copy to clipboard operation
zstd copied to clipboard

Optimal parser edge case: Huffman alone is significantly better than any matches

Open shulib opened this issue 3 years ago • 6 comments

a. I tried to compress with higher lever and got a same result b. If I have few alphabet symbols (less then 64 (6 bits Huffman length) - I took only 2!!) - Huffman code for every symbol takes less the 8 bits even when I have a uniform probability each of them. so, for big enough literal stream we get a better compression then raw (even we add a tree).

Originally posted by @shulib in https://github.com/facebook/zstd/issues/3203#issuecomment-1200412845

shulib avatar Jul 31 '22 12:07 shulib

@shulib can you provide an example file?

terrelln avatar Aug 01 '22 18:08 terrelln

example file data (hexdump): 00000000 02 02 02 02 02 02 02 01 02 02 02 02 02 02 02 02 |................| 00000010 01 01 01 02 01 01 01 01 02 01 01 02 02 01 01 02 |................| 00000020 01 02 02 02 01 02 02 02 01 02 01 01 01 02 01 02 |................| 00000030 01 01 01 02 01 01 02 01 01 01 02 02 01 01 02 01 |................| 00000040 02 02 01 02 02 02 02 01 02 01 02 01 02 01 02 02 |................| 00000050 01 02 02 01 01 02 01 02 02 01 01 02 02 01 01 02 |................| 00000060 01 01 01 02 01 01 02 01 |........| 00000068 when I compress it "manually" I got 00000000 28 b5 2f fd 20 68 a5 00 00 82 06 04 81 01 12 99 |(./. h..........| 00000010 65 ab de 32 12 45 77 99 10 ff fe 01 00 |e..2.Ew......| 0000001d

when I compress it by tool I got: 00000000 28 b5 2f fd 20 68 bd 01 00 a8 02 02 01 02 01 01 |(./. h..........| 00000010 01 02 02 02 02 01 02 02 01 01 02 01 01 02 01 0f |................| 00000020 00 44 65 d0 b9 07 8c 81 59 51 36 84 2a d0 05 a4 |.De.....YQ6.*...| 00000030 1c c4 6c 77 e7 58 65 11 65 c3 6e 0d e0 b0 00 b0 |..lw.Xe.e.n.....| 00000040

shulib avatar Aug 01 '22 20:08 shulib

I cannot drug the file - so I put data.

shulib avatar Aug 01 '22 20:08 shulib

try78__1_2.zip

shulib avatar Aug 01 '22 20:08 shulib

this is the dat try78__1.zip a zip results

shulib avatar Aug 01 '22 20:08 shulib

Thanks for the report @shulib!

This is a case where using Huffman only outperforms zstd. Zstd finds matches, but it is actually better to have no matches.

E.g. with the input file provided test.xxd:

00000000: 0202 0202 0202 0201 0202 0202 0202 0202  ................
00000010: 0101 0102 0101 0101 0201 0102 0201 0102  ................
00000020: 0102 0202 0102 0202 0102 0101 0102 0102  ................
00000030: 0101 0102 0101 0201 0101 0202 0101 0201  ................
00000040: 0202 0102 0202 0201 0201 0201 0201 0202  ................
00000050: 0102 0201 0102 0102 0201 0102 0201 0102  ................
00000060: 0101 0102 0101 0201                      ........

Run these commands:

>  xxd -r < test.xxd | ./zstd -19 --no-check -c | wc -c
54
> xxd -r < test.xxd | ./zstd --fast=100 --no-check --compress-literals -c | wc -c
29

terrelln avatar Dec 20 '22 02:12 terrelln

This corner case should be fixed in #3391

Cyan4973 avatar Dec 22 '22 01:12 Cyan4973