zstd
zstd copied to clipboard
Optimal parser edge case: Huffman alone is significantly better than any matches
a. I tried to compress with higher lever and got a same result b. If I have few alphabet symbols (less then 64 (6 bits Huffman length) - I took only 2!!) - Huffman code for every symbol takes less the 8 bits even when I have a uniform probability each of them. so, for big enough literal stream we get a better compression then raw (even we add a tree).
Originally posted by @shulib in https://github.com/facebook/zstd/issues/3203#issuecomment-1200412845
@shulib can you provide an example file?
example file data (hexdump): 00000000 02 02 02 02 02 02 02 01 02 02 02 02 02 02 02 02 |................| 00000010 01 01 01 02 01 01 01 01 02 01 01 02 02 01 01 02 |................| 00000020 01 02 02 02 01 02 02 02 01 02 01 01 01 02 01 02 |................| 00000030 01 01 01 02 01 01 02 01 01 01 02 02 01 01 02 01 |................| 00000040 02 02 01 02 02 02 02 01 02 01 02 01 02 01 02 02 |................| 00000050 01 02 02 01 01 02 01 02 02 01 01 02 02 01 01 02 |................| 00000060 01 01 01 02 01 01 02 01 |........| 00000068 when I compress it "manually" I got 00000000 28 b5 2f fd 20 68 a5 00 00 82 06 04 81 01 12 99 |(./. h..........| 00000010 65 ab de 32 12 45 77 99 10 ff fe 01 00 |e..2.Ew......| 0000001d
when I compress it by tool I got: 00000000 28 b5 2f fd 20 68 bd 01 00 a8 02 02 01 02 01 01 |(./. h..........| 00000010 01 02 02 02 02 01 02 02 01 01 02 01 01 02 01 0f |................| 00000020 00 44 65 d0 b9 07 8c 81 59 51 36 84 2a d0 05 a4 |.De.....YQ6.*...| 00000030 1c c4 6c 77 e7 58 65 11 65 c3 6e 0d e0 b0 00 b0 |..lw.Xe.e.n.....| 00000040
I cannot drug the file - so I put data.
this is the dat try78__1.zip a zip results
Thanks for the report @shulib!
This is a case where using Huffman only outperforms zstd. Zstd finds matches, but it is actually better to have no matches.
E.g. with the input file provided test.xxd:
00000000: 0202 0202 0202 0201 0202 0202 0202 0202 ................
00000010: 0101 0102 0101 0101 0201 0102 0201 0102 ................
00000020: 0102 0202 0102 0202 0102 0101 0102 0102 ................
00000030: 0101 0102 0101 0201 0101 0202 0101 0201 ................
00000040: 0202 0102 0202 0201 0201 0201 0201 0202 ................
00000050: 0102 0201 0102 0102 0201 0102 0201 0102 ................
00000060: 0101 0102 0101 0201 ........
Run these commands:
> xxd -r < test.xxd | ./zstd -19 --no-check -c | wc -c
54
> xxd -r < test.xxd | ./zstd --fast=100 --no-check --compress-literals -c | wc -c
29
This corner case should be fixed in #3391