[edge case] Zstd performs badly on 200-symbol uniform data
Data generated by this script:
import random
rd = random.Random()
rd.seed(0)
HIGH_ENTROPY = bytes(rd.randint(0, 200) for _ in range(10_000_000)) * 10  # 10 MB of near-uniform bytes, repeated 10x = 100 MB
with open("med.bin", "wb") as f:
    f.write(HIGH_ENTROPY)
gzip -1: 100000000 -> 96526120
zstd -1: 100000000 -> 100002299
If I remove these heuristics:
https://github.com/facebook/zstd/blob/b7b7edb3a3017ac8e16d7eb2dbede45168560c58/lib/compress/huf_compress.c#L1297
https://github.com/facebook/zstd/blob/b7b7edb3a3017ac8e16d7eb2dbede45168560c58/lib/compress/huf_compress.c#L1303
We get:
zstd -1: 100000000 -> 96449637
Zstd should do a better job of determining compressibility so it doesn't lose out on cases like this.
I think we should be able to remove these heuristics and replace them with an actual Shannon entropy estimate. Since I already have a fairly precise and fast estimator, we can try it here. I believe the runtime shouldn't change much, given that we already calculate the histogram and it's in cache.
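To make the idea concrete, here is a rough Python sketch (the function names and the threshold are mine for illustration, not zstd's code) of deciding compressibility from the existing histogram via an order-0 Shannon entropy estimate instead of fixed cut-offs:

import math
from collections import Counter

def shannon_entropy_bits_per_symbol(counts, total):
    # Order-0 Shannon entropy: H = -sum(p * log2(p)) over the symbol histogram.
    h = 0.0
    for count in counts:
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def estimated_entropy_coding_gain(data):
    # Rough lower bound on the output of an order-0 entropy coder:
    # total_symbols * H / 8 bytes, ignoring header/table overhead.
    hist = Counter(data)
    total = len(data)
    h = shannon_entropy_bits_per_symbol(hist.values(), total)
    return total - total * h / 8

def worth_entropy_coding(data, min_gain_fraction=0.02):
    # Entropy-based decision instead of fixed size cut-offs.
    # The 2% threshold is an arbitrary placeholder, not zstd's value.
    return estimated_entropy_coding_gain(data) > min_gain_fraction * len(data)

For the data above, H is close to log2(201) ≈ 7.65 bits/symbol, so an order-0 coder can still save a few percent, which is consistent with the ~96.45 MB result after removing the heuristics.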
Aside: this is actually 201-symbol data, since random.randint is inclusive of both endpoints.