[edge case] Zstd performs badly on 200-symbol uniform data
Data generated by this script:
import random
rd = random.Random()
rd.seed(0)
HIGH_ENTROPY = bytes(rd.randint(0, 200) for _ in range(10_000_000)) * 10  # 10 MB of near-uniform bytes, repeated 10x = 100 MB
with open("med.bin", "wb") as f:
    f.write(HIGH_ENTROPY)
gzip -1: 100000000 -> 96526120
zstd -1: 100000000 -> 100002299
If I remove these heuristics:
https://github.com/facebook/zstd/blob/b7b7edb3a3017ac8e16d7eb2dbede45168560c58/lib/compress/huf_compress.c#L1297
https://github.com/facebook/zstd/blob/b7b7edb3a3017ac8e16d7eb2dbede45168560c58/lib/compress/huf_compress.c#L1303
We get:
zstd -1: 100000000 -> 96449637
Zstd should do a better job of determining compressibility so it doesn't lose out on cases like this.
I think we should be able to remove these heuristics and replace them with an actual Shannon entropy estimate. Since I already have a fairly precise and fast estimator, we can try it here. I believe the runtime shouldn't change much, given that we already calculate the histogram and it's in cache.
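To make the idea concrete, here is a rough Python sketch (the function names and the threshold are mine for illustration, not zstd's code) of deciding compressibility from the existing histogram via an order-0 Shannon entropy estimate instead of fixed cut-offs:

import math
from collections import Counter

def shannon_entropy_bits_per_symbol(counts, total):
    # Order-0 Shannon entropy: H = -sum(p * log2(p)) over the symbol histogram.
    h = 0.0
    for count in counts:
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def estimated_entropy_coding_gain(data):
    # Rough lower bound on the output of an order-0 entropy coder:
    # total_symbols * H / 8 bytes, ignoring header/table overhead.
    hist = Counter(data)
    total = len(data)
    h = shannon_entropy_bits_per_symbol(hist.values(), total)
    return total - total * h / 8

def worth_entropy_coding(data, min_gain_fraction=0.02):
    # Entropy-based decision instead of fixed size cut-offs.
    # The 2% threshold is an arbitrary placeholder, not zstd's value.
    return estimated_entropy_coding_gain(data) > min_gain_fraction * len(data)

For the data above, H is close to log2(201) ≈ 7.65 bits/symbol, so an order-0 coder can still save a few percent, which is consistent with the ~96.45 MB result after removing the heuristics.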
Aside: this is actually 201-symbol data, since random.randint is inclusive of both endpoints.