
Multiple regressions (v1.5.6 -> v1.5.7 most significant)

iczelia opened this issue 1 month ago • 2 comments

Dear All,

Take https://github.com/KirillKryukov/naf/, a pretty straightforward use of zstandard to compress nucleotide data. My test file is GCA_004837865.1.fna, around 500 MB; it can be found in a couple of places across the internet. The issue is only noticeable at high compression levels. I timed runs of ennaf/ennaf --temp-dir . -19 GCA_004837865.1.fna overnight, checking out every zstd version between v1.5.0 and v1.5.7 and rebuilding from scratch each time; the zstd version is the only thing I changed.

I observed this:

v1.5.0: 85.98s average, 97141591 bytes.
v1.5.1: 85.50s average, 97204512 bytes.
v1.5.2: 86.49s average, 97204512 bytes.
v1.5.4: 87.76s average, 97204987 bytes.
v1.5.5: 87.20s average, 97204987 bytes.
v1.5.6: 88.63s average, 97941622 bytes.
v1.5.7: 88.73s average, 98103951 bytes.

Given the original size (after 4-bit coding) of 249586605 bytes, this is a steady regression from 3.113 bits per byte to 3.144 bits per byte, along with a statistically significant ~3% slowdown.
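
(For reference, bits per byte here is compressed size × 8 over the 4-bit-coded size: 97141591 × 8 / 249586605 ≈ 3.113 for v1.5.0, and 98103951 × 8 / 249586605 ≈ 3.144 for v1.5.7.)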

I have not pinpointed the sources of these regressions; there appear to be several.

iczelia • Dec 11 '25 21:12

It's not simple to analyze. These algorithms are full of heuristics. They tend to work fine on some datasets, but can backfire on other ones. In most cases, there isn't a universally better choice, just some "generally better" ones.

One could argue that the largest regression appeared during the 1.5.5 -> 1.5.6 transition (~740 KB). v1.5.6 introduced a big improvement for 32-bit structures (https://github.com/facebook/zstd/pull/3895). Once again, this was good for the targeted structures, but possibly bad for others. At the time that PR was written, no counterexample was found, so it felt relatively safe to land. It could be that the intermediate naf format is such a counterexample.

1.5.6 -> 1.5.7 introduced a block splitter, which is also generally positive, though its impact is quite small at level 19. It's a good fit when statistics vary within the file, and not useful when statistics remain relatively homogeneous across the entire document. The algorithm should detect when it's not useful and disable itself, resulting in very small losses in the "bad" cases. In this specific sample, the loss is roughly 160 KB, or about 0.16%. I admit that's on the higher side, though not completely unreasonable.
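
If the stream that ennaf hands to zstd can be dumped to a file, one way to measure the splitter's contribution would be to compress that buffer at level 19 twice, once with the splitter left on automatic and once forced off. Below is a minimal sketch using the advanced API; it assumes ZSTD_c_useBlockSplitter (an experimental parameter, hence ZSTD_STATIC_LINKING_ONLY and static linking) also governs the new splitter, and the dumped-file handling is only illustrative:

```c
/* A/B sketch: compress the same buffer at level 19, once with the block
 * splitter on automatic and once forced off, then compare sizes.
 * Assumption: the intermediate stream ennaf feeds to zstd has been dumped
 * to the file given as argv[1]. Error handling is trimmed for brevity. */
#define ZSTD_STATIC_LINKING_ONLY   /* needed for experimental parameters */
#include <zstd.h>
#include <stdio.h>
#include <stdlib.h>

static size_t compress_once(const void *src, size_t srcSize, int splitMode)
{
    ZSTD_CCtx *cctx = ZSTD_createCCtx();
    size_t dstCap = ZSTD_compressBound(srcSize);
    void *dst = malloc(dstCap);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 19);
    ZSTD_CCtx_setParameter(cctx, ZSTD_c_useBlockSplitter, splitMode);
    size_t cSize = ZSTD_compress2(cctx, dst, dstCap, src, srcSize);
    free(dst);
    ZSTD_freeCCtx(cctx);
    return ZSTD_isError(cSize) ? 0 : cSize;
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <dumped-intermediate-file>\n", argv[0]);
        return 1;
    }
    FILE *f = fopen(argv[1], "rb");
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);
    void *src = malloc((size_t)size);
    fread(src, 1, (size_t)size, f);
    fclose(f);

    printf("splitter auto    : %zu bytes\n",
           compress_once(src, (size_t)size, ZSTD_ps_auto));
    printf("splitter disabled: %zu bytes\n",
           compress_once(src, (size_t)size, ZSTD_ps_disable));
    free(src);
    return 0;
}
```

Running the same binary linked against v1.5.6 and v1.5.7 would show how much of the ~160 KB delta disappears when the splitter is disabled.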

The better approach would be to analyze the data itself to understand its pattern, but ennaf doesn't expose the intermediate stream that it feeds to zstd, which makes such an analysis difficult.

Cyan4973 • Dec 11 '25 22:12

@Cyan4973

It's not simple to analyze. These algorithms are full of heuristics.

Most certainly, I agree. However, the specifics of the problem at hand make it much more delicate. One issue is that one could be tempted to pin an old version, given its objective superiority in measurable characteristics, and thereby expose oneself to all sorts of attacks: there are only two types of codecs, the safe ones and the fast ones.

The loss of one megabyte matters, because the difference between zstd -19 and zstd -1 on this 500-megabyte file is only about 9 megabytes. Of course, one of them is two orders of magnitude faster than the other. The tradeoffs start to blur, and a couple of version bumps later it may turn out that zstandard isn't such a good choice anymore.
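
(For what it's worth, zstd's built-in benchmark mode can show the whole tradeoff curve on this data in one run, something like zstd -b1 -e19 <file>, which reports size and speed for every level in the range, if I have the flags right.)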

iczelia • Dec 11 '25 23:12