zstd icon indicating copy to clipboard operation
zstd copied to clipboard

spec: 0-sized compressed blocks permitted by spec

Open klauspost opened this issue 3 years ago • 9 comments

Describe the bug

The current decompression code fails if the remainder of a block is sent to the literal decoder with only 2 bytes.

With 0 literals and 0 sequences that condition can be met.

While it is easy to argue that a block with such encoding is pointless (it should just be a raw block), I don't see any mention of it in the spec, and as far as I can tell it is conforming to spec. Ofc. I might have missed it.

Example: 2274d31e0d569fe9e31bedc4f9fddd9c9f114c2f.zst.gz

klauspost avatar Mar 09 '22 13:03 klauspost

Interesting, we have a harness test program which is supposed to generate all sorts of valid blocks, and is a good tool to validate compliance of decoders. It should have been able to generate such empty block. We probably need to update this test program if it doesn't.

Cyan4973 avatar Mar 09 '22 15:03 Cyan4973

Thanks for the report!

We'll fix this bug in the next release.

We've had other bugs in the decoder that rejected valid frames. I will put up a PR shortly with a document that lists all the known cases where we've rejected a valid frame, an example frame, and the version we fixed the issue.

terrelln avatar Mar 09 '22 19:03 terrelln

Actually, this block used to be invalid. Older versions of the spec had this clause:

A Compressed_Block has the extra restriction that Block_Size is always strictly less than the decompressed size. If this condition cannot be respected, the block must be sent uncompressed instead (Raw_Block).

We removed that clause because the decoder didn't enforce it. And because it made certain topics (namely block splitting) a lot simpler. Since there are cases where you'd want to emit a small block that had large headers, but you'd then re-use those headers in later blocks to amortize the cost.

However, it looks like we missed the edge case where compressed block is 0 sized.

terrelln avatar Mar 09 '22 19:03 terrelln

Actually, this block used to be invalid.

That explains why it wasn't part of harness. Indeed, subtle consequence of a small change !

Cyan4973 avatar Mar 09 '22 19:03 Cyan4973

This frame is the simplest test case. The only difference from the one you provided is that yours has the Unused_Bit of the frame header set to 1.

28b5 2ffd 2000 1500 0000 00

terrelln avatar Mar 09 '22 20:03 terrelln

The one liner fix is to change MIN_CBLOCK_SIZE from 3 to 2.

However, that definitely breaks some bounds checks in ZSTD_decodeLiteralsBlock(), and maybe elsewhere. So more care is needed.

terrelln avatar Mar 09 '22 20:03 terrelln

See https://github.com/facebook/zstd/pull/3092 for the decompressor errata document.

terrelln avatar Mar 09 '22 22:03 terrelln

@terrelln MIN_CBLOCK_SIZE = 3 is correct for blocks, but for a compressed literal section it isn't, since at least one byte has already been read from the block.

So I think MIN_CBLOCK_SIZE should be kept at 3, and either MIN_CBLOCK_SIZE - 1 or a new MIN_LITSEC_SIZE = 2 should be added.

@Cyan4973 Yeah, the input is from fuzz testing, so I am not surprised there are some wonky bits in there :)

klauspost avatar Mar 10 '22 10:03 klauspost

@klauspost MIN_CBLOCK_SIZE isn't supposed to be including the block header (at least in the way it is used in the literals section decoding). In this case the block size is 2 bytes. 1 byte for the literals section, and 1 byte for the sequences section.

terrelln avatar Mar 12 '22 03:03 terrelln