issues found while fuzzing to compare behaviour of Zig decoder to this repo
We've been doing some fuzz tests of a Zig implementation of Zstandard decoding we're working on, comparing its behaviour with this repo (often referred to as 'the reference' below). We've encountered a few cases where the behaviour differs, and I think we need some help deciding whether we have a bug, whether we've found a bug in the reference, or whether the two implementations are just differently permissive in some edge cases.
There are 4 cases so far - if you'd prefer I split them into separate issues, let me know and I can do that and edit this one down to just one of them. Any help diagnosing the issues below would be appreciated: another pair of eyes verifying my findings, someone telling me if I'm misreading the docs, or a pointer to any tools in this repo that can help pull apart the more complex inputs.
Case 1
Input:
00000000: 28b5 2ffd c40c 3500 0000 0000 0000 4500 (./...5.......E.
00000010: 00e9 9501 5404 0215 011f 0ccd 46 ....T.......F
This is a frame with a single compressed block where the literals section is RLE of byte 0x95 and the sequences section contains RLE literal, match and offset codes and a single sequence. The issue is that the offset code is 0x02, which (according to the docs) means 2 bits should be read from the FSE bit-stream to get the offset value. However, the FSE bit-stream is the single byte 0x01, which has 0 bits available (i.e. it's just the first byte and only contains the padding bits and start-of-stream marker). I think this input is malformed, but it could be made well-formed by changing the offset code byte to 0x00 so that no bits are read from the bit-stream while still resulting in an offset of 1.
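To show what I mean by "0 bits available", here's a small Python sketch (the helper is mine, not from either implementation) of how I understand the reverse bit-stream's padding and start-of-stream marker:

```python
def usable_bits(stream: bytes) -> int:
    """Count readable bits in a reversed FSE bit-stream.

    The stream is read backwards from its last byte; that byte's
    highest set bit is the start-of-stream marker, and everything
    above it is padding. Bits below the marker are readable, plus
    8 bits for every earlier byte.
    """
    if not stream or stream[-1] == 0:
        raise ValueError("corrupt: no start-of-stream marker")
    marker_pos = stream[-1].bit_length() - 1  # index of highest set bit
    return marker_pos + 8 * (len(stream) - 1)

# The Case 1 bit-stream is the single byte 0x01: the marker is bit 0,
# so there are no readable bits, yet offset code 2 asks for 2 bits.
print(usable_bits(b"\x01"))  # -> 0
```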
The reference decompresses this input into 53 0x95 bytes, which is what the Zig implementation produces if the offset code byte is modified as described. My interpretation is that the reference isn't non-conformant here (unless this is a bug that causes problems on other inputs), just more permissive, and that the Zig implementation is still compliant in rejecting this input.
Case 2
Input: the file compare-alloc/id:000032,sig:06,src:000123,time:1670472,execs:7740992,op:havoc,rep:2 in this zip file.
It's a bit too long to look at by hand I think, but the issue seems to be in the Huffman tree decoding (that is at least what the reference reports).
A collaborator of mine stepped through the error we were getting in the reference with a debugger and found it was due to it returning corruption_detected here because rankStats[1] is 0 at that point. I am not familiar with the reference implementation code so I might be misinterpreting what I'm reading there, but it looks like the reference code is saying that a Huffman tree needs to have an even number (and at least 2) of symbols of weight 1 - I don't see anything in the docs that says this.
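To help pick apart inputs like this by hand, here's a small Python sketch of the weight arithmetic as I understand it from the docs (the helper name is mine): weight w contributes 2^(w-1), and the last symbol's weight is implied, completing the running total to the next power of two. The rank counts it returns are what I understand rankStats to hold.

```python
def huffman_ranks(weights):
    """Derive the implied last weight and per-weight symbol counts
    from the explicit weights of a Huffman tree description.
    Per my reading of the docs, weight w contributes 2**(w-1), and
    the last symbol's weight is whatever completes the running
    total to the next power of two."""
    total = sum(1 << (w - 1) for w in weights if w > 0)
    if total == 0:
        raise ValueError("corrupt: total weight is zero")
    rest = (1 << total.bit_length()) - total
    if rest & (rest - 1):
        raise ValueError("corrupt: leftover is not a power of two")
    last_weight = rest.bit_length()  # log2(rest) + 1
    ranks = {}
    for w in list(weights) + [last_weight]:
        if w > 0:
            ranks[w] = ranks.get(w, 0) + 1
    return last_weight, ranks

# Explicit weights [1, 1]: the implied last weight is 2, giving
# rank counts {1: 2, 2: 1} - two symbols of weight 1.
print(huffman_ranks([1, 1]))  # -> (2, {1: 2, 2: 1})
```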
I'm not really sure about this one - I think either I'm missing something in the docs or it's a bug in the reference.
While investigating this issue, I also noticed that there is a check that the total weight is not zero (before considering the implied extra weight), which the docs don't say is an error either. This hasn't caused any differences between the Zig and reference implementations that I've seen.
Edit: I don't think I was clear above, but this case is successfully decoded by the Zig implementation.
Case 3
I've got 2 inputs:
28b5 2ffd 9019 ff7f fdc4 0500 00
and
28b5 2ffd 0000 0000 0005 0000
The first of these contains a single compressed block, and the second contains an empty raw block followed by a compressed block. In both cases the compressed block has block header 0x05 0x00 0x00 (the last three bytes of each input). The reference decodes these to zero bytes, but my reading of the docs is that they are malformed frames and that a compressed block's content should always be at least 2 bytes (the literals and sequences section headers are required and each uses at least 1 byte).
I might be misinterpreting the first sentence or the section on compressed blocks, which might be supposed to mean that a compressed block can have a Block_Size of zero (or of one, but that would be confusing - what would that single byte mean?) - if so, I think the docs could be clarified. If I'm interpreting the docs correctly, then I think the reference is also compliant and just more permissive here.
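For anyone else looking at this, a quick Python sketch of how I'm decoding the 3-byte block header (helper name is mine):

```python
def parse_block_header(b0: int, b1: int, b2: int):
    """Decode a Zstandard block header: a 24-bit little-endian value
    where bit 0 = Last_Block, bits 1-2 = Block_Type and
    bits 3-23 = Block_Size."""
    header = b0 | (b1 << 8) | (b2 << 16)
    last_block = header & 1
    block_type = (header >> 1) & 3  # 0=Raw, 1=RLE, 2=Compressed, 3=Reserved
    block_size = header >> 3
    return last_block, block_type, block_size

# The header 0x05 0x00 0x00 from both inputs: a last block of type
# Compressed (2) with Block_Size 0.
print(parse_block_header(0x05, 0x00, 0x00))  # -> (1, 2, 0)
```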
Case 4
Input:
00000000: 28b5 2ffd e4bc 0000 0000 0000 003d 0200 (./..........=..
00000010: 464b 0b0b e019 0e02 5871 30d1 0566 0907 FK......Xq0..f..
00000020: 0007 0007 00df ee7b ffff fb0f cfbf f7fd .......{........
00000030: feff 1bff f7bb ffef ff03 ffff fbbf ffff ................
00000040: 02a8 1430 e0ff 9fff 13b0 81ff 1430 80fd ...0.........0..
00000050: 3ad1 0200 0601 023e 1c33 5a :......>.3Z
This input is compare-stream/MalformedBlock/id:000019,sig:06,src:000142,time:24158073,execs:33704590,op:havoc,rep:2 in this zip file, and the output from our fuzzer, which includes logs from the reference implementation, can be found in the file compare-stream/MalformedBlock/output.txt if anyone wants to look.
I have a less complete understanding of the issue in this frame, as I think the issue is in the decoding of the FSE tables in the sequences section. The following might be a bit verbose, but as I'm less sure of exactly what the issue is, I thought I'd better not elide anything I found while trying to diagnose it.
This is a single compressed block of (compressed) size 71. The literals section is of type Compressed_Literals_Block with regenerated size 180 and compressed size 45. The Zig and reference implementations agree up to this point (the reference logs say ZSTD_decodeLiteralsBlock : cSize=48, nbLiterals=180, and I'm interpreting cSize to include the 3 bytes of the literals section header, which are the bytes 0x46 0x4b 0x0b at offset 0x10).
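To make the header arithmetic easy to check, here's how I'm reading the 3-byte literals section header (a Python sketch of mine; it only handles the 3-byte Size_Formats of a Compressed_Literals_Block):

```python
def parse_literals_header_3byte(b0: int, b1: int, b2: int):
    """Decode a 3-byte Compressed_Literals_Block section header:
    bits 0-1 = Literals_Block_Type, bits 2-3 = Size_Format, then
    10 bits of Regenerated_Size and 10 bits of Compressed_Size."""
    header = b0 | (b1 << 8) | (b2 << 16)
    block_type = header & 3          # 2 = Compressed_Literals_Block
    size_format = (header >> 2) & 3  # 0 or 1 -> 3-byte header
    regenerated_size = (header >> 4) & 0x3FF
    compressed_size = (header >> 14) & 0x3FF
    return block_type, size_format, regenerated_size, compressed_size

# The bytes 0x46 0x4b 0x0b at offset 0x10:
print(parse_literals_header_3byte(0x46, 0x4B, 0x0B))  # -> (2, 1, 180, 45)
```

That matches the 180/45 above, and 45 plus the 3 header bytes gives the cSize=48 in the reference logs.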
The sequences section starts at offset 0x40, so there are 2 sequences, and both implementations seem to be in sync up to here. The compression modes byte is 0xa8, so all three table modes are FSE_Compressed_Mode. The Zig and reference implementations produce the same first sequence (in terms of literal_length, match_length and offset), but an error is reported by the Zig implementation indicating that there are not enough bits in the bit-stream to update the FSE states for the second sequence. The reference logs I've seen don't give enough information about how many bits individual operations consume to tell exactly how it's consuming the bits for the FSE tables or while decoding sequences (maybe there are some other flags we can turn on?).
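For completeness, this is how I'm decoding the compression modes byte (another small sketch of mine):

```python
def parse_compression_modes(byte: int):
    """Decode the Sequences_Section compression modes byte:
    bits 6-7 = Literals_Lengths_Mode, bits 4-5 = Offsets_Mode,
    bits 2-3 = Match_Lengths_Mode, bits 0-1 reserved (must be 0)."""
    modes = ("Predefined", "RLE", "FSE_Compressed", "Repeat")
    if byte & 3:
        raise ValueError("corrupt: reserved bits must be zero")
    literals_lengths = modes[(byte >> 6) & 3]
    offsets = modes[(byte >> 4) & 3]
    match_lengths = modes[(byte >> 2) & 3]
    return literals_lengths, offsets, match_lengths

# The byte 0xa8 = 0b10101000: mode 2 for all three tables.
print(parse_compression_modes(0xA8))
# -> ('FSE_Compressed', 'FSE_Compressed', 'FSE_Compressed')
```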
This leads me to believe the issue (either in Zig or the reference) lies somewhere in decoding the sequence FSE tables, with the Zig implementation consuming more bytes than the reference, causing the reverse FSE bit-stream to be shorter and to run out of bits where the reference's does not. The Zig implementation thinks that the reverse FSE bit-stream is the bytes 0xd1 02 00 06 01 02 (starting at offset 0x51), and it runs out while trying to read bits to update the offset state (all states apparently want to read 1 bit in order to update). I'm not sure how much longer the reference thinks the reverse FSE bit-stream is, but it says the second sequence is seq: litL=0, matchL=5, offset=125.
I haven't tried to decode the FSE tables by hand (which would probably be my next step), in the hope that there might be a tool you trust to help me do this, or that someone has an idea what the problem might be or a better way to proceed.