Some questions about Zstd
- Huffman literals: what is the minimum number of header bytes?
- Huffman FSE: what is the minimum number of weights we should provide (including the missing weight)?
- In the spec: "Stream4_Size = Total_Streams_Size - 6 - Stream1_Size - Stream2_Size - Stream3_Size. Note: if Stream1_Size + Stream2_Size + Stream3_Size > Total_Streams_Size, data is considered corrupted." Should the note rather read: Total_Streams_Size < Stream1_Size + Stream2_Size + Stream3_Size + 6 + 1 (the 1 because stream 4 is at least 1 byte)?
- Literals stream: can the regenerated size be zero? Can the compressed size be zero?
- FSE sequences: what if we configure an offset size of 0 by mistake?
- Block size: does the 128K limit on compressed size include the block header (3 bytes) or not?
Thanks,
Shuli
Stream4_Size = Total_Streams_Size - 6 - Stream1_Size - Stream2_Size - Stream3_Size. Note: if Stream1_Size + Stream2_Size + Stream3_Size > Total_Streams_Size, data is considered corrupted. Should the note rather read: Total_Streams_Size < Stream1_Size + Stream2_Size + Stream3_Size + 6 + 1 (the 1 because stream 4 is at least 1 byte)?
reworked : I initially misinterpreted the question by focusing on decompressed size. I now understand the question relates to the compressed size.
Yes, that's correct. A bitstream of Huffman-compressed symbols is at least 1 byte long, even if it doesn't contain any symbol. Each stream must carry at least one bit for the end-flag, so even an empty stream has at least 1 bit and must therefore occupy at least 1 byte.
As an example, the following line of code states the same thing :
if (cSrcSize < 10) return ERROR(corruption_detected); /* strict minimum : jump table + 1 byte per stream */
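To make the size arithmetic concrete, here is a minimal sketch of how a decoder might split the 4 streams. It is not zstd's actual code: `split_four_streams` and `JUMP_TABLE_SIZE` are hypothetical names, and the check is written in the stricter form discussed here (the jump table plus at least 1 byte for every stream).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant: the jump table holds three 2-byte little-endian sizes. */
#define JUMP_TABLE_SIZE 6

/* Hypothetical helper, not part of zstd.
 * src points at the jump table, total is Total_Streams_Size.
 * Fills sizes[0..3] and returns 0, or returns -1 if the data looks corrupted. */
int split_four_streams(const uint8_t* src, size_t total, size_t sizes[4])
{
    size_t used;
    int i;

    /* strict minimum: jump table + 1 byte per stream */
    if (total < JUMP_TABLE_SIZE + 4) return -1;

    for (i = 0; i < 3; i++) {
        sizes[i] = (size_t)src[2 * i] | ((size_t)src[2 * i + 1] << 8);
        if (sizes[i] < 1) return -1;   /* no room for the end-flag bit */
    }

    used = JUMP_TABLE_SIZE + sizes[0] + sizes[1] + sizes[2];
    if (used >= total) return -1;      /* nothing left for stream 4 */

    sizes[3] = total - used;           /* Stream4_Size, guaranteed >= 1 */
    return 0;
}
```

With a 10-byte input whose jump table declares three 1-byte streams, the function leaves exactly 1 byte for stream 4; any smaller total, or any larger declared sizes, is rejected.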
Block size: does the 128K limit on compressed size include the block header (3 bytes) or not?
No, it doesn't.
Huffman literals: what is the minimum number of header bytes?
If we assume the presence of a Literals Section, which means the current block is a compressed block, and if we assume that the literals are compressed using a Huffman prefix code, then the minimum header size of the Literals Section is 3 bytes. It can grow up to 5 bytes.
This is explained in detail in : https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#literals_section_header
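As an illustration of the header sizes described in that section of the format document, the following sketch derives the Literals Section header size from its first byte. The function name `literals_header_size` and the enum labels are hypothetical; only the bit layout (block type in the 2 lowest bits, Size_Format in the next 2) is taken from the specification.

```c
#include <stdint.h>

/* Hypothetical labels for the 2-bit Literals_Block_Type field. */
enum { LIT_RAW = 0, LIT_RLE = 1, LIT_COMPRESSED = 2, LIT_TREELESS = 3 };

/* Hypothetical helper: size in bytes of the Literals Section header,
 * derived from its first byte only. */
int literals_header_size(uint8_t first_byte)
{
    int const block_type  = first_byte & 3;         /* 2 lowest bits */
    int const size_format = (first_byte >> 2) & 3;  /* next 2 bits */

    if (block_type == LIT_RAW || block_type == LIT_RLE) {
        /* Only Regenerated_Size is stored: 5, 12 or 20 bits. */
        if ((size_format & 1) == 0) return 1;
        return (size_format == 1) ? 2 : 3;
    }

    /* Compressed or treeless: Regenerated_Size and Compressed_Size
     * are both stored, using 10, 14 or 18 bits each. */
    switch (size_format) {
        case 0:             /* single stream, 10-bit sizes */
        case 1: return 3;   /* 4 streams, 10-bit sizes */
        case 2: return 4;   /* 4 streams, 14-bit sizes */
        default: return 5;  /* 4 streams, 18-bit sizes */
    }
}
```

For Huffman-compressed literals the result is always between 3 and 5, while a raw literals section with a small size needs a single byte.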
Literals stream: should the regenerated size be zero? Should the compressed size be zero?
I assume you mean "could these sizes be zero", and yes, regenerated size can be zero.
Of course, compressing a 0-size literals section with Huffman is silly and wasteful. One should rather employ a raw-content literals section with a regenerated size of zero, which only needs 1 byte of header. But from a decoder standpoint, this scenario is not forbidden.
Note that compressed size cannot be zero: even in a single-stream scenario, it still needs to be at least 1 byte long to contain the final 1-bit end-flag.
In 4-streams scenarios, a valid compressed size is necessarily >= 10.
Hi, I would like to add more details to some of my questions to make them clearer.
- What is the minimum "HeaderByte" field value when literals are compressed by FSE? (We think it should be 2B, because at least 1B is consumed by the FSE table and at least 1B is used for the FSE bitstream.)
- What is the minimum number of weights for literals compressed by FSE? (We think 3 weights, including the missing weight, because 2 weights are mandatory for constructing the FSE table.)
- Does the "block_size" field include the 3B of the block header? So can the block size actually be 128K + 3B?
- The "regenerated_size" and "compressed_size" fields have more bits than necessary to represent a 128K size. Should the remaining MSBs be reserved?
Thanks,
- What is the minimum "HeaderByte" field value when literals are compressed by FSE? (We think it should be 2B, because at least 1B is consumed by the FSE table and at least 1B is used for the FSE bitstream.)
This sentence is confusing, because literals are not compressed with FSE. Literals are compressed using Huffman. Then, the Huffman's header (when present) is compressed with FSE. Is that what you mean ?
I mean the minimum value of the header byte when FSE is used in the literals part.
Header byte: the size of the tree / Huffman weights in the literals part.
OK, so this is the Huffman tree header.
Indeed, when weights are compressed with FSE, there is necessarily a minimum length to represent them. I expect this minimum length to be 3 bytes, 1 byte for the FSE table, and 2 bytes for the FSE bitstream.
That's because the default FSE implementation uses a double-state strategy, and requires the state width to be at least 5 bits. So a bitstream starts with 2 full-length states, hence 10 bits, plus 1 bit for the end-flag. That makes 11 bits, which requires at least 2 bytes.
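The bitstream arithmetic above can be written as a small helper. This is only a sketch of the reasoning, with a hypothetical function name, and it assumes (as in the explanation) that the stream contains nothing but the initial states and the end-flag bit.

```c
#include <stddef.h>

/* Hypothetical helper: minimum number of bytes for an FSE bitstream that
 * holds only the initial states plus the 1-bit end-flag. */
size_t min_fse_bitstream_bytes(unsigned nb_states, unsigned state_bits)
{
    size_t const bits = (size_t)nb_states * state_bits + 1;  /* +1: end-flag */
    return (bits + 7) / 8;                                   /* round up to bytes */
}
```

Calling `min_fse_bitstream_bytes(2, 5)` gives 2, matching the 2-byte estimate for two 5-bit states.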
On the FSE table side, one could imagine only 2 symbols, the first one using almost the entire probability space, requiring 5 bits, and the second one taking the rest, requiring 2 bits. So it would fit in a single byte.
These are limit scenarios. Real-world use cases are generally far from these limits.
Dear @Cyan4973, I hope you're doing well, let me bother you with a question.
zstd 1.5.5, Windows 7 x64
$ zstd "2024-0124-1201 Index - Belföld - Kárpátaljai menekültek zaklatják a fiatal lányokat Kerepesen.html"
zstd: can't stat 2024-0124-1201 Index - Belfold - Karpataljai menekultek zaklatjak a fiatal lanyokat Kerepesen.html : No such file or directory -- ignored
The program probably doesn't like the presence of some characters in the file name. What can I do other than rename the file (I have a lot of them)?
No idea.
zstd doesn't do anything special, it uses the C standard fopen() function.
This should be naturally compatible with any utf-8 character set.
But I suspect Windows does something different with the character set,
so there might be a need to employ other non-portable Windows variants of fopen for these scenarios.
Renaming your file using only ASCII characters will likely fix this issue, but I presume you'd rather not.
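For illustration only, here is what such a non-portable variant could look like. This is a sketch, not what zstd does: `fopen_utf8` is a hypothetical helper, and it assumes the path string really is UTF-8. On Windows it converts the name with `MultiByteToWideChar` and opens it with `_wfopen`; elsewhere it falls back to plain `fopen`. Note that on Windows the `argv` strings passed to `main` are normally in the active ANSI code page, so a program may also need to obtain its arguments in wide form for this to help.

```c
#include <stdio.h>
#ifdef _WIN32
#include <stdlib.h>
#include <wchar.h>
#include <windows.h>
#endif

/* Hypothetical helper: open a file whose name is given as a UTF-8 string. */
FILE* fopen_utf8(const char* path, const char* mode)
{
#ifdef _WIN32
    FILE* f = NULL;
    /* First pass: ask for the required length in wide chars (includes the NUL). */
    int const plen = MultiByteToWideChar(CP_UTF8, 0, path, -1, NULL, 0);
    int const mlen = MultiByteToWideChar(CP_UTF8, 0, mode, -1, NULL, 0);
    wchar_t* wpath;
    wchar_t* wmode;

    if (plen <= 0 || mlen <= 0) return NULL;   /* invalid UTF-8 or empty input */
    wpath = (wchar_t*)malloc((size_t)plen * sizeof(wchar_t));
    wmode = (wchar_t*)malloc((size_t)mlen * sizeof(wchar_t));

    /* Second pass: perform the conversion, then open with the wide-char API. */
    if (wpath != NULL && wmode != NULL
     && MultiByteToWideChar(CP_UTF8, 0, path, -1, wpath, plen) > 0
     && MultiByteToWideChar(CP_UTF8, 0, mode, -1, wmode, mlen) > 0) {
        f = _wfopen(wpath, wmode);
    }
    free(wpath);
    free(wmode);
    return f;
#else
    /* Assumption: on other systems the C library accepts UTF-8 paths as-is. */
    return fopen(path, mode);
#endif
}
```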