zstd
Store uncompressed size of whole file
Currently "zstd --list" iterates all compressed blocks, which makes it super slow when lseek(2) is expensive, e.g., on a network filesystem.
Can zstd store the uncompressed file size directly, like xz does, so that listing time becomes independent of file size?
Such a field is not part of the format. There's technically no way for a streamer to know the future size of an entire frame at the moment the header is generated.
Well, I think a seek table is not as fast, but it is sufficient for quickly listing the uncompressed size: https://github.com/facebook/zstd/blob/dev/contrib/seekable_format/zstd_seekable_compression_format.md
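To make the seek table idea concrete, here is a rough Python sketch of reading the total uncompressed size from the end of a file written in the seekable format. It is based on my reading of the linked spec (a 9-byte footer ending in the magic 0x8F92EAB1, with the table wrapped in a skippable frame whose magic is 0x184D2A5E); the function name `size_from_seek_table` is made up for illustration, and this is not what zstd --list does today.

```python
import io
import struct

SEEKABLE_MAGIC = 0x8F92EAB1    # last 4 bytes of a seekable-format file (per the linked spec)
SKIPPABLE_MAGIC = 0x184D2A5E   # skippable frame that wraps the seek table
FOOTER_LEN = 9                 # Number_Of_Frames (4) + Seek_Table_Descriptor (1) + magic (4)


def size_from_seek_table(f):
    """Return the summed decompressed size from the seek table, or None if there is none.

    Hypothetical helper: it needs two seeks near the end of the file,
    regardless of how large the file is.
    """
    f.seek(0, io.SEEK_END)
    file_len = f.tell()
    if file_len < FOOTER_LEN + 8:
        return None

    f.seek(file_len - FOOTER_LEN)
    num_frames, descriptor, magic = struct.unpack("<IBI", f.read(FOOTER_LEN))
    if magic != SEEKABLE_MAGIC:
        return None  # not a seekable-format file

    # Bit 7 of the descriptor says whether each entry carries a 4-byte checksum.
    entry_len = 12 if descriptor & 0x80 else 8
    table_len = num_frames * entry_len
    start = file_len - FOOTER_LEN - table_len - 8
    if start < 0:
        raise ValueError("seek table larger than file")

    f.seek(start)
    skip_magic, frame_size = struct.unpack("<II", f.read(8))
    if skip_magic != SKIPPABLE_MAGIC or frame_size != table_len + FOOTER_LEN:
        raise ValueError("malformed seek table")

    table = f.read(table_len)
    total = 0
    for offset in range(0, table_len, entry_len):
        _compressed, decompressed = struct.unpack_from("<II", table, offset)
        total += decompressed
    return total
```

Note that this only helps for files produced by a seekable-format writer, and it trusts that nothing was appended after the seek table.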
Oh, btw, the uncompressed frame size is part of the format; it's not the reason why zstd --list seeks into the file.
The issue is that a file may consist of multiple concatenated frames. In order to know that, the only way is to reach the end of the (compressed) frame, and see if there is more data. If there is, start decoding again, and aggregate the results. If there is not, stop there.
So, in situations where the file has only one frame, the fseek() work still happens but ends up not being useful.
But before reaching the end of the file, there's no way to know a priori that the file consists of a single frame, since appending is a decision that can happen much later, and returning the uncompressed size of the first frame would be misleading if there are several appended frames in the file.
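For illustration, the frame-by-frame walk described above could be sketched roughly like this in Python. It is a simplified model written from the zstd frame format, not the actual CLI code; the names `total_content_size` and `raw_frame` are hypothetical. It shows why one seek per block is needed: the only way to find the end of a frame is to hop over every block header until the last-block bit is set, then check whether another frame follows.

```python
import io
import struct

ZSTD_MAGIC = 0xFD2FB528
SKIPPABLE_BASE = 0x184D2A50   # skippable magics are 0x184D2A50..0x184D2A5F


def read_exact(f, n):
    data = f.read(n)
    if len(data) != n:
        raise ValueError("truncated input")
    return data


def total_content_size(f):
    """Return (zstd_frame_count, total); total is None if any frame omits its size."""
    frames, total = 0, 0
    while True:
        magic_bytes = f.read(4)
        if not magic_bytes:
            return frames, total          # end of file: no more appended frames
        if len(magic_bytes) != 4:
            raise ValueError("truncated input")
        (magic,) = struct.unpack("<I", magic_bytes)
        if magic & 0xFFFFFFF0 == SKIPPABLE_BASE:
            (size,) = struct.unpack("<I", read_exact(f, 4))
            f.seek(size, io.SEEK_CUR)     # skippable frames hold no content
            continue
        if magic != ZSTD_MAGIC:
            raise ValueError("not a zstd frame")

        descriptor = read_exact(f, 1)[0]
        fcs_flag = descriptor >> 6
        single_segment = (descriptor >> 5) & 1
        has_checksum = (descriptor >> 2) & 1
        if not single_segment:
            read_exact(f, 1)              # Window_Descriptor
        read_exact(f, (0, 1, 2, 4)[descriptor & 3])   # Dictionary_ID

        fcs_len = (1 if single_segment else 0, 2, 4, 8)[fcs_flag]
        if fcs_len == 0:
            total = None                  # size unknown without decompressing
        else:
            value = int.from_bytes(read_exact(f, fcs_len), "little")
            if fcs_len == 2:
                value += 256              # 2-byte field is stored with an offset
            if total is not None:
                total += value

        # Hop over blocks: one small read plus one seek per block.
        # (A seek past EOF is not detected here; a real tool should check.)
        while True:
            header = int.from_bytes(read_exact(f, 3), "little")
            last, block_type, block_size = header & 1, (header >> 1) & 3, header >> 3
            if block_type == 3:
                raise ValueError("reserved block type")
            f.seek(1 if block_type == 1 else block_size, io.SEEK_CUR)  # RLE stores 1 byte
            if last:
                break
        if has_checksum:
            f.seek(4, io.SEEK_CUR)
        frames += 1


def raw_frame(payload):
    """Hypothetical test helper: one frame holding a single raw (uncompressed) block."""
    assert len(payload) < 256
    header = struct.pack("<I", ZSTD_MAGIC) + bytes([0x20, len(payload)])
    return header + (1 | (len(payload) << 3)).to_bytes(3, "little") + payload
```

Calling `total_content_size(io.BytesIO(raw_frame(b"hello") + raw_frame(b"world!")))` walks both concatenated frames and adds their sizes; stopping after the first frame header would have reported only the first frame's size.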