
Store uncompressed size of whole file

Open Kawashima-Azumi opened this issue 3 years ago • 3 comments

Currently "zstd --list" iterates over all compressed blocks, which makes it super slow when lseek(2) is expensive, e.g., on a network filesystem.

Can zstd store the uncompressed file size directly, like xz does, so that listing time becomes independent of file size?

Kawashima-Azumi, Dec 16 '21 12:12

Such a field is not part of the format. There's technically no way for a streamer to know the future size of an entire frame at the moment the header is generated.

Cyan4973, Dec 16 '21 17:12

Well, I think a seek table would not be as fast, but it would be sufficient for quickly listing the uncompressed size: https://github.com/facebook/zstd/blob/dev/contrib/seekable_format/zstd_seekable_compression_format.md
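To make the idea concrete, here is a rough sketch of how a reader might get the total uncompressed size from a seek table with a couple of seeks near the end of the file. It assumes the layout in the linked document as I read it: a skippable frame with magic 0x184D2A5E, 8- or 12-byte entries, and a 9-byte footer ending in 0x8F92EAB1. The function name `size_from_seek_table` is made up for illustration and is not part of zstd.

```python
import struct

SEEKABLE_MAGIC = 0x8F92EAB1   # last 4 bytes of the seek table footer
SKIPPABLE_MAGIC = 0x184D2A5E  # skippable frame that carries the seek table
FOOTER_SIZE = 9               # Number_Of_Frames (4) + descriptor (1) + magic (4)


def size_from_seek_table(f):
    """Return the summed decompressed size, or None if no seek table is found.

    `f` is a seekable binary file object. Hypothetical helper, not a zstd API.
    """
    f.seek(0, 2)
    file_size = f.tell()
    if file_size < 8 + FOOTER_SIZE:
        return None

    # Read the footer at the very end of the file.
    f.seek(-FOOTER_SIZE, 2)
    num_frames, descriptor, magic = struct.unpack("<IBI", f.read(FOOTER_SIZE))
    if magic != SEEKABLE_MAGIC:
        return None

    # Bit 7 of the descriptor says whether each entry carries a 4-byte checksum.
    entry_size = 12 if descriptor & 0x80 else 8
    table_size = num_frames * entry_size
    frame_size = 8 + table_size + FOOTER_SIZE
    if frame_size > file_size:
        return None

    # Jump back to the start of the skippable frame and sanity-check its header.
    f.seek(-frame_size, 2)
    skip_magic, skip_len = struct.unpack("<II", f.read(8))
    if skip_magic != SKIPPABLE_MAGIC or skip_len != table_size + FOOTER_SIZE:
        return None

    # Each entry is Compressed_Size, Decompressed_Size[, Checksum], little-endian.
    entries = f.read(table_size)
    return sum(
        struct.unpack_from("<I", entries, i * entry_size + 4)[0]
        for i in range(num_frames)
    )
```

The cost here depends on the number of frames rather than the number of blocks, which is why it would likely be cheaper than the current listing on a slow filesystem, though it only works for files written in the seekable format.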

Kawashima-Azumi, Dec 17 '21 11:12

Oh, btw, the uncompressed frame size is part of the format; it's not the reason why zstd --list seeks into the file.

The issue is that a file may consist of multiple concatenated frames. In order to know that, the only way is to reach the end of the (compressed) frame, and see if there is more data. If there is, start decoding again, and aggregate the results. If there is not, stop there.

So, in situations where the file has only one frame, the fseek() work still happens but ends up not being useful. But before reaching the end of the file, there's no way to know a priori that the file consists of a single frame, since appending is a decision that can happen much later, and returning the uncompressed size of the first frame would be misleading if there are several appended frames in the file.
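For illustration, the frame-by-frame walk described above can be sketched roughly as follows. This is a hand-rolled parse of the frame and block headers based on the published format, not the actual zstd --list code, and `list_frames` is a hypothetical name. It works on an in-memory buffer; on a real file each skip over a block would be a seek.

```python
import struct

ZSTD_MAGIC = 0xFD2FB528
SKIPPABLE_MAGIC_BASE = 0x184D2A50  # the low 4 bits of a skippable magic vary


def list_frames(data: bytes):
    """Return (frame_count, total_content_size) for concatenated zstd frames.

    total_content_size is None if any frame omits its content size field.
    Hypothetical helper for illustration only.
    """
    pos, frames, total, known = 0, 0, 0, True
    while pos < len(data):
        (magic,) = struct.unpack_from("<I", data, pos)
        pos += 4
        if magic & 0xFFFFFFF0 == SKIPPABLE_MAGIC_BASE:
            # Skippable frame: 4-byte length, then that many bytes to ignore.
            (size,) = struct.unpack_from("<I", data, pos)
            pos += 4 + size
            continue
        if magic != ZSTD_MAGIC:
            raise ValueError("not a zstd frame")

        # Frame header descriptor byte.
        fhd = data[pos]
        pos += 1
        fcs_flag = fhd >> 6
        single_segment = (fhd >> 5) & 1
        has_checksum = (fhd >> 2) & 1
        if not single_segment:
            pos += 1  # window descriptor
        pos += (0, 1, 2, 4)[fhd & 3]  # dictionary ID

        # Frame content size: the per-frame uncompressed size, if present.
        fcs_len = (1 if single_segment else 0, 2, 4, 8)[fcs_flag]
        if fcs_len == 0:
            known = False
        else:
            fcs = int.from_bytes(data[pos:pos + fcs_len], "little")
            total += fcs + 256 if fcs_len == 2 else fcs
        pos += fcs_len

        # Hop from block header to block header until the last block.
        while True:
            if pos + 3 > len(data):
                raise ValueError("truncated frame")
            header = int.from_bytes(data[pos:pos + 3], "little")
            pos += 3
            last, btype, bsize = header & 1, (header >> 1) & 3, header >> 3
            if btype == 3:
                raise ValueError("reserved block type")
            pos += 1 if btype == 1 else bsize  # an RLE block stores one byte
            if last:
                break
        if has_checksum:
            pos += 4
        frames += 1
    return frames, (total if known else None)
```

Only after the inner loop reaches the last block of a frame can the outer loop tell whether more data follows, which is the point made above: the size of the first frame is available immediately, but the number of frames is not.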

Cyan4973, Mar 21 '22 21:03