sevenz-rust

Decompressing 7z+ZSTD is missing entries

Open msumner91 opened this issue 7 months ago • 0 comments

Context:

  • I compress a folder using py7zr >= 0.21.0:
    ZSTD_FILTER = [{"id": FILTER_ZSTD, "level": ZSTD_COMPRESSION_LEVEL}]
    with SevenZipFile(<path-to-7z>, mode="w", filters=ZSTD_FILTER) as zst_handle:
        for root, dirs, files in os.walk(<input-path>):
            for node in files + dirs:
                zst_handle.write(os.path.join(root, node), os.path.relpath(os.path.join(root, node), <input-path>))
  • With sevenz-rust = { version = "0.6.1", features = ["zstd"] } I then try to decompress this file by calling sevenz_rust::decompress_file(7z_file, &args.dest)
  • The extraction reports success, but a number of files are missing from the extracted directories
  • The same archive extracts without issue using the 7z utility, so it appears to be well formed.
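Because the extraction reports success, the missing files only show up when the extracted tree is compared with the original. A minimal sketch of such a check, using only the Rust standard library: it counts every file and directory below a root, which matches the Python snippet above since that writes both files and dirs into the archive. The function name count_entries and the command-line handling are illustrative and not part of sevenz-rust.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively counts every file and directory below `root`
/// (`root` itself is not counted).
fn count_entries(root: &Path) -> io::Result<usize> {
    let mut count = 0;
    for entry in fs::read_dir(root)? {
        let entry = entry?;
        count += 1;
        // Descend into real directories only; symlinks are counted but not followed.
        if entry.file_type()?.is_dir() {
            count += count_entries(&entry.path())?;
        }
    }
    Ok(count)
}

fn main() -> io::Result<()> {
    // Hypothetical usage: <program> <input-path> <extracted-path>
    let args: Vec<String> = std::env::args().skip(1).collect();
    if let [source, extracted] = args.as_slice() {
        let source_count = count_entries(Path::new(source))?;
        let extracted_count = count_entries(Path::new(extracted))?;
        println!("source entries:    {}", source_count);
        println!("extracted entries: {}", extracted_count);
        if source_count != extracted_count {
            eprintln!("entry counts differ: some entries were not extracted");
        }
    } else {
        eprintln!("usage: <program> <input-path> <extracted-path>");
    }
    Ok(())
}
```

A matching count does not prove the contents are identical, but a mismatch is enough to flag an incomplete extraction.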

Debugging:

  • Debugging this with a local copy of sevenz-rust shows that this loop in reader.rs does not iterate over all of the files
  • Changing this line to for file_index in start..(archive.files.len() + start) fixed this in my case
  • In this archive self.archive.folders.len() = 1, so folder_dec.for_each_entries is only called once
  • However, that single call does not visit every file, because file_count (taken from archive.folders[folder_index].num_unpack_sub_streams) is lower than the number of entries:
    file_count = 204
    archive.files.len() = 233
  • It looks like file_count is read from the archive's numStreams header on this line, and it is indeed 204 in this case, not 233
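The mismatch above can be reproduced in a small model. The sketch below is not sevenz-rust code: Entry, stream_count and visited are hypothetical names. It rests on one assumption that this issue does not confirm, namely that the archive's entry list includes items without a data stream (directories, zero-byte files) and that these are not counted in numStreams. Under that assumption, a loop bounded by the stream count stops before the end of the entry list.

```rust
/// Hypothetical, simplified archive entry; NOT the sevenz-rust type.
#[derive(Debug, Clone, PartialEq)]
struct Entry {
    name: String,
    /// Assumption: directories and empty files carry no data stream.
    has_stream: bool,
}

/// Number of entries that own a data stream
/// (stands in for the stream count read from the header).
fn stream_count(entries: &[Entry]) -> usize {
    entries.iter().filter(|e| e.has_stream).count()
}

/// Names visited by a loop shaped like `start..(count + start)`,
/// clamped here to the length of the entry list.
fn visited(entries: &[Entry], start: usize, count: usize) -> Vec<&str> {
    let end = (start + count).min(entries.len());
    entries[start.min(end)..end]
        .iter()
        .map(|e| e.name.as_str())
        .collect()
}

fn main() {
    let entries = vec![
        Entry { name: "docs".into(), has_stream: false },           // directory
        Entry { name: "docs/a.txt".into(), has_stream: true },
        Entry { name: "docs/empty.txt".into(), has_stream: false }, // zero-byte file
        Entry { name: "docs/b.txt".into(), has_stream: true },
    ];

    // Bounded by the stream count: stops after two entries.
    let by_streams = visited(&entries, 0, stream_count(&entries));
    // Bounded by the entry count: reaches every entry.
    let by_entries = visited(&entries, 0, entries.len());

    println!("bounded by stream count: {:?}", by_streams);
    println!("bounded by entry count:  {:?}", by_entries);
}
```

In this model the stream count is 2 while the list holds 4 entries, so the first loop never reaches docs/b.txt even though it has data. The clamp in visited is a choice of this sketch only; it keeps the range inside the list when start is greater than zero.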

Questions / thoughts:

  • Is it safe to rely on the numStreams header to determine the number of loop iterations?
  • Is there a better way to determine the loop iterations?

Any help / advice would be much appreciated

msumner91 · Jul 23 '24 15:07