
flate2 ZlibDecoder reads too much data from files.

Open kallisti5 opened this issue 2 years ago • 3 comments

The flate2 ZlibDecoder seems to read more data than the compressed stream contains, advancing the file pointer too far.

I'm parsing a raw file made up of consecutive chunks of zlib-compressed data:

  • 80-22216 Zlib Chunk 1
    • dd if=sample/ctags_source-5.8-5-source.hpkg bs=1 skip=80 count=22137 of=test1
    • file test1 ; test1: zlib compressed data
  • 22217-44314 Zlib Chunk 2
    • dd if=sample/ctags_source-5.8-5-source.hpkg bs=1 skip=22217 count=22097 of=test2
    • file test2 ; test2: zlib compressed data

I can validate these chunks:

  • cat test1 | zlib-flate -uncompress > test1.uncompressed
  • cat test2 | zlib-flate -uncompress > test2.uncompressed
[kallisti5@eris hpkg-rs]$ ls -lah test1.uncompressed 
-rw-r--r-- 1 kallisti5 users 64K Feb 28 09:19 test1.uncompressed
[kallisti5@eris hpkg-rs]$ ls -lah test2.uncompressed 
-rw-r--r-- 1 kallisti5 users 64K Feb 28 09:20 test2.uncompressed

However, when I try to decode these chunks with flate2's ZlibDecoder:

       /// Inflate the heap section of a hpkg for later processing
        /// XXX: This will likely need reworked... just trying to figure out what's going on
        fn inflate_heap(&mut self) -> Result<usize, Box<dyn error::Error>> {
                let header = self.header.as_ref().unwrap();
                let filename = self.filename.as_ref().unwrap();

                println!("header: Heap chunk size: {}", header.heap_chunk_size);
                println!("header: Heap compressed size: {}", header.heap_size_compressed);
                println!("header: Heap uncompressed size: {}", header.heap_size_uncompressed);
                println!("header: Heap compression: {}", header.heap_compression);

                let mut f = File::open(filename)?;

                let mut pos = header.header_size as u64;
                f.seek(SeekFrom::Start(pos))?;

                while pos < header.heap_size_compressed {
                        println!("Seek from file {}, heap {}", pos, pos - header.header_size as u64);
                        let mut reader: Box<dyn Read> = match header.heap_compression {
                                0 => Box::new(&f),
                                1 => Box::new(ZlibDecoder::new(&f)),
                                2 => Box::new(zstd::stream::read::Decoder::new(&f)?),
                                _ => return Err(From::from(format!("Unknown hpkg heap compression: {}", header.heap_compression)))
                        };
                        let mut buffer = vec![0; header.heap_chunk_size as usize];
                        reader.read_exact(&mut buffer)?;
                        self.heap_data.push(buffer);
                        pos = (&f).stream_position()?;
                }
                Ok(0)
        }
---- package::tests::test_package_load_valid stdout ----
header: Heap chunk size: 65536     (uncompressed yo)
header: Heap compressed size: 501432
header: Heap uncompressed size: 1988947
header: Heap compression: 1
Seek from file 80, heap 0
Seek from file 32848, heap 32768
ERROR: corrupt deflate stream
thread 'package::tests::test_package_load_valid' panicked at 'assertion failed: false', src/package.rs:297:17
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

read_exact produces 64k of uncompressed data as expected, but the file pointer has moved 32k into the file instead of the expected 22057 bytes. (22057 marks the end of the first compressed data stream and the start of the next one, which begins with 0x78, 0xDA.)

kallisti5, Feb 28 '23 15:02

I feel like the solution is that once ZlibDecoder has read the final 32k chunk, it should back the source reader up to the end of the zlib stream?

This would enable users to know where the last compressed stream ended within a file.

kallisti5, Feb 28 '23 15:02

https://github.com/rust-lang/flate2-rs/blob/main/src/deflate/read.rs#L161 feels like the source of the guaranteed 32k read from the files.

kallisti5, Feb 28 '23 16:02

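I think this is the same problem as #367, except that issue is for gzip. Essentially, this is expected behavior for the read interfaces, which wrap the Read type in a new std::io::BufReader for each decoder. Each BufReader fills its buffer from the file, and whatever the decoder has not consumed is lost when the decoder is dropped.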
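The over-read can be reproduced with the standard library alone, without flate2. The following is a minimal sketch, not flate2's actual code: the function name is hypothetical, and the 32 KiB capacity is an assumption chosen to match the jump from 80 to 32848 seen in the log above. It reads a few bytes through a fresh BufReader and then reports how far the underlying source was advanced.

```rust
use std::io::{BufReader, Cursor, Read};

/// Reads `n` bytes through a freshly created `BufReader` (32 KiB capacity,
/// an assumed value) and returns the position of the *underlying* source
/// afterwards. The name is hypothetical, for illustration only.
fn source_position_after_buffered_read(data: &[u8], n: usize) -> u64 {
    let mut src = Cursor::new(data);
    {
        // A new buffer is created around the source, as the `read`
        // decoders are described as doing for each decoder instance.
        let mut buffered = BufReader::with_capacity(32 * 1024, &mut src);
        let mut out = vec![0u8; n];
        buffered.read_exact(&mut out).unwrap();
        // `buffered` is dropped here; bytes it pulled from `src` but never
        // handed out are discarded along with it.
    }
    src.position()
}
```

Asking for 10 bytes from a 100,000-byte source leaves the source positioned at 32768, not 10, which is the same shape of problem as the file pointer in the report.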

To fix this, wrap the File in a BufReader once; you can then pass it to multiple bufread::ZlibDecoder instances.

This will, however, also make stream_position incorrect, even if you can access it, so you will need a different way to know when to terminate the loop.

jongiddy, Jul 30 '23 16:07