
Infinite loop using stream_unzip()

Open · DwayneDriskill opened this issue 1 year ago · 12 comments

Hi,

I am using stream-unzip 0.0.91 to unzip a file that was zipped using the deflate64 algorithm. The file was zipped using Windows 11's built-in zipper/unzipper, and I am able to unzip it with that same tool. The files look fine after being unzipped, so I don't think they are corrupt. Unfortunately, I cannot share the file because it contains data that I am not allowed to share. I can tell you that the zip contains 7 .json files. 6 of the files unzip without any issues. stream_unzip() gets stuck in an infinite loop when trying to unzip the 7th file, which is over 2GB. I confirmed this with logging and by letting the program run for several hours, confirming that the size of the unzipped file was not increasing and that stream_unzip() was iterating over the same code. I stepped into the code for stream_unzip(), but I don't have time to understand it well enough to pinpoint the issue; it uses yield in several places, and I have never used yield myself. Perhaps I'm making a mistake in how I'm using stream_unzip(). My code is below.

    def _unzip_file_using_stream_unzip(self, file_path, temp_dir):
        with open(file_path, 'rb') as f:
            # To search for duplicate file names. This is specific to my use
            # case. Theoretically, a zip file can have 2 files with the same
            # name in different folders, and my code doesn't handle that
            # scenario.
            seen_files = set()
            unzipper = stream_unzip(f)
            for file_name, file_size, file_content_gen in unzipper:
                file_name = file_name.decode('utf-8')
                # I borrowed this check from a function I wrote to unzip files
                # using the Python standard library (I'm using Python 3.11).
                # A trailing '/' identifies a folder. I'm not sure if
                # stream_unzip() works the same way, but the code is not
                # getting stuck here.
                if not file_name.endswith('/'):
                    if os.path.basename(file_name.lower()) in seen_files:
                        raise RuntimeError(
                            "Error: There are multiple files named "
                            f"'{os.path.basename(file_name)}' in {file_path}.")
                    seen_files.add(os.path.basename(file_name.lower()))
                    extracted_file_path = os.path.join(temp_dir, os.path.basename(file_name))
                    with open(extracted_file_path, 'wb') as out_file:
                        buffered_writer = io.BufferedWriter(out_file, buffer_size=65536)
                        # I proved with logging that when PROD_04182024_EXPORTS.zip
                        # is passed in, this code gets stuck in an infinite loop
                        # (in the for chunk loop). 6 of the 7 files are
                        # successfully unzipped; the 7th file is the one that
                        # causes the infinite loop. The following 3 lines execute
                        # in an infinite loop, and the value of chunk changes
                        # during each iteration. I did not confirm this, but I
                        # assume that at some point the value of chunk gets set
                        # back to the value it had the first time it entered the
                        # infinite loop.
                        for chunk in file_content_gen:
                            buffered_writer.write(chunk)
                            # Ensure all data is written to disk. This might not
                            # be necessary; I added it after I identified the
                            # infinite loop, hoping it would fix it.
                            buffered_writer.flush()
                    # This executes for the first 6 files, but not the 7th.
                    self._logger.info("Finished unzipping %s", file_name)

The size of the file that contains the unzipped data got stuck at 57,222kb, so I'm assuming the loop is infinite. The file unzips using Windows 11 in less than 5 minutes. I let the program run for at least 2 hours before I stopped it. I also tried running the code in a Docker container running Linux and in an AWS Batch process (also running Linux), with the same results. I paused the program in PyCharm and copied the call stack:

    _paginate, stream_inflate.py:315
    _run, stream_inflate.py:322
    _decompress, stream_unzip.py:146
    decrypt_none_decompress, stream_unzip.py:281 (279)
    _iter, stream_unzip.py:295
    checked_from_local_header, stream_unzip.py:304
    _unzip_file_using_stream_unzip, aws_utils.py:474
    unzip_file, aws_utils.py:432
    load_new_files, aws_utils.py:279

I noticed that _run() was also calling other methods at different points, such as _get_bit(), inflate(get_bits, get_bytes, yield_bytes), and get_bits(num_bits). I noticed that some exceptions were being thrown and caught:

        def up_to_page_size(num):
            nonlocal chunk, offset

            while num:
                if offset == len(chunk):
                    try:
                        chunk = next(it)
                    except StopIteration:
                        break

Another exception:

    def get_byte_readers(iterable):
        # Return functions to return/"replace" bytes from/to the iterable
        # - _yield_all: yields chunks as they come up (often for a "body")
        # - _get_num: returns a single bytes of a given length
        # - _return_num_unused: puts a number of "unused" bytes "back", to be retrieved by a yield/get call
        # - _return_bytes_unused: return a bytes instance "back" into the stream, to be retrieved later
        # - _get_offset_from_start: get the zero-indexed offset from the start of the stream

    chunk = b''
    offset = 0
    offset_from_start = 0
    queue = list()  # Will typically have at most 1 element, so a list is fine
    it = iter(iterable)

    def _next():
        try:
            return queue.pop(0)
        except IndexError:
            return (next_or_truncated_error(it), 0)

This is not an urgent issue for me. But I thought you would want to know. I'm happy to help debug the issue if it's important enough for you to spend time on it. Thanks,

Dwayne

DwayneDriskill avatar Apr 22 '24 15:04 DwayneDriskill

Hi @DwayneDriskill,

Thanks for the report... I think we would like to sort it, but suspect it'll be tricky without a reproducible case. Is there any chance you can create a file with data that can be shared that shows the issue?

(Also, suspect the issue is most likely to be in https://github.com/michalc/stream-inflate, since it handles the Deflate64 aspect. Maybe it could be changed to detect if it gets stuck in an infinite loop? Not sure...)

Thanks,

Michal

michalc avatar Apr 24 '24 09:04 michalc

Two other things...

The size of the file that contains the unzipped data got stuck at 57,222kb. I'm assuming the loop is infinite.

Does the input iterable definitely not progress during this? (Wondering if it's extremely poor performance, rather than an infinite loop)
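One way to check this (a hypothetical diagnostic sketch, not part of stream-unzip; the name log_progress is made up for illustration) is to wrap the input iterable so it logs how much input has been consumed:

```python
import logging

def log_progress(chunks, every=100):
    # Yield input chunks unchanged, periodically logging how many chunks
    # and bytes have been consumed, to distinguish "stuck" from "just
    # very slow"
    total = 0
    for i, chunk in enumerate(chunks, 1):
        total += len(chunk)
        if i % every == 0:
            logging.info("consumed %d input chunks, %d bytes", i, total)
        yield chunk
```

It could be used as `stream_unzip(log_progress(f))`: if the logged byte count keeps increasing while the output file barely grows, that points at very slow decompression rather than a true infinite loop.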

And this code I realise is unexpected:

with open(file_path, 'rb') as f:
    unzipper = stream_unzip(f)

This works (er, other than the issue you're encountering), because f here is an iterable of bytes. But each value of the iterable is a "line", ending with the newline character \n, which can occur at very arbitrary places in a binary file like a zip file. Testing a deflate64 file I have, this makes each chunk quite short, so it could have a negative performance impact. Not sure if it's enough to make it seem like it's gotten stuck, but it's something to rule out.
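To see the effect concretely (a minimal standalone sketch, using io.BytesIO to stand in for a real zip file), line iteration splits a binary stream at every \n byte, while fixed-size reads do not:

```python
import io

# Binary data with \n bytes at arbitrary positions, as would occur
# inside the compressed data of a zip file
data = b'ab\ncdef\n\nghij' * 3

# Iterating the file object splits on b'\n', giving many short chunks
line_chunks = list(io.BytesIO(data))
print([len(c) for c in line_chunks])   # [3, 5, 1, 7, 5, 1, 7, 5, 1, 4]

# Reading fixed-size blocks gives predictable, larger chunks
f = io.BytesIO(data)
block_chunks = list(iter(lambda: f.read(8), b''))
print([len(c) for c in block_chunks])  # [8, 8, 8, 8, 7]

# Both reconstruct the same bytes; only the chunking differs
assert b''.join(line_chunks) == b''.join(block_chunks) == data
```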

Instead, can you try this and see if you get the same behaviour?

with open(file_path, 'rb') as f:
    unzipper = stream_unzip(iter(lambda: f.read(65536), b''))

It'll make each input chunk 64k, ignoring where \n characters happen to be in the input data.

michalc avatar Apr 27 '24 13:04 michalc

Does the input iterable definitely not progress during this? (Wondering if it's extremely poor performance, rather than an infinite loop)

I am starting to suspect extremely poor performance when unzipping deflate64 - taking approximately 1000 times the amount of time in some cases...

This is from https://github.com/uktrade/stream-unzip/pull/83, which (just for investigative purposes) uses https://github.com/michalc/stream-inflate for everything, whereas in the published stream-unzip it is only used for deflate64 files. The test_infozip_zip64_with_descriptors test specifically takes much, much longer.

michalc avatar Apr 27 '24 14:04 michalc

Hi Michal,

You were right that it seems to be very poor performance. I made the change you suggested and let it run overnight. It finished in about 8.5 hours. When I unzip it with the Windows 11 built-in tool, it just takes about a minute. Thanks,

Dwayne

ddriskillcssi avatar May 05 '24 22:05 ddriskillcssi

Ah good to know... in a way at least!

Now to solve the performance problem...

michalc avatar May 06 '24 05:05 michalc

Hi @DwayneDriskill / @ddriskillcssi,

To give an update on this, have made some performance improvements to https://github.com/michalc/stream-inflate. From some rough tests, v0.0.18 increases the speed of unzipping deflate64 files by about 50% compared to v0.0.15. I know this is still pretty slow, especially for larger files. Might need some sort of more drastic re-architecture... or maybe throwing Cython at it...

Michal

michalc avatar Aug 27 '24 05:08 michalc

And some more speed improvements in v0.0.19. Actually for "easy" deflate64 files it increases speed by a factor of 50 or so compared to v0.0.15. (Easy files would be roughly those with a lot of repetition close in the file)

michalc avatar Aug 27 '24 17:08 michalc

A few more speed improvements done in stream-inflate (which is now up to 0.0.23)

michalc avatar Aug 28 '24 20:08 michalc

Some more speed improvements, now mostly powered by compiling stream-inflate with Cython. stream-inflate now up to version 0.0.29

michalc avatar Sep 18 '24 14:09 michalc

So with the latest version of stream-inflate, v0.0.34, after compressing the 6GB UK Tariff JSON file using Deflate64, it takes about 15 minutes to uncompress on my machine using stream-unzip, hovering around a rate of 5MB/s and sometimes going to 15MB/s or a bit higher.

So probably fast enough to not think it's stuck in an infinite loop, but still way slower than for regular Deflate, which goes beyond 1GB/s on my machine.
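For comparing numbers like these on your own files, a rough throughput measurement can be taken around the extraction loop (a hypothetical sketch; measure_throughput is not part of either library):

```python
import time

def measure_throughput(chunks):
    # Consume an iterable of bytes chunks (e.g. the per-file content
    # generator yielded by stream_unzip) and return the rate in MB/s
    total = 0
    start = time.monotonic()
    for chunk in chunks:
        total += len(chunk)
    elapsed = time.monotonic() - start
    return total / max(elapsed, 1e-9) / 1_000_000
```

Measuring the Deflate and Deflate64 members of the same archive this way would show whether the slowdown is specific to the Deflate64 path.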

michalc avatar Sep 22 '24 08:09 michalc

Hi Michal,

Sorry for the late reply. I'm guessing, and hoping, that you're doing this work because you need it yourself. I coded a workaround: I unzipped the file using 7-Zip and rezipped it using deflate instead of deflate64. This solved my problem and performs well, so I'm no longer using stream-inflate. Thanks,

Dwayne

DwayneDriskill avatar Sep 24 '24 16:09 DwayneDriskill

@DwayneDriskill Ah good to know!

I'm going to keep this open because I suspect there are still lots of speed improvements to be made, but not expecting replies or anything really - will close when I judge done enough / not much more can be done.

I don't exactly need this myself at the moment, but it's more to make stream-unzip be able to open "any" (within reason...) ZIP file. And maybe a bit of an exercise in optimising Python code.

michalc avatar Sep 24 '24 17:09 michalc