py7zr Performance issue when extracting from archives with a large number of files (~100,000)

Describe the bug When an archive contains a large number of files (100k+ in my tests), the library takes a very long time to extract any of its contents, regardless of their size. Extracting a single item usually takes less than a second when the archive has fewer than ~1,000 items. However, when an archive contains a large number of items (~100,000 in this report), it can take up to a minute to extract a single item, even if its compressed size is only ~4KB.

Related issue None.

To Reproduce I am using py7zr in my project on GitHub: file-repacker. In the recompression process I profiled the code:

    import io, logging, pstats, tempfile
    from pstats import SortKey
    import py7zr

    # Excerpt from the recompression loop. current_path, filename, process_pid,
    # compressed_archive, empty_file, logger and prof (presumably a cProfile.Profile)
    # are defined elsewhere in the project.
    source_archive = py7zr.SevenZipFile(f"{current_path}{filename}", mode="r")
    for archive_info in source_archive.list():
        logging.info(f":PID-{process_pid}: Archive content in [{filename}]: /{archive_info.filename}")

        prof.enable()

        # Currently py7zr doesn't support adding empty directories directly, so use a temporary directory (https://github.com/miurahr/py7zr/issues/412).
        if archive_info.is_directory:
            with tempfile.TemporaryDirectory(prefix="file-repacker-") as temp_dir:
                compressed_archive.write(temp_dir, archive_info.filename)
        elif archive_info.uncompressed == 0:
            compressed_archive.writef(empty_file, archive_info.filename)
        else:
            # Read a single member into memory, then rewind the archive for the next read.
            binary = source_archive.read([archive_info.filename])
            compressed_archive.writef(binary.get(archive_info.filename), archive_info.filename)
            source_archive.reset()

        prof.disable()
        s = io.StringIO()
        ps = pstats.Stats(prof, stream=s).sort_stats(SortKey.CUMULATIVE)
        ps.print_stats()
        print(s.getvalue())
        logger.warning(f"STATS: {s.getvalue()}")
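If prof in the excerpt above is a single cProfile.Profile created outside the loop (an assumption, since its creation is not shown), each printed report accumulates the timings of all earlier iterations. A minimal sketch of a helper that profiles one call with a fresh profiler is given below; profile_call is a hypothetical name, not part of py7zr or file-repacker.

```python
import cProfile
import io
import pstats
from pstats import SortKey


def profile_call(func, *args, **kwargs):
    """Run func once under a fresh profiler; return (result, report text).

    Hypothetical helper for illustration only.
    """
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        # Always stop profiling, even if func raises.
        profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream).sort_stats(SortKey.CUMULATIVE)
    stats.print_stats(10)  # keep only the 10 most expensive entries
    return result, stream.getvalue()
```

In the loop above, the read could then be wrapped as profile_call(source_archive.read, [archive_info.filename]), so that each report covers exactly one extraction.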

Expected behavior Items within the archive should be extracted immediately (< 1 second).

Environment (please complete the following information):

OS: Kubuntu 22.10
Python 3.10.7
py7zr version: 0.20.2

Test data (please attach in the report): I have a 1.26MB .7z archive that contains ~111k files, which can be provided on request.

Additional context cProfile output from the above code extracting an item from the archive containing 111k items:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001   95.527   95.527 ./file-repacker/venv/lib/python3.10/site-packages/py7zr/py7zr.py:961(read)
        1   79.462   79.462   95.526   95.526 ./file-repacker/venv/lib/python3.10/site-packages/py7zr/py7zr.py:525(_extract)
   118622    0.606    0.000   14.403    0.000 ./file-repacker/venv/lib/python3.10/site-packages/py7zr/helpers.py:464(get_sanitized_output_path)

Note the process time taken up by _extract for a tiny XML file within the archive.

Note: I don't get the same issue when processing the same amount of files in a .zip file containing the same items and using the zipfile Python library.

Jan 30 '23 16:01 noisysoil

> Note: I don't get the same issue when processing the same amount of files in a .zip file containing the same items and using the zipfile Python library.

7-zip uses a solid compression archive format. This means that when you want to extract the last file from a single-block solid archive containing 100,000 files, the extraction function has to decompress the preceding 99,999 files before it can output that one file. A zip archive is a concatenation of individually compressed files, so the extraction function can seek to the file's position in the archive and extract just that file.

see https://en.wikipedia.org/wiki/Solid_compression
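The difference between the two layouts can be sketched with the standard library alone. The example below uses zlib as a stand-in for the real 7z and zip codecs, and the member names and contents are made up; it only illustrates why reaching a late member of a solid stream means decoding everything before it.

```python
import zlib

# Hypothetical archive members; names and contents are made up for illustration.
members = [(f"file_{i}.txt", f"contents of file {i}\n".encode() * 10) for i in range(5)]

# Solid layout: one compressed stream plus an index of (name, offset, size).
solid_index = []
offset = 0
for name, payload in members:
    solid_index.append((name, offset, len(payload)))
    offset += len(payload)
solid_stream = zlib.compress(b"".join(payload for _, payload in members))

# Zip-like layout: every member is compressed on its own.
per_file = {name: zlib.compress(payload) for name, payload in members}


def read_solid(wanted):
    """Return (data, bytes_decompressed) for one member of the solid stream."""
    for name, start, size in solid_index:
        if name == wanted:
            # Everything before the member has to be decoded to reach it.
            decoded = zlib.decompressobj().decompress(solid_stream, start + size)
            return decoded[start:start + size], len(decoded)
    raise KeyError(wanted)


def read_per_file(wanted):
    """Return (data, bytes_decompressed) for one independently compressed member."""
    decoded = zlib.decompress(per_file[wanted])
    return decoded, len(decoded)
```

Calling read_solid("file_4.txt") decodes the whole stream to return the last member, while read_per_file("file_4.txt") decodes only that member's bytes.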

When the archive has a multi-block solid structure, py7zr uses a multi-process strategy for concurrent extraction.

Could you check the structure of your archive?

For example, tests/data/mblock_1.7z is a multi-block archive; see the output of the command 7z l mblock_1.7z

You will see:

Path = mblock_1.7z
Type = 7z
Physical Size = 631690
Headers Size = 2305
Method = LZMA2:1536k BCJ
Solid = +
Blocks = 3

Here Blocks = 3, so py7zr will extract it using 3 processes.
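To check these properties from a script, the "Key = Value" header lines of the listing can be parsed directly. The sketch below works on the text shown above rather than invoking 7z; parse_7z_listing_header is a hypothetical helper name, and it assumes the header keeps the layout printed in this thread.

```python
def parse_7z_listing_header(text):
    """Collect the 'Key = Value' lines of a `7z l` listing into a dict."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:  # skip lines that are not 'Key = Value' pairs
            props[key.strip()] = value.strip()
    return props


# Header copied from the listing of tests/data/mblock_1.7z quoted above.
sample = """\
Path = mblock_1.7z
Type = 7z
Physical Size = 631690
Headers Size = 2305
Method = LZMA2:1536k BCJ
Solid = +
Blocks = 3
"""

props = parse_7z_listing_header(sample)
blocks = int(props["Blocks"])
is_solid = props["Solid"] == "+"
print(f"solid={is_solid}, blocks={blocks}")
```

A solid archive with Blocks = 1 is the case where every member sits in one compressed stream.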

Jan 31 '23 02:01 miurahr

Thanks for the reply. However, this also happens in the provided example for the first file in the list returned by the iterator, as well as for any other file.

Also, if I use command-line 7zip to extract a single file from the archive, I don't get the same issue; it extracts any file contained in the archive without delay.

I can provide the 7z file to test.

Feb 06 '23 15:02 noisysoil

When I decompress a copy-mode archive (no compression, but encrypted with AES-256), the problem still happens. Other software (such as Bandizip) always decompresses the same archive much faster (120MB/s versus 2MB/s), even with the same CPU usage.

Feb 09 '23 05:02 IceTiki

+1, confirming slowness with the same use-case, and also that using 7z from the command line does not have this problem.

Nov 18 '23 02:11 bycn

I did some profiling and found that py7zr was spending most of its time on inefficient Python operations such as list searches, resulting in O(n^2) complexity in the number of files. I've fixed this in #555, which gave me a massive speedup on large archives.
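The kind of pattern described here can be illustrated with a generic sketch. This is not the code from py7zr or from #555; the function names are hypothetical and only show how a membership test against a list turns a single pass over n entries into O(n^2) work, and how a set avoids it.

```python
def select_with_list(all_names, targets):
    """Quadratic: `name in targets` scans the whole list for every entry."""
    return [name for name in all_names if name in targets]


def select_with_set(all_names, targets):
    """Linear on average: set membership is a hash lookup."""
    target_set = set(targets)
    return [name for name in all_names if name in target_set]


if __name__ == "__main__":
    import timeit

    # Made-up file names; 20,000 entries keeps the slow variant tolerable.
    names = [f"dir/file_{i}.xml" for i in range(20_000)]
    wanted = list(reversed(names))

    slow = timeit.timeit(lambda: select_with_list(names, wanted), number=1)
    fast = timeit.timeit(lambda: select_with_set(names, wanted), number=1)
    print(f"list lookup: {slow:.3f}s, set lookup: {fast:.3f}s")
```

Both functions return the same selection; only the cost of each membership test differs, which is why the gap widens as the number of files grows.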

Jan 26 '24 09:01 vladfi1

v0.21.0 released with the fix.

Apr 02 '24 03:04 miurahr
