py7zr
Performance issue when extracting from archives with a large number of files (~100,000)
Describe the bug When an archive contains a large number of files (100k+ in my tests), the library takes a very long time to extract any of its contents, regardless of their size. Extracting a single item usually takes less than a second when the archive has fewer than ~1000 items. However, when an archive contains a very large number of items (~100,000 in this report), extracting a single item can take up to a minute, even if its compressed size is only ~4KB.
Related issue None.
To Reproduce I am using py7zr in my project on GitHub: file-repacker. In the recompression process I profiled the code:
source_archive = py7zr.SevenZipFile(f"{current_path}{filename}", mode="r")
for archive_info in source_archive.list():
    logging.info(f":PID-{process_pid}: Archive content in [{filename}]: /{archive_info.filename}")
    prof.enable()
    # Currently py7zr doesn't support adding empty directories directly, so use tmpfs (https://github.com/miurahr/py7zr/issues/412).
    if archive_info.is_directory is True:
        with tempfile.TemporaryDirectory(prefix="file-repacker-") as temp_file:
            compressed_archive.write(temp_file, f"{archive_info.filename}")
    elif archive_info.uncompressed == 0:
        compressed_archive.writef(empty_file, f"{archive_info.filename}")
    else:
        # Read a single member, copy it into the new archive, then rewind the source.
        binary = source_archive.read([archive_info.filename])
        compressed_archive.writef(binary.get(archive_info.filename), archive_info.filename)
        source_archive.reset()
    prof.disable()
    # Dump the profile, sorted by cumulative time.
    s = io.StringIO()
    sortby = SortKey.CUMULATIVE
    ps = pstats.Stats(prof, stream=s).sort_stats(sortby)
    ps.print_stats()
    print(s.getvalue())
    logger.warning(f"STATS: {s.getvalue()}")
Expected behavior Items within the archive should be extracted immediately (< 1 second).
Environment (please complete the following information):
- OS: Kubuntu 22.10
- Python 3.10.7
- py7zr version: 0.20.2
Test data(please attach in the report): I have a 1.26MB .7z archive that contains ~111k files which can be provided on request.
Additional context cProfile output from the above code extracting an item from the archive containing 111k items:
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.001 0.001 95.527 95.527 ./file-repacker/venv/lib/python3.10/site-packages/py7zr/py7zr.py:961(read)
1 79.462 79.462 95.526 95.526 ./file-repacker/venv/lib/python3.10/site-packages/py7zr/py7zr.py:525(_extract)
118622 0.606 0.000 14.403 0.000 ./file-repacker/venv/lib/python3.10/site-packages/py7zr/helpers.py:464(get_sanitized_output_path)
Note the process time taken up by _extract for a tiny XML file within the archive.
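For anyone who wants to reproduce this kind of measurement, the profiling steps in the snippet above can be wrapped in a small standard-library helper. This is only a sketch: `profile_call` is a hypothetical name, not part of py7zr or file-repacker, and it uses nothing beyond `cProfile` and `pstats`.

```python
import cProfile
import io
import pstats
from pstats import SortKey


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (result, report sorted by cumulative time)."""
    prof = cProfile.Profile()
    prof.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        # Always stop profiling, even if the profiled call raises.
        prof.disable()
    stream = io.StringIO()
    pstats.Stats(prof, stream=stream).sort_stats(SortKey.CUMULATIVE).print_stats()
    return result, stream.getvalue()
```

With an open archive, a call along the lines of `profile_call(source_archive.read, [archive_info.filename])` would presumably produce a report in the same format as the cProfile output above.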
Note: I don't get the same issue when processing the same number of files in a .zip file containing the same items and using the zipfile Python library.
7-zip uses a solid compression archive format. That means that when you want to extract the last file from a single-block solid archive of 100,000 files, the 7z format requires the extraction function to read through the 99,999 preceding files before it can output that one file. ZIP, by contrast, concatenates individually compressed files, so the extract function can seek to the file's position in the archive and extract just that file.
see https://en.wikipedia.org/wiki/Solid_compression
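The difference between the two layouts can be sketched with the standard library alone. This is an illustration under loud assumptions: it uses `zlib` rather than LZMA, the data is synthetic, and the function names are made up, so it is not how py7zr or the 7z format is actually implemented. It only shows why a member at the end of a solid stream cannot be reached without decompressing what comes before it.

```python
import zlib

# Synthetic stand-in data; names and sizes are made up for illustration.
files = {f"file_{i}.txt": (f"payload {i}\n" * 50).encode() for i in range(1000)}


def pack_per_file(members):
    """ZIP-like layout: compress each member on its own and record where it sits."""
    blob, index = bytearray(), {}
    for name, data in members.items():
        compressed = zlib.compress(data)
        index[name] = (len(blob), len(compressed))
        blob += compressed
    return bytes(blob), index


def read_per_file(blob, index, name):
    """Seek straight to one member and decompress only its bytes."""
    offset, length = index[name]
    return zlib.decompress(blob[offset:offset + length])


def pack_solid(members):
    """Solid-like layout: concatenate all members, then compress them as one stream."""
    raw, index = bytearray(), {}
    for name, data in members.items():
        index[name] = (len(raw), len(data))
        raw += data
    return zlib.compress(bytes(raw)), index


def read_solid(blob, index, name, chunk_size=4096):
    """Decompress from the start of the stream until the member is fully available."""
    offset, size = index[name]
    decompressor = zlib.decompressobj()
    out = bytearray()
    for pos in range(0, len(blob), chunk_size):
        out += decompressor.decompress(blob[pos:pos + chunk_size])
        if len(out) >= offset + size:
            break
    out += decompressor.flush()
    # Also report how much data had to be decompressed to reach the member.
    return bytes(out[offset:offset + size]), len(out)


last = "file_999.txt"
zip_blob, zip_index = pack_per_file(files)
solid_blob, solid_index = pack_solid(files)

per_file_data = read_per_file(zip_blob, zip_index, last)
solid_data, decompressed = read_solid(solid_blob, solid_index, last)

print(f"per-file read decompressed {len(per_file_data)} bytes")
print(f"solid read decompressed {decompressed} bytes to reach the same member")
```

Both reads return the same bytes; the solid read simply has to decompress everything that precedes the requested member first.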
When the archive has a multi-block solid structure, py7zr uses a multi-process strategy to extract the blocks concurrently.
Could you check your archive?
For example, tests/data/mblock_1.7z is a multi-block archive. In the output of the command 7z l mblock_1.7z you will see:
Path = mblock_1.7z
Type = 7z
Physical Size = 631690
Headers Size = 2305
Method = LZMA2:1536k BCJ
Solid = +
Blocks = 3
That shows Blocks = 3, so py7zr will extract it using 3 processes.
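To run the same check on another archive, the `Key = Value` lines of the listing can be read with a few lines of Python. This is a minimal sketch that assumes the header has the same shape as the output quoted above; `parse_7z_listing_header` is a hypothetical helper, and the listing text is pasted in rather than obtained by running `7z`.

```python
def parse_7z_listing_header(text):
    """Collect 'Key = Value' lines from `7z l` output into a dict of strings."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" = ")
        if sep:
            props[key.strip()] = value.strip()
    return props


# Header copied from the `7z l mblock_1.7z` output quoted above.
listing = """\
Path = mblock_1.7z
Type = 7z
Physical Size = 631690
Headers Size = 2305
Method = LZMA2:1536k BCJ
Solid = +
Blocks = 3
"""

props = parse_7z_listing_header(listing)
is_solid = props.get("Solid") == "+"
blocks = int(props.get("Blocks", "0"))
print(f"solid={is_solid}, blocks={blocks}")  # prints "solid=True, blocks=3"
```

A solid archive that reports a single block would, by the explanation above, have to be decompressed sequentially from its start.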
Thanks for the reply. However, in the provided example this also happens for the first file in the list returned by the iterator, as well as for any other file.
Also, if I use command-line 7zip to extract a single-file item in the archive I don't get the same issue, it will extract any file contained in the archive without delay.
I can provide the 7z file to test.
When I decompress an archive created in copy mode (no compression, but encrypted with AES-256), the problem still happens.
(Other software, such as Bandizip, always decompresses the same archive much faster, at about 120MB/s versus 2MB/s, even at the same CPU usage.)
+1, confirming slowness with the same use-case, and also that using 7z from the command line does not have this problem.
I did some profiling and found that py7zr was spending most of its time on inefficient Python operations such as list searches, resulting in O(n^2) complexity in the number of files. I've fixed this in #555, which gave me a massive speedup on large archives.
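The kind of inefficiency described here can be shown with a generic sketch. This is not the py7zr code and not the change made in #555; the function names and data are hypothetical. It only contrasts a membership test against a list, which scans the list on every check, with the same test against a set.

```python
import time


def match_with_list(all_names, targets):
    # `targets` is a list, so every `in` check scans it: O(n * m) overall.
    return [name for name in all_names if name in targets]


def match_with_set(all_names, targets):
    # Build the set once; each `in` check is then O(1) on average.
    target_set = set(targets)
    return [name for name in all_names if name in target_set]


# Synthetic file names, kept small so the quadratic version still finishes quickly.
names = [f"dir/file_{i}.xml" for i in range(3_000)]

start = time.perf_counter()
slow = match_with_list(names, names)
list_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = match_with_set(names, names)
set_seconds = time.perf_counter() - start

print(f"list lookup: {list_seconds:.4f}s, set lookup: {set_seconds:.4f}s")
```

Both functions return the same result. The gap between them widens quickly as the number of names grows, which is consistent with the slowdown only becoming noticeable at around 100,000 files.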
v0.21.0 released with the fix.