py7zr
refactor: drop read/readall method, and add feature to stream data into user object
Pull request type
select from below
- Feature enhancement
Which ticket is resolved?
What does this PR change?
- refactor: introduce py7zr.io.py
- fix: drop the read/readall methods
- feat: extend extract/extractall to accept a WriterFactory factory callback
- feat: WriterFactory creates a Py7zIO object so users can handle extracted content themselves
Other information
Users are misled by the names "read" and "readall", which they believe should return streams the way zipfile does. That is why I am removing the misleading method names.
Reference #575 and #579
@jaboja @starkapandan
You can define your own class of the py7zr.io.Py7zIO type, with open, write, read, seek, close, and size methods. When py7zr calls write(bytes), you can stream the data any way you like instead of keeping it in memory.
You should also define a py7zr.io.WriterFactory type of class that returns the object described above when its create method is called.
The create method is called with the argument fname, which is the archive member name.
Then you are ready to use the new feature: call the readall method with the factory=<your_writer_factory_object> argument.
Please check test_stream.py for an example implementation.
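As a sketch of the flow described above, here is a standalone example that streams each member straight to disk. The driver loop at the bottom is only a stand-in for py7zr's internal extraction loop (it is not the library itself), and the names FileStreamIO and FileStreamFactory are hypothetical:

```python
import os
import tempfile


class FileStreamIO:
    """Py7zIO-style object: streams written chunks straight to a file on disk."""

    def __init__(self, fname: str, outdir: str):
        self.path = os.path.join(outdir, fname)
        self._fp = open(self.path, "wb")
        self._size = 0

    def write(self, s):
        self._fp.write(s)
        self._size += len(s)

    def read(self, length: int = 0) -> bytes:
        return b""

    def seek(self, offset: int, whence: int = 0) -> int:
        return 0

    def close(self):
        self._fp.close()

    def size(self) -> int:
        return self._size


class FileStreamFactory:
    """WriterFactory-style object: create() returns one FileStreamIO per archive member."""

    def __init__(self, outdir: str):
        self.outdir = outdir
        self.products = {}

    def create(self, fname: str) -> FileStreamIO:
        product = FileStreamIO(fname, self.outdir)
        self.products[fname] = product
        return product


# Stand-in for py7zr: it would call create() once per member, then write() per chunk.
outdir = tempfile.mkdtemp()
factory = FileStreamFactory(outdir)
for name, chunks in {"a.txt": [b"hello ", b"world"], "b.txt": [b"data"]}.items():
    io_obj = factory.create(name)
    for chunk in chunks:
        io_obj.write(chunk)
    io_obj.close()

print(open(os.path.join(outdir, "a.txt"), "rb").read())  # → b'hello world'
```

With the real library, the same factory object would be passed as `factory=` to readall instead of being driven by the loop above.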
@Zoynels @itanfeng @starkapandan @jaboja @aquirin Feedbacks?
Does this also solve the memory error described here?
- https://github.com/miurahr/py7zr/issues/579#issue-2190900415
- https://github.com/miurahr/py7zr/issues/575#issue-2165043877
I have tried the code in the first link, but I still have a memory error.
File "py7zr.git/py7zr/io.py", line 123, in write
    return self._buf.write(s)
File "py7zr.git/py7zr/io.py", line 52, in write
    return self._buffer.write(s)
MemoryError
Setup:
- Python 3.10.0
- pip3 install pybcj inflate64 pyppmd pyzstd pycryptodomex psutil
- used the topic/miurahr/memory-read-extend-callback branch
- ran the code in the first link
- note: the URL does not work anymore; you can use
  wget https://dumps.wikimedia.org/plwiki/20241001/plwiki-20241001-pages-meta-history1.xml-p1p6646.7z
By the way, I get this error while importing py7zr:
ModuleNotFoundError: No module named 'py7zr.version'
> Does this also solve the memory error described here? [...] I have tried the code in the first link, but I still have a memory error.
You are using the backward-compatible way, which has not changed.
Please try the new approach: provide your own class that implements the WriterFactory interface to the read method.
OK, I understand that keeping backward compatibility makes users believe they can still use the old way. I would like to remove the old way.
@Zoynels @itanfeng @starkapandan @jaboja @aquirin
Please read the test cases.
- test_stream.py demonstrates how to implement streaming extraction and compression.
- test_stream_nwstorage.py demonstrates how to implement streaming extraction onto network storage.
#579 requests streaming the decompressed data and printing 10 bytes. Here is an example that prints the first 10 bytes of each entry's contents while streaming.
from typing import Union

from py7zr import SevenZipFile
from py7zr.io import Py7zIO, WriterFactory


class MyIO(Py7zIO):
    """The class that accepts archive contents."""

    def __init__(self, fname):
        self.fname = fname
        self.done = False

    def write(self, s: Union[bytes, bytearray]):
        """Print the first 10 bytes of the entry."""
        if not self.done:
            print(f"{self.fname}: {bytes(s[:10])}...\n")
            self.done = True

    def read(self, length: int = 0) -> bytes:
        return b""

    def seek(self, offset: int, whence: int = 0) -> int:
        return 0

    def size(self) -> int:
        return 0


class MyFactory(WriterFactory):
    def create(self, fname) -> Py7zIO:
        return MyIO(fname)


# main routine
with SevenZipFile("archive.7z", "r") as zip:
    zip.readall(MyFactory())
@miurahr I have tested your updated PR using your new streaming technique, but unfortunately I still see huge memory consumption (2 GB) for multipart 7z files containing a lot of files. Since I am not writing/reading any buffers, I expect no memory leaks.
$ /usr/bin/time -v python code.py
Maximum resident set size (kbytes): 1941548
class MyFactory(WriterFactory):
    def __init__(self):
        self.count = 0

    def create(self, filename: str) -> Py7zIO:
        self.count += 1

zip_path = ".../multivolume/test.7z"  # See https://github.com/miurahr/py7zr/issues/575 to get this file
with multivolumefile.open(zip_path, mode='rb') as multizip_handler:
    with SevenZipFile(multizip_handler, 'r') as zip_handler:
        mf = MyFactory()
        zip_handler.readall(mf)
        print("Count:", mf.count)  # should output 8000
What is your Py7zIO class?
Could you share a Python profiling result? https://docs.python.org/3/library/profile.html
FYI:
There is a "debug" dependency group defined at https://github.com/miurahr/py7zr/blob/master/pyproject.toml#L90C1-L93C2
debug = [
    "pytest",
    "pytest-leaks",
    "pytest-profiling",
]
When you run tox -e mprof, you can see the pytest profiling result. Here is the relevant definition in pyproject.toml:
[testenv:mprof]
extras = test, debug
commands =
    mprof run --multiprocess pytest
    mprof plot --output memory-profile.png
deps =
    memory_profiler
    matplotlib
> What is your Py7zIO class?
from py7zr.io import WriterFactory, Py7zIO
> Could you share a Python profiling result? https://docs.python.org/3/library/profile.html
Attached. test_multi_stats.zip
I do not have tox, let me know if the profiling is enough for you
> What is your Py7zIO class?
>
> from py7zr.io import WriterFactory, Py7zIO
It is nonsense.
Your code does not return an object implementing the Py7zIO interface.
    def create(self, filename: str) -> Py7zIO:
        self.count += 1
As you know from object-oriented programming, you must provide a complete implementation class of the Py7zIO interface.
You should have a MyPy7zIO or similar class definition; without one, the code is invalid.
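A minimal corrected version might look like this. It is a standalone sketch: NullIO is a hypothetical stand-in for a real Py7zIO subclass, and the loop at the bottom simulates py7zr calling the factory once per archive member:

```python
class NullIO:
    """Minimal Py7zIO-style implementation that discards all data."""

    def __init__(self, fname: str):
        self.fname = fname

    def write(self, s):
        pass  # discard; a real implementation would stream the chunk somewhere

    def read(self, length: int = 0) -> bytes:
        return b""

    def seek(self, offset: int, whence: int = 0) -> int:
        return 0

    def size(self) -> int:
        return 0


class CountingFactory:
    """WriterFactory-style class whose create() actually returns a Py7zIO-like object."""

    def __init__(self):
        self.count = 0

    def create(self, filename: str) -> NullIO:
        self.count += 1
        return NullIO(filename)  # the return statement that was missing above


# Simulate py7zr creating one writer per archive member:
factory = CountingFactory()
for name in ["a.bin", "b.bin", "c.bin"]:
    writer = factory.create(name)
    writer.write(b"\x00" * 16)
print(factory.count)  # → 3
```

In real use the classes would subclass py7zr.io.Py7zIO and py7zr.io.WriterFactory, but the key point is the same: create() must return the writer object.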
Your stats show a great many BytesIO read accesses.
I believe I removed the BytesIO object from the py7zr library's readall method in this topic branch; it is still used in the test code.
134973 0.057 0.000 0.057 0.000 {method 'read' of '_io.BytesIO' objects}
There are many other places that use BytesIO objects.
$ grep BytesIO py7zr/*
py7zr/archiveinfo.py:from io import BytesIO
py7zr/archiveinfo.py: buffer = io.BytesIO(fp.read(size))
py7zr/archiveinfo.py: def retrieve(cls, fp: BinaryIO, buffer: BytesIO, start_pos: int, password=None):
py7zr/archiveinfo.py: def _read(self, fp: BinaryIO, buffer: BytesIO, start_pos: int, password) -> None:
py7zr/archiveinfo.py: buffer2 = io.BytesIO()
py7zr/archiveinfo.py: buf = io.BytesIO()
py7zr/archiveinfo.py: buf = io.BytesIO()
py7zr/io.py:class Py7zBytesIO(Py7zIO):
py7zr/io.py: self._buffer = io.BytesIO()
py7zr/io.py: """request writer which compatible with BinaryIO, such as BytesIO object."""
py7zr/io.py:class BytesIOFactory(WriterFactory):
py7zr/io.py: self.products: dict[str, Py7zBytesIO] = {}
py7zr/io.py: product = Py7zBytesIO(filename, self.limit)
py7zr/py7zr.py: buffer = io.BytesIO(self.fp.read(self.sig_header.nextheadersize))
py7zr/py7zr.py: if isinstance(bio, io.BytesIO):
py7zr/py7zr.py: self._writef(io.BytesIO(data.encode("UTF-8")), arcname)
py7zr/py7zr.py: self._writef(io.BytesIO(bytes(data)), arcname)
py7zr/py7zr.py: with io.BytesIO() as ofp:
py7zr/py7zr.py: with io.BytesIO() as omfp:
py7zr/py7zr.py: fd = io.BytesIO(tgt)
I would like to reduce this.
@aquirin
It seems to be because py7zr tries to use as much memory as possible to speed up extraction. Please see https://github.com/miurahr/py7zr/blob/master/py7zr/properties.py#L87
def get_memory_limit():
    """
    Get memory limit for allocating decompression chunk buffer.

    :return: allowed chunk size in bytes.
    """
    default_limit = int(128e6)
    if sys.platform.startswith("linux") or sys.platform.startswith("darwin") or sys.platform.startswith("openbsd"):
        import resource

        import psutil

        try:
            soft, _ = resource.getrlimit(resource.RLIMIT_DATA)
            if soft == -1:
                avmem = psutil.virtual_memory().available
                return min(default_limit, (avmem - int(256e6)) >> 2)
            else:
                return min(default_limit, (soft - int(256e6)) >> 2)
        except AttributeError:
            pass
    # fall back to default
    return default_limit
If you want to reduce memory usage, please limit the process memory using an OS feature.
Here is test code that limits the memory usage: https://github.com/miurahr/py7zr/blob/master/tests/test_misc.py#L163
def test_extract_high_compression_rate(tmp_path):
    gen = Generator()
    with py7zr.SevenZipFile(tmp_path.joinpath("target.7z"), "w") as source:
        source.writef(gen, "source")
    limit = int(512e6)  # 0.5GB
    with limit_memory(limit):
        with py7zr.SevenZipFile(tmp_path.joinpath("target.7z"), "r") as target:
            target.extractall(path=tmp_path)


@contextmanager
def limit_memory(maxsize: int):
    """
    Context manager to limit memory. Raises MemoryError when the limit is exceeded.

    :param maxsize: maximum size of the memory resource to allow
    :raises MemoryError: when code under the context reaches the limit.
    """
    if sys.platform.startswith("win") or sys.platform.startswith("cygwin"):
        yield
    else:
        import resource

        import psutil  # type: ignore

        soft, hard = resource.getrlimit(resource.RLIMIT_AS)
        soft_new = psutil.Process(os.getpid()).memory_info().rss + maxsize
        if soft == -1 or soft_new < soft:
            resource.setrlimit(resource.RLIMIT_AS, (soft_new, hard))
        try:
            yield
        finally:
            resource.setrlimit(resource.RLIMIT_AS, (soft, hard))
The usage of 2GB is not a bug; it is by design.
OK, thanks for the insights. But would it be possible to add an API parameter to set the memory limit? Relying on an external context manager looks convoluted to me, especially given that you already provide a direct function to get the limit.
> It is nonsense. Your code does not return an object implementing the Py7zIO interface.
Sorry, it is not nonsense; it was just a leftover from your code sample that I used and forgot to clean up. From the point of view of execution, it did not prevent the code from running to the end, and I guess it is not related to the memory limit, as your last message suggested. Thanks for the investigation!
Please remember that the 7-zip format is designed for "solid compression".
https://py7zr.readthedocs.io/en/latest/archive_format.html
https://en.wikipedia.org/wiki/Solid_compression
@starkapandan your proposed syntax basically allows users random access to file entries, which the format does not allow for a logical reason.
The core issue is NOT whether the library offers syntax similar to the zipfile module, but the difference between the formats.
You are welcome to provide a working sample implementation of the proposal, but I am not sure anyone can break a logical limitation of the design.
Zip format
zipfile handles a non-solid format, so it can open the zip file, seek to the start of a member's compressed data, and return it to the user. Please look at the graphical explanation on Wikipedia.
7zip format
The 7zip format is a solid format, so py7zr needs to decompress preceding content to reach a target file entry; there is NO known position of the compressed data for a given file entry. If the library allowed random access to archive entries, it would have to read all the uncompressed data into memory or onto disk and then return it. The alternative is to ask the user for a callback object and write data into that object as the library visits the archive data.
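The difference can be illustrated with a plain zlib stream (an analogy for a solid 7z block, not py7zr code): when several files are compressed as one stream, reaching the last file requires decompressing everything before it.

```python
import zlib

# Compress three "files" as one solid stream.
files = {"a.txt": b"first file", "b.txt": b"second file", "c.txt": b"third file"}
solid = zlib.compress(b"".join(files.values()))

# There is no stored offset for c.txt inside `solid`; to reach it we must
# decompress the whole stream and then slice by the known uncompressed sizes.
data = zlib.decompress(solid)
offset = len(files["a.txt"]) + len(files["b.txt"])
print(data[offset:])  # → b'third file'
```

A non-solid format like zip instead records each member's compressed data at a known offset, so the reader can seek directly to it.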
The read and readall method names lead users in the wrong direction, so I would like to remove the read/readall methods.
Here is the updated manual for the related feature:
https://py7zr--620.org.readthedocs.build/en/620/advanced.html
https://py7zr--620.org.readthedocs.build/en/620/api.html#py7zr.SevenZipFile.extract
Please read the description and the example code.
We at RomM have forked py7zr specifically to use this in-progress PR, as we needed to stop getting OOM issues during decompression.
I'm commenting here in case the implementation we needed is relevant for the refactor you're working on. The changes are at https://github.com/rommapp/romm/pull/1282. Something that was very useful for us, to keep a similar approach to other decompression libraries, was the CallbackIOFactory we added: a way to provide callables to run during write and read, so that full responsibility for what needs to be done stays with the caller instead of moving that logic into the factory.
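The idea can be sketched as follows (hypothetical standalone code, not RomM's actual CallbackIOFactory): the factory holds user-supplied callables, and the IO object forwards each written chunk to them, so the calling code keeps control of what happens to the data.

```python
from typing import Callable


class CallbackIO:
    """Py7zIO-style object that forwards each written chunk to a user callable."""

    def __init__(self, fname: str, on_write: Callable[[str, bytes], None]):
        self.fname = fname
        self._on_write = on_write

    def write(self, s):
        self._on_write(self.fname, bytes(s))

    def read(self, length: int = 0) -> bytes:
        return b""

    def seek(self, offset: int, whence: int = 0) -> int:
        return 0

    def size(self) -> int:
        return 0


class CallbackIOFactory:
    """WriterFactory-style class parameterized by user callbacks."""

    def __init__(self, on_write: Callable[[str, bytes], None]):
        self._on_write = on_write

    def create(self, fname: str) -> CallbackIO:
        return CallbackIO(fname, self._on_write)


# Usage: record chunk sizes without ever buffering file contents.
seen = []
factory = CallbackIOFactory(on_write=lambda name, chunk: seen.append((name, len(chunk))))
writer = factory.create("rom.bin")
writer.write(b"\x00" * 1024)  # stand-in for py7zr streaming a chunk
print(seen)  # → [('rom.bin', 1024)]
```

The design keeps the factory generic: the same factory class serves any per-chunk behavior because the behavior lives in the callables passed by the caller.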
I am confident given your testimony. When I have solved all the CI errors, I will merge this.