py7zr icon indicating copy to clipboard operation
py7zr copied to clipboard

refactor: drop read/readall method, and add feature to stream data into user object

Open miurahr opened this issue 1 year ago • 19 comments

Pull request type

select from below

  • Feature enhancement

Which ticket is resolved?

What does this PR change?

  • refactor: introduce py7zr.io.py
  • fix: Drop The read/readall methods
  • feat: Extend extract/extactall accept factory callback WriterFactory
  • feat: WriterFactory should create Py7zIO object and handle extracted content by user own.

Other information

User go wrong with the name "read" and "readall" which they belive it should return stream like zipfile. It is why I remove wrong method names.

Reference #575 and #579

miurahr avatar Oct 11 '24 23:10 miurahr

@jaboja @starkapandan You can define your class which is a py7zr.io.Py7zIO type object, that has open, write, read, seek, close, and size methods. When py7zr calls write(bytes), you can stream it your favorite way instead of putting it into memory.

You should define the py7zr.io.WriterFactory type of class, that returns the object I described above when called the create method. The create method is called with the argument frame which is an archive name.

Then you are ready to use a new feature. You call the readall method with the factory=<your_writer_factory_object> argument.

miurahr avatar Oct 11 '24 23:10 miurahr

Please check test_stream.py for an example implementation.

miurahr avatar Oct 13 '24 00:10 miurahr

@Zoynels @itanfeng @starkapandan @jaboja @aquirin Feedbacks?

miurahr avatar Oct 13 '24 03:10 miurahr

Does this also solves the memory error described here?

  • https://github.com/miurahr/py7zr/issues/579#issue-2190900415
  • https://github.com/miurahr/py7zr/issues/575#issue-2165043877

I have tried the code in the first link, but I still have a memory error.

File "py7zr.git/py7zr/io.py", line 123, in write return self._buf.write(s) File "py7zr.git/py7zr/io.py", line 52, in write return self._buffer.write(s) MemoryError

Setup:

  • Python 3.10.0
  • pip3 install pybcj inflate64 pyppmd pyzstd pycryptodomex psutil
  • imported the topic/miurahr/memory-read-extend-callback branch
  • run the code in the first link
  • note the URL does not work anymore, you can take wget https://dumps.wikimedia.org/plwiki/20241001/plwiki-20241001-pages-meta-history1.xml-p1p6646.7z

By the way I have this error while importing py7zr

ModuleNotFoundError: No module named 'py7zr.version'

aquirin avatar Oct 13 '24 14:10 aquirin

Does this also solves the memory error described here?

* [read/readall dumps the decompressed files to memory, instead of streaming them #579 (comment)](https://github.com/miurahr/py7zr/issues/579#issue-2190900415)

* [Memory leaks when uncompressing multi-volume archives #575 (comment)](https://github.com/miurahr/py7zr/issues/575#issue-2165043877)

I have tried the code in the first link, but I still have a memory error.

You are trying in backward compatible way which is not changed anyway. Please try the solution which you should provide your own class which implement WriterFactory interface to read method.

Ok, I understand that keeping backward compatibility make users misunderstand that user seems to be able to use old way. I would like to remove old way.

miurahr avatar Oct 13 '24 23:10 miurahr

@Zoynels @itanfeng @starkapandan @jaboja @aquirin

Please read the test cases.

  • The test_stream.py demonstrates how to implement the streaming extraction and compression.
  • The test_stream_nwstorage.py demonstrates how to implement the streaming extraction onto network storage.

miurahr avatar Oct 14 '24 03:10 miurahr

#579 requests the streaming of decompressed data and print 10 bytes, here it is.

Here is the example to print first 10 bytes of contents in streaming.

# main routine
with SevenZipFile('archive.7z', 'r') as zip:
    zip.readall(MyFactory())        

# the class accept archive contents

class MyFactory(WriterFactory):
    def create(self, fname) -> Py7zIO:
        return  MyIO(fname)

class MyIO(Py7zIO):
    def __init__(self, fname):
        self.fname = fname
        self.done = False

    def write(self, s: [bytes, bytearray]):
        """print first 10 bytes"""
        if not self.done:
            print(f'{self.fname}: {myio.read(10)}...\n')    
            self.done = True

    def read(self, length: int = 0) -> bytes:
         return b''

    def seek(self, offset: int, whence: int = 0) -> int:
        return 0

    def size(self) -> int:
        return 0

miurahr avatar Oct 14 '24 04:10 miurahr

@miurahr I have tested your updated PR using your new streaming technique, but unfortunately I am still reporting a huge memory consumption (2Gb) for multipart 7z files containing a lot of files. As I am not writing/reading any buffers, I expect no memory leaks.

$ /usr/bin/time -v python code.py
        Maximum resident set size (kbytes): 1941548
class MyFactory(WriterFactory):
    def __init__(self):
        self.count = 0

    def create(self, filename: str) -> Py7zIO:
        self.count += 1

zip_path = ".../multivolume/test.7z" # See https://github.com/miurahr/py7zr/issues/575 to get this file

with multivolumefile.open(zip_path, mode='rb') as multizip_handler:
    with SevenZipFile(multizip_handler, 'r') as zip_handler:
        mf = MyFactory()
        zip_handler.readall(mf)

print("Count:", mf.count) # should output 8000

aquirin avatar Oct 16 '24 14:10 aquirin

@miurahr I have tested your updated PR using your new streaming technique, but unfortunately I am still reporting a huge memory consumption (2Gb) for multipart 7z files containing a lot of files. As I am not writing/reading any buffers, I expect no memory leaks.

What your Py7zIO class?

Could you share Python profiling result? https://docs.python.org/3/library/profile.html

FYI:

There are definitions of "debug" dependency https://github.com/miurahr/py7zr/blob/master/pyproject.toml#L90C1-L93C2

debug = [
      "pytest",
      "pytest-leaks",
      "pytest-profiling",
]

When running tox -e mprof, you can see the result of pytest profiling. Here is a pytest definition in pyprojet.toml

[testenv:mprof]
extras = test, debug
commands =
    mprof run --multiprocess pytest
    mprof plot --output memory-profile.png
deps =
    memory_profiler
    matplotlib

miurahr avatar Oct 16 '24 22:10 miurahr

What your Py7zIO class?

from py7zr.io import WriterFactory, Py7zIO

Could you share Python profiling result? https://docs.python.org/3/library/profile.html

Attached. test_multi_stats.zip

I do not have tox, let me know if the profiling is enough for you

aquirin avatar Oct 16 '24 23:10 aquirin

What your Py7zIO class?

from py7zr.io import WriterFactory, Py7zIO

It is nonsense.

Your code does not return an object of Py7zIO interface.

    def create(self, filename: str) -> Py7zIO:
        self.count += 1

As you know an object oriented programming, you must have your complete implementation class of Py7zIO interface. You may have MyPy7zIO or similar class definition. If not, it is invalid.

miurahr avatar Oct 17 '24 00:10 miurahr

From your stats, there are so many BytesIO read access. I believe I removed BytesIO object from py7zr library readall method in this topic branch. It is still used in test code.

   134973    0.057    0.000    0.057    0.000 {method 'read' of '_io.BytesIO' objects}

There are many other place to use BytesIO object.

$ grep BytesIO py7zr/*
py7zr/archiveinfo.py:from io import BytesIO
py7zr/archiveinfo.py:            buffer = io.BytesIO(fp.read(size))
py7zr/archiveinfo.py:    def retrieve(cls, fp: BinaryIO, buffer: BytesIO, start_pos: int, password=None):
py7zr/archiveinfo.py:    def _read(self, fp: BinaryIO, buffer: BytesIO, start_pos: int, password) -> None:
py7zr/archiveinfo.py:        buffer2 = io.BytesIO()
py7zr/archiveinfo.py:        buf = io.BytesIO()
py7zr/archiveinfo.py:        buf = io.BytesIO()
py7zr/io.py:class Py7zBytesIO(Py7zIO):
py7zr/io.py:        self._buffer = io.BytesIO()
py7zr/io.py:        """request writer which compatible with BinaryIO, such as BytesIO object."""
py7zr/io.py:class BytesIOFactory(WriterFactory):
py7zr/io.py:        self.products: dict[str, Py7zBytesIO] = {}
py7zr/io.py:        product = Py7zBytesIO(filename, self.limit)
py7zr/py7zr.py:        buffer = io.BytesIO(self.fp.read(self.sig_header.nextheadersize))
py7zr/py7zr.py:        if isinstance(bio, io.BytesIO):
py7zr/py7zr.py:            self._writef(io.BytesIO(data.encode("UTF-8")), arcname)
py7zr/py7zr.py:            self._writef(io.BytesIO(bytes(data)), arcname)
py7zr/py7zr.py:                        with io.BytesIO() as ofp:
py7zr/py7zr.py:                        with io.BytesIO() as omfp:
py7zr/py7zr.py:            fd = io.BytesIO(tgt)

I would like to reduce it.

miurahr avatar Oct 17 '24 00:10 miurahr

@miurahr I have tested your updated PR using your new streaming technique, but unfortunately I am still reporting a huge memory consumption (2Gb) for multipart 7z files containing a lot of files. As I am not writing/reading any buffers, I expect no memory leaks.

@aquirin

It seems because py7zr try to use as many memory as possible to speed-up extraction. Please see https://github.com/miurahr/py7zr/blob/master/py7zr/properties.py#L87


def get_memory_limit():
    """
    Get memory limit for allocating decompression chunk buffer.
    :return: allowed chunk size in bytes.
    """
    default_limit = int(128e6)
    if sys.platform.startswith("linux") or sys.platform.startswith("darwin") or sys.platform.startswith("openbsd"):
        import resource

        import psutil

        try:
            soft, _ = resource.getrlimit(resource.RLIMIT_DATA)
            if soft == -1:
                avmem = psutil.virtual_memory().available
                return min(default_limit, (avmem - int(256e6)) >> 2)
            else:
                return min(default_limit, (soft - int(256e6)) >> 2)
        except AttributeError:
            pass
    # fall back to default
    return default_limit

If you want to reduce memory usage, please limit process memory using OS feature.

Here is a test code to limit the memory usage. https://github.com/miurahr/py7zr/blob/master/tests/test_misc.py#L163

def test_extract_high_compression_rate(tmp_path):
    gen = Generator()
    with py7zr.SevenZipFile(tmp_path.joinpath("target.7z"), "w") as source:
        source.writef(gen, "source")
    limit = int(512e6)  # 0.5GB
    with limit_memory(limit):
        with py7zr.SevenZipFile(tmp_path.joinpath("target.7z"), "r") as target:
            target.extractall(path=tmp_path)


@contextmanager
def limit_memory(maxsize: int):
    """
    Decorator to limit memory. Raise MemoryError when the limit is exceeded.

    :param maxsize: Maximum size of memory resource to limit
    :raises: MemoryError: When function reaches the limit.
    """
    if sys.platform.startswith("win") or sys.platform.startswith("cygwin"):
        yield
    else:
        import resource

        import psutil  # type: ignore

        soft, hard = resource.getrlimit(resource.RLIMIT_AS)
        soft_new = psutil.Process(os.getpid()).memory_info().rss + maxsize
        if soft == -1 or soft_new < soft:
            resource.setrlimit(resource.RLIMIT_AS, (soft_new, hard))
        try:
            yield
        finally:
            resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

The usage of 2GB is not a bug, but a designated.

miurahr avatar Oct 17 '24 00:10 miurahr

Ok thanks for the insights. But is it possible to add an API parameter to set the memory limit? Relying on an external context manager looks convoluted to me especially from the fact that you made a direct function to get the limit.

aquirin avatar Oct 17 '24 09:10 aquirin

It is nonsense. Your code does not return an object of Py7zIO interface.

Sorry, it is not non-sense, it was just a leftover from your code sample that I used and that I forgot to clean. From the point of view of execution, it did not prevented to run until the end, and I guess it is not related to the memory limit as your last message suggested. Thanks for the investigation!

aquirin avatar Oct 17 '24 09:10 aquirin

Please remind the 7-zip format is designed as "solid-compression".

https://py7zr.readthedocs.io/en/latest/archive_format.html https://en.wikipedia.org/wiki/Solid_compression

@starkapandan your proposal syntax is basically allowed users to random access to file entries which does not allow in the logical reason. The core of the library is NOT a similar syntax as zipfile module, but a difference of the format.

You are welcome to provide a sample working implementation to realize the proposal, you are welcome. I am not sure who can break a logical limitation of the design.

Zip format

zipfile is non-solid format, so zipfile can open the zip file, and seek to the start of the compressed data, then return it to user. Please look at a graphical explanation in Wikipedia.

7zip format

7zip format is solid format, so py7zr need to decompress content to reach to target file entry. There is NO obvious position of compressed data of the file entry. When library allow users to random access to archive entry, the library read the all the uncompressed data into memory, or disk, and return data. Otherwise, it also requests user to give a callback object and write data into the object when the library visits the archive data.

miurahr avatar Oct 20 '24 04:10 miurahr

read and readall method name lead users wrong direction. I would like to remove read/readall methods.

miurahr avatar Oct 20 '24 04:10 miurahr

Here is an updated manual of releted feature.

https://py7zr--620.org.readthedocs.build/en/620/advanced.html https://py7zr--620.org.readthedocs.build/en/620/api.html#py7zr.SevenZipFile.extract

Please understand an description and an example code.

miurahr avatar Oct 20 '24 05:10 miurahr

We at RomM have forked py7zr especifically to use this in-progress PR, as we needed to stop getting OOM issues during decompression.

I'm commenting here in case the implementation we needed to work on is relevant for the refactor you're working on. The changes are at https://github.com/rommapp/romm/pull/1282, and something that was very useful for us to have a similar approach to other decompression libs, was the CallbackIOFactory we added, as a way to provide callables to run during write and read, to still maintain full responsability of what needs to be done, on the callee instead of moving that logic to the factory.

adamantike avatar Dec 02 '24 13:12 adamantike

We at RomM have forked py7zr especifically to use this in-progress PR, as we needed to stop getting OOM issues during decompression.

I'm commenting here in case the implementation we needed to work on is relevant for the refactor you're working on. The changes are at rommapp/romm#1282, and something that was very useful for us to have a similar approach to other decompression libs, was the CallbackIOFactory we added, as a way to provide callables to run during write and read, to still maintain full responsability of what needs to be done, on the callee instead of moving that logic to the factory.

I am confident with your testimony. When I solve all the issues of CI error, I will merge here.

miurahr avatar Dec 03 '24 06:12 miurahr