
PERF: extract files from wheel in 1MB blocks + skip decoding for 0-byte files

Open morotti opened this issue 1 year ago • 8 comments

Hello,

Can I offer you a small fix to optimize extraction of files? Do you need a news file for a <=1% performance improvement?

Testing with a profiler and a pack of internal packages, this removes a lot of function calls and syscalls.

under profiler:

  • shutil.copyfileobj() -> 11149 calls down to 10774
  • read() -> 28645 calls down to 24955
  • _read1() -> 17882 calls down to 14182
  • decompress() -> 17810 calls down to 14110

5818 ms of extraction time is reduced to 5262 ms. it's not a big difference inside a 30-second "pip install ..." run.

shutil.copyfileobj() has a default buffer of 64k. most source files are smaller than 64k, so the improvement is not huge; we're not starting from the worst case where files are read in 4k blocks.
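To illustrate the point about block sizes (a minimal sketch using in-memory streams, not pip's actual code): shutil.copyfileobj() loops calling read(length) until it gets an empty chunk, so a larger block size means fewer read()/decompress() round-trips for large files. The default length is 64 KiB on most platforms (shutil.COPY_BUFSIZE).

```python
import io
import shutil

# A 3 MiB in-memory "file" standing in for a wheel member.
payload = b"x" * (3 * 1024 * 1024)
src = io.BytesIO(payload)
dst = io.BytesIO()

# Copy in 1 MiB blocks: 3 data reads plus one final empty read,
# instead of ~48 reads with the default 64 KiB buffer.
shutil.copyfileobj(src, dst, 1024 * 1024)

assert dst.getvalue() == payload
```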

there are good savings from skipping empty files. empty __init__.py files seem to make up 5-10% of the files in most python packages.

looking at the profiler, a fair amount of work goes into extracting each file from the zipfile. the standard library does a lot of operations to read/seek headers, verify metadata, and prepare for extraction with read/write locks in case the file is used concurrently by multiple tasks. that's not necessary for pip install. I think there is room for a much bigger improvement by rewriting the ZipExtFile extractor class from the standard lib without the clutter.
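A self-contained sketch of the pattern being discussed (illustrative only, not pip's code): each ZipFile.open() builds a ZipExtFile that re-seeks to the member's local header and sets up a decompressor, even for an empty member. Since file_size is already known from the central directory, a caller can skip that work entirely for 0-byte members.

```python
import io
import zipfile

# Build a small in-memory wheel-like zip with one empty member.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("pkg/__init__.py", "")    # empty member, still created on disk
    zf.writestr("pkg/mod.py", "x = 1\n")  # non-empty member

with zipfile.ZipFile(buf) as zf:
    for info in zf.infolist():
        if info.file_size == 0:
            data = b""  # nothing to extract; skip ZipFile.open() entirely
        else:
            with zf.open(info) as f:  # this is where ZipExtFile setup happens
                data = f.read()
```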

Regards.

morotti avatar Jun 27 '24 15:06 morotti

Wait, does this PR outright skip installing empty __init__.py files? That's not something we can accept, as the mere existence of __init__.py affects how the package is treated by Python's import system (its absence turns the package into an implicit namespace package).

ichard26 avatar Jun 27 '24 15:06 ichard26

Hello, please have a look at the few lines of code.

It still creates empty files. It doesn't go through the work of initializing the zipfile class to extract empty files, since there's nothing to extract anyway.

I can maybe add a comment to say that open() creates the file, if that's not clear enough.

morotti avatar Jun 27 '24 15:06 morotti

okay, I can see the confusion coming from trying to fit the title in 80 characters, and from not having spent days investigating the zipfile library in the interpreter ^^

        with open(self.dest_path, "wb") as dest:  # this creates the file and opens it for writing
            if zipinfo.file_size > 0:
                with self._zip_file.open(zipinfo) as f:
                    # this creates a ZipExtFile extractor class
                    # https://github.com/python/cpython/blob/3.12/Lib/zipfile/__init__.py#L839
                    # it does a ton of work to reopen the file, check zip metadata,
                    # initialize multiple streams for decompression, and check CRCs.
                    # you want to skip all of that when the content is 0 bytes,
                    # because there's nothing to read anyway.
                    blocksize = min(zipinfo.file_size, 1048576)
                    shutil.copyfileobj(f, dest, blocksize)

morotti avatar Jun 27 '24 15:06 morotti

Could you add a comment clarifying that the simple open/close is sufficient if the file is empty? Also, I didn't immediately recognise that 1048576 was 1MB. Maybe mention that in the comment as well? I don't think you need to explain about ZipExtFile in the comment - that's probably more detail than we need.

Something like

For an empty file, there's nothing we need to copy, so just close the file now we have created it. For non-empty files, copy the file in one read, or in 1M chunks if the file is bigger than 1M.

Actually, why do you need to check the file size? Won't shutil.copyfileobj(f, dest, 1048576) do a single read anyway if the file is smaller than 1M? Surely the over-allocation of memory for the buffer doesn't matter?

IMO the performance improvement is marginal, but the added code complexity is also small, so I'm OK with this change.

pfmoore avatar Jun 27 '24 16:06 pfmoore

I'm also in favor of this, but I'll point out some reasons why I think it needs a "feature" news entry, so users can identify if it's the cause of an issue:

  • zipinfo.file_size is derived from the header metadata and a user could manually modify this, for reasons that might be valid for them
  • There could be a system where reading larger chunks of files is not appropriate (though likely such a system would already need to patch pip to work)

notatallshaw avatar Jun 27 '24 16:06 notatallshaw

I've adjusted the news and added comments. should be good now.

Actually, why do you need to check the file size? Won't shutil.copyfileobj(f, dest, 1048576) do a single read anyway if the file is smaller than 1M? Surely the over-allocation of memory for the buffer doesn't matter?

Ah, that brings up a lot of interesting questions. shutil.copyfileobj() always does at least two calls to read() because it has no concept of file size; it operates on streams, and the stream ends when a read returns no data.
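The "at least two reads" behavior can be shown with a small counting wrapper (a hypothetical illustration, not pip code): even when the whole payload fits in one block, copyfileobj() must issue one more read() that returns b"" before it knows the stream has ended.

```python
import io
import shutil

class CountingReader(io.BytesIO):
    """BytesIO that counts how many times read() is called."""
    def __init__(self, data):
        super().__init__(data)
        self.reads = 0

    def read(self, size=-1):
        self.reads += 1
        return super().read(size)

src = CountingReader(b"hello")  # tiny payload, fits in a single block
shutil.copyfileobj(src, io.BytesIO(), 1024 * 1024)

# One read returns b"hello", a second returns b"" to signal end-of-stream.
assert src.reads == 2
```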

zipfile.open() does multiple calls to seek() and read() to locate/verify the headers and prepare for extraction. It's reusing the zipfile.fp handle to the file (BufferedReader) and I think it breaks the read buffer by doing seeks :(

the returned object, the decompressor passed to shutil.copyfileobj(), is implemented as a BufferedReader() that decompresses on the fly by passing chunks to zlib, with some buffering beyond my comprehension.

The over-allocation of memory has a small effect under the profiler. I'm not sure if it's caused by the 3 layers of buffers and/or by heuristics in zlib planning for larger buffers when told to extract more data (zlib operates on streams and has no concept of file size). It's probably worth pointing out that most files are source code of a few kilobytes, so over-allocating hundreds of kilobytes is not ideal.

I think there is room for much more substantial gains by writing a custom zipfile decompressor (mostly inspired by the standard lib) that is read-only, more friendly to the file read cache, and optimized to decompress files whose size is known in advance.

Until somebody does that, this PR is as far as I could get using the standard compressor :D

morotti avatar Jun 28 '24 09:06 morotti

(rebasing to retrigger tests)

morotti avatar Jun 28 '24 12:06 morotti

hello, all comments reviewed, this is ready to merge

morotti avatar Jul 01 '24 15:07 morotti

@pradyunsg Hello, do you think you could merge this PR for the 24.2 release? It's been approved for 3 weeks, I think it's good to merge.

morotti avatar Jul 15 '24 09:07 morotti

Thanks for the improvement :-)

pfmoore avatar Jul 17 '24 15:07 pfmoore