pex
pex copied to clipboard
consume packed wheel cache in zipapp creation
Closes #2158.
Problem
#1675 has another repro.
Generating a --compress --layout zipapp (the default) regularly exhibits pathologically slow performance (46s zip / 1m1s total) on many real-world inputs (especially tensorflow). This occurs even when all dists are already downloaded from pypi:
; time pex -vvvvvvvvv --layout zipapp --resolver-version=pip-2020-resolver 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3' -o test.pex
# ...
# Saving PEX file to test.pex
# pex: Zipping PEX file.: 45659.1ms
# 57.18s user 3.02s system 97% cpu 1:01.57 total
; ls -lAvFh test.pex
# -rwxrwxr-x 1 cosmicexplorer cosmicexplorer 617M Jul 26 09:17 test.pex*
; du -b test.pex
# 616754775 test.pex
Alternatives
#1705 introduced --no-compress, which drastically improves the zip performance (2.2s zip => 20.6x speedup / 19.3s total = 2.4x speedup):
; time pex -vvvvvvvvv --layout zipapp --no-compress --resolver-version=pip-2020-resolver 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3' -o test.pex
# ...
# Saving PEX file to test.pex
# pex: Zipping PEX file.: 2209.7ms
# pex 14.11s user 3.06s system 89% cpu 19.285 total
but as #1686 describes, it also introduces a significant tradeoff in file size (1.6G => 2.5x blowup):
; ls -lAvFh test.pex
# -rwxrwxr-x 1 cosmicexplorer cosmicexplorer 1.6G Jul 26 09:20 test.pex*
; du -b test.pex
# 1541737605 test.pex
The cache entries are also many times larger:
; ls $(find ~/.pex/packed_wheels -name '*tensorflow_gpu*whl')
# -rw-rw-r-- 1 cosmicexplorer cosmicexplorer 468M Aug 1 09:10 /home/cosmicexplorer/.pex/packed_wheels/7a4763a35d0824ebb172b00f8d0241ff231404c4b7d97dd5ea870d5afca336a4/tensorflow_gpu-2.5.3-cp38-cp38-manylinux2010_x86_64.whl
# -rw-rw-r-- 1 cosmicexplorer cosmicexplorer 1.2G Aug 1 09:16 /home/cosmicexplorer/.pex/packed_wheels/7a4763a35d0824ebb172b00f8d0241ff231404c4b7d97dd5ea870d5afca336a4/un-compressed/tensorflow_gpu-2.5.3-cp38-cp38-manylinux2010_x86_64.whl
Solution
- Introduce
pex.ziputils.MergeableZipFileto incorporate the contents of other zip files without decompressing them.- Introduce
pex.common.copy_file_range()to copy over a limited range of the contents of another file handle.
- Introduce
- Introduce the
--cache-distsoption to toggle the use (and/or creation) of the packed wheel caches for.bootstrap/and.deps/subdirectories. - Consume the existing
--layout packedcaches for--layout zipappwhen--cache-distsis provided.
Reading @jsirois's thoughts at https://github.com/pantsbuild/pex/issues/2158#issuecomment-1599342665 again:
Since this would only be used by the Pex CLI (build-time), a rust implementation is actually how I'd approach this. It's the right language for the job, the problem really has nothing to do with Pex (it has much wider applicability), and there are great tools for turning a crate into an easily consumable Python platform-specific distribution.
I'll take a look at pyoxy and friends to convert the crate into a python distribution. I'll leave this as a draft until then, and I'll re-ping reviewers at that time.
Ah, wonderful, I've just found the docs for pyo3 https://docs.rs/pyo3/latest/pyo3/.
Ah, wonderful, I've just found the docs for pyo3 https://docs.rs/pyo3/latest/pyo3/.
Great. Yeah @cosmicexplorer that's exactly what I meant. I've been away climbing here and haven't taken a look yet, but if this all looks good, consuming as a native wheel was the idea.
Ok, just had a lot of fun converting this into a pyo3 package and the result runs even faster! This branch isn't passing because I've added some hacks to get it running, but I will be working to remove those!
Just refactored this immensely after @jsirois's suggestion to use @stuhood's strategy of zip -FF.
That said, it really does beg the question of why not just use
--layout packed
I understand that --layout packed was created specifically for users seeking to maximize cacheability, especially across rsync, and I don't plan to put any more effort into making --layout zipapp more cacheable.
I think this change produces a really drastic and noticeable performance improvement for an extremely common use case (especially the use of pex in prototyping), and I was really excited to figure out how to avoid introducing any new types of output formats. I hope this change can be completely invisible to users (except as a noticeable perf bump).
I think this change produces a really drastic and noticeable performance improvement for an extremely common use case (especially the use of pex in prototyping)
Why not use --layout packed or --layout loose for that? The latter should be the faster yet for prototyping. Adding new features is expensive as you know; so I just want to nail down how this actually makes some real prototype scenario you have or have in mind faster / easier / better. That detail will help assess weight of more code / cache size, etc vs benefit.
I used the term "prototyping" very loosely. I was merely emphasizing how often --layout zipapp is employed for the user's convenience. Our docs at https://pex.readthedocs.io/en/v2.1.140/buildingpex.html#saving-pex-files don't mention the other layouts, so a lot of people will be using these commands.
Possibly more importantly, I assume this speedup will also be felt by pants, which often reconstitutes various intermediate pexes full of requirements, using --layout zipapp, in order to perform some task like running pytest. I was halfway to making a repro case in the pants repo but gave up for now.
I used the term "prototyping" very loosely. I was merely emphasizing how often --layout zipapp is employed for the user's convenience. Our docs at https://pex.readthedocs.io/en/v2.1.140/buildingpex.html#saving-pex-files don't mention the other layouts, so a lot of people will be using these commands.
Ok. The docs are in a sorry state to be sure, but that can be remedied. Since Pex has a rule of not breaking existing users, --layout zipapp is the default, so more convenient in that you need not type. But there is shell aliases to help with that.
If you don't have an actual use case in mind, that's not awesome. There is an expense here in ~duplicating cached zips and Pants / Pex are already both notorious amongst users for excessive cache sizes. If I were working on this problem, I'd want to take a hard look at the existing packed layout and the needs of the zipapp layout and seeing if there were a way to make them be able to share the same zips. Without that, this feature definitely needs to be behind a flag (--i-opt-in-to-cache-doubling - clearly not spelled like that!). Now you already mentioned being behind a flag, so I think you're on board there. But, then the convenience argument above goes poof. So I'd really like to either understand a unique use case that makes this better than --layout packed or else an idea about how to change things so that cached zips can be shared.
Possibly more importantly, I assume this speedup will also be felt by pants, which often reconstitutes various intermediate pexes full of requirements, using --layout zipapp, in order to perform some task like running pytest. I was halfway to making a repro case in the pants repo but gave up for now.
Pants in fact uses --layout packed exclusively for all internal Python operations. It only produces a zip when asked via the package goal (AWS Lambda zips, and pex_binary targets with the default zipapp layout).
I wanted to note I won't be back online until ~Friday August 4th.
Ok. The docs are in a sorry state to be sure, but that can be remedied.
That wasn't my intention -- please don't assume antagonism. I was attempting to justify my impression that --layout zipapp was a widely used standard/default for pex, which is why I had assumed this would be a more immediately attractive improvement. I would love to help improve the docs situation here!
so more convenient in that you need not type
--layout zipapp is also more convenient for many deployment scenarios in that it produces a single output file, which is one of the main use cases I have for the pex utility.
Pants in fact uses
--layout packedexclusively for all internal Python operations. It only produces a zip when asked via the package goal (AWS Lambda zips, and pex_binary targets with the default zipapp layout).
Ok great, I appreciate taking the time to provide the larger context here. In the case of producing a pex_binary() with many large 3rdparty deps, such as often occurs with machine learning code, pants users currently have to choose between waiting over a minute or accepting a massive uncompressed zip, even where the exact same command has just been run and everything possible is cached.
an idea about how to change things so that cached zips can be shared.
I think python's zipfile may be able to copy over entries without decompressing and recompressing them the way the rust zip crate does. That would be how I would accomplish this while renaming the entries (since the names are not part of the compressed data).
After a bit of investigation, it seems all the necessary tools to reuse --layout packed cache archives for --layout zipapp are available in the python zipfile library through 2.7.
Ok, I didn't think it would be possible, but @jsirois (who is off climbing now) was absolutely right to suggest investigating the reuse of --layout packed cache entries. This solution:
- doesn't rely on any external binaries or libraries,
- doesn't introduce any new cache entries,
- doesn't rely on any private or undocumented methods,
- does produce standards-compliant zip files! You'll notice it looks much more similar to the previous implementation.
The magic trick is in pex.ziputils.MergeableZipFile, a subclass of zipfile.ZipFile with a method .merge_archive() that loops over entries from a source ZipFile and accesses its (public!) file pointer attribute .fp to directly copy over compressed file contents.
@Eric-Arellano @stuhood: absolutely zero rush on this, but since @jsirois is out right now, please feel free to leave any thoughts/comments whenever you find the time.
As a side note, I'm planning to submit a proposal to PackagingCon about this + pypa/pip#12186 today/this week (the cfp closes August 6th).
One thing it would be very helpful to get input on: there are two changes in behavior in the current diff (both of which are subjective, not necessary to achieve the goals of this diff, and can be reverted):
- source files now come before
.bootstrap/in the output ofunzip -lfor--layout zipapp: https://github.com/pantsbuild/pex/blob/7130bd1c57bd00cd81581e0b901c4aba740b5a15/pex/pex_builder.py#L855-L858 - all entries in
~/.pex/{bootstrap_zips,packed_wheels}/havedeterministic_timestamp=True: https://github.com/pantsbuild/pex/blob/7130bd1c57bd00cd81581e0b901c4aba740b5a15/pex/pex_builder.py#L816-L820
I suspect the right thing to do is to revert to the prior behavior in this change, then split out separate PRs to propose both of these behavioral changes, but I would like to know whether these changes in behavior are considered modifying the public API.
I'm reviewing presently, but as to:
- source files now come before .bootstrap/ in the output of unzip -l for --layout zipapp:
This is not good for 2 reasons:
- Zips are tail oriented; so having
__main__.pyin the tail is desirable. That file is executed on every run (unless you use--sh-boot) from within the zip even if the PEX zip has previously been extracted / installed and if it was already loaded into the page cache when the end of central directory record was read, that's a small overhead win. - When I run
zipinfo my.pextoday, the.bootstrap/, etc scroll up and away and I see my user code at the bottom in the console.
Python is so slow compared to (SSD) disk accesses that 1 may not be all that relevant any longer and 2 is eye of the beholder - presumably your zipinfo environment is in a pager and so you see the 1st page 1st. Either way, I think this change is, on the whole, not justified.
- all entries in ~/.pex/{bootstrap_zips,packed_wheels}/ have deterministic_timestamp=True:
Seems right to me.
When I run zipinfo my.pex today, the .bootstrap/, etc scroll up and away and I see my user code at the bottom in the console.
Oops! I was so used to .deps/ being quite large for the test cases I've been playing with here that I hadn't realized you're describing the more usual case! I had been repeatedly piping everything through less. This one is easy to revert; however:
Zips are tail oriented; so having main.py in the tail is desirable.
If I'm hearing you right, it sounds like you're advocating for something a little different: the ordering of:
.bootstrap/.deps/- sources, metadata and runner, incl
PEX-INFOand__main__.py
Right now, it's .bootstrap/, then sources, then .deps/ is at the end. To address both of your reasons, it sounds like we should change the behavior from the current? I can push a diff to show what that would look like!
Oops! Didn't realize that was the current behavior! However, regarding the following:
Zips are tail oriented; so having main.py in the tail is desirable.
Would it be useful to put __main__.py at the literal very end, then?
If I'm hearing you right, it sounds like you're advocating for something a little different: the ordering of:
I'm advocating for don't muck with essentially unrelated stuff in this PR. Please keep the status quo. That status quo, right or wrong, is this:
$ pex -V
2.1.140
$ touch foo.py bar.py
$ pex cowsay --module foo --module bar -o cowsay.pex
$ zipinfo -1 cowsay.pex
.bootstrap/
.bootstrap/pex/
...
.bootstrap/pex/version.py
.bootstrap/pex/ziputils.py
.deps/
.deps/cowsay-5.0-py2.py3-none-any.whl/
...
.deps/cowsay-5.0-py2.py3-none-any.whl/cowsay-5.0.dist-info/entry_points.txt
.deps/cowsay-5.0-py2.py3-none-any.whl/cowsay-5.0.dist-info/top_level.txt
PEX-INFO
__main__.py
__pex__/
__pex__/__init__.py
bar.py
foo.py
Your contributions are very welcome, but I'm having flashbacks of the wild sprawling nature these things tended to have and that part is very unwelcome as a reviewer, especially a plodding, single threaded reviewer like me. Please keep the exposed products of your work as focused as possible regardless of how sprawling your internal work process may be.
Please see https://github.com/pantsbuild/pex/pull/2175#issuecomment-1665770648, where you'll find I had a simple misunderstanding. I understand you are not at liberty to review in depth right now.
Would it be useful to put main.py at the literal very end, then?
Yes, but __pex__/__init__.py also has the same role when the PEX is used as a sys.path entry; i.e.: PYTHONPATH=my.pex python -c 'from __pex__ import cowsay; ...' (this is a newish feature you may not be aware of). That said, I think some PR later should probably make the tail order exactly:
__pex__/
__pex__/__init__.py
__main__.py
The __main__.py entrypoint case will probably always be more commonly used than the __pex__ magic import hook.
Just noticed this:
Ok. The docs are in a sorry state to be sure, but that can be remedied.
That wasn't my intention -- please don't assume antagonism.
I actually mean it. The docs are in a sorry state. They are horrible. I have not spent any time on them in the ~5 years I've been working as the primary Pex maintainer; meanwhile I have spent time adding a lot of features for Pants; so the docs are severely out of date. I do plan to circle back though. I've vetted out a whole system over at https://github.com/a-scie/lift based on Sphinx that I will be applying here along with fully re-worked content.
That said, I think some PR later should probably make the tail order exactly:
On it! #2209
@jsirois you already approved this, but I assumed you'd want to review again after I introduced the --cache-dists option to gate the use of the packed wheel cache (--cache-dists is default off for zipapps, but default on for packed, in order to maintain the current behavior). I have also significantly shorted the commit message, although in this case I've kept the Problem/Solution structure as I feel that describing the tradeoff vs --no-compress in the Alternatives section was worth specifically calling out.
Re the commit description, I'm still not a fan of the formula - you can say the exact same thing without the damn template. But that's just my grumpy dislike of the 5 paragraph essay. More importantly you just whiff the actual alternative, which is the packed layout. I think you should address why that doesn't work.
@cosmicexplorer although this PR is approved, another thing to weigh is #2298 which provides even greater speedups with 0 hacks by skipping zipping entirely with a bonus of skipping unzipping (pre-installing) as well.
Hey @jsirois, thanks so much for your thoughtful review. I'll be rebasing and responding to your comments, although I understand that pex has changed a bit in the meantime. I think this is a useful change and I have the time to see it through.
Ah, I'll read through #2298 before doing that. I know you've been pushing back on adding this PR's complexity to pex for a while, and I've since found that some of the tricks here are probably better suited for the rust zip crate. I'll read that change and re-evaluate this now. Thanks.
Ok, it seems like this change will still provide a speedup in zipapp creation, without incurring the (very slight = O(100ms)) cold-start performance slowdown of --no-pre-install-wheels. I personally think it's still quite useful to get a build performance improvement "for free" with --layout zipapp, since I believe that's still the default and rather convenient. If you agree, I'll rebase and follow up with this until it's acceptable.
Ok, I totally buy this comment: https://github.com/pex-tool/pex/pull/2175#discussion_r1288753565. I think I'm willing to close this and move the zip file magic to my work on the zip crate. I was really surprised/impressed that we were able to do this, but --no-pre-install-wheels seems like the right way to do this. Thanks for talking me through this: I really appreciate your patience and I always learn from your expertise.