proposal: `pex3 cache` introspection/gc command
Discussed in https://github.com/pantsbuild/pex/discussions/2200
Originally posted by cosmicexplorer August 1, 2023 forked from a response to https://github.com/pantsbuild/pex/pull/2175#issuecomment-1652279145:
There is an expense here in ~duplicating cached zips, and Pants / Pex are already both notorious amongst users for excessive cache sizes. Without that, this feature definitely needs to be behind a flag (--i-opt-in-to-cache-doubling - clearly not spelled like that!). Now you already mentioned being behind a flag, so I think you're on board there.
Cache GC Policies
Generalizing this a bit, I recall that pantsd used to have a flag for how often it garbage collects the Rust store. If there are concerns about the bloat of pex cache directories, are there any opportunities for pex itself to help the user automate cache management beyond just `rm -rf ~/.pex`? What is currently the easiest way to implement e.g. LRU eviction? I guess I can do something like this?
> find ~/.pex -type f \( -atime '+30' -or \( -atime '+7' -size '+300M' \) \) -exec rm -f '{}' '+'
The above probably works, but I'm wondering if the dilemma about cache bloat that you describe is partially because the user isn't given enough tools to mediate it? Or am I misinterpreting you?
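As a rough illustration of the same recency heuristic outside of `find`, here is a minimal Python sketch. The function name `stale_files` and its parameters are hypothetical and not part of Pex. It also assumes the filesystem actually records access times, which is not the case on `noatime` mounts.

```python
import os
import stat
import time
from pathlib import Path
from typing import Iterator, Optional

SECONDS_PER_DAY = 86400.0


def stale_files(
    root: Path,
    max_idle_days: float = 30,
    big_idle_days: float = 7,
    big_size_bytes: int = 300 * 1024 * 1024,
    now: Optional[float] = None,
) -> Iterator[Path]:
    """Yield regular files under root that look like eviction candidates.

    A file qualifies if it has not been accessed for more than max_idle_days,
    or if it is larger than big_size_bytes and has not been accessed for more
    than big_idle_days.
    """
    now = time.time() if now is None else now
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                st = path.lstat()  # lstat so symlinks are never followed
            except OSError:
                continue  # the file vanished while we were walking
            if not stat.S_ISREG(st.st_mode):
                continue
            idle_days = (now - st.st_atime) / SECONDS_PER_DAY
            is_big = st.st_size > big_size_bytes
            if idle_days > max_idle_days or (is_big and idle_days > big_idle_days):
                yield path
```

This only selects candidates; deleting them safely while other processes may be using the cache is a separate problem.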
Insight: evict cache entries based on usage frequency
In particular, one GC heuristic that pex (or pip) itself would be in the best place to record is not just how recently each cache entry was accessed, but how often. Something like this could be fun:
> pex3 cache evict -accessed '>30 days' -or \( -size '>300M' -accessed '<1 per 1 day' \)
448M ~/.pex/stitched_dists/7a4763a35d0824ebb172b00f8d0241ff231404c4b7d97dd5ea870d5afca336a4/tensorflow_gpu-2.5.3-cp38-cp38-manylinux2010_x86_64.whl
...
2.5GB to be removed. Delete? [Y/n] y
2.5GB deleted.
Does that sound like a fruitful thing to investigate further? Or are there better ways to address the disk usage pressure?
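To make the frequency idea concrete, here is a minimal sketch of the bookkeeping it would need. Everything in it is an assumption for illustration: `AccessRecord`, `should_evict`, and the thresholds are hypothetical names mirroring the example command above, not anything Pex or pip records today.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86400.0


@dataclass
class AccessRecord:
    """Hypothetical per-entry usage metadata a cache could maintain."""

    size_bytes: int
    first_access: float  # epoch seconds
    last_access: float  # epoch seconds
    access_count: int = 1

    def record_access(self, when: float) -> None:
        self.last_access = when
        self.access_count += 1

    def accesses_per_day(self, now: float) -> float:
        # Use a window of at least one day so brand-new entries do not get
        # an artificially huge rate.
        window_days = max((now - self.first_access) / SECONDS_PER_DAY, 1.0)
        return self.access_count / window_days


def should_evict(
    rec: AccessRecord,
    now: float,
    max_idle_days: float = 30,
    big_size_bytes: int = 300 * 1024 * 1024,
    min_daily_rate: float = 1.0,
) -> bool:
    """Evict if idle too long, or if large and rarely used."""
    idle_days = (now - rec.last_access) / SECONDS_PER_DAY
    if idle_days > max_idle_days:
        return True
    return rec.size_bytes > big_size_bytes and rec.accesses_per_day(now) < min_daily_rate
```

Unlike atime, a record like this has to be written by the tool itself on every cache hit, which is why pex (or pip) would be the natural place to keep it.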
Prior Art
Examples of this from other tools:
pip
example
One useful bit of prior art is the new `pip cache` subcommand within pip (it's on the `main` branch, not sure which version it first appeared in):
> PYTHONPATH="$(pwd)/dist/pip-23.3.dev0-py3-none-any.whl:${PYTHONPATH}" python3.12 -m pip cache list --help
Usage:
/home/cosmicexplorer/.pyenv/versions/3.12.0a7/bin/python3.12 -m pip cache dir
/home/cosmicexplorer/.pyenv/versions/3.12.0a7/bin/python3.12 -m pip cache info
/home/cosmicexplorer/.pyenv/versions/3.12.0a7/bin/python3.12 -m pip cache list [<pattern>] [--format=[human, abspath]]
/home/cosmicexplorer/.pyenv/versions/3.12.0a7/bin/python3.12 -m pip cache remove <pattern>
/home/cosmicexplorer/.pyenv/versions/3.12.0a7/bin/python3.12 -m pip cache purge
Description:
Inspect and manage pip's wheel cache.
Subcommands:
- dir: Show the cache directory.
- info: Show information about the cache.
- list: List filenames of packages stored in the cache.
- remove: Remove one or more package from the cache.
- purge: Remove all items from the cache.
``<pattern>`` can be a glob expression or a package name.
Cache Options:
--format <list_format> Select the output format among: human (default) or abspath
# ...
> PYTHONPATH="$(pwd)/dist/pip-23.3.dev0-py3-none-any.whl:${PYTHONPATH}" python3.12 -m pip cache list 'wheel*'
Cache contents:
- wheel-0.40.0-py3-none-any.whl (64 kB)
- wheel-0.40.0-py3-none-any.whl (64 kB)
- wheel-0.41.0-py3-none-any.whl (64 kB)
- wheel-0.41.0-py3-none-any.whl (64 kB)
spack
comparison
I know `spack` users also have the same issue, but it's less pressing because:
- spack's filesystem usage is largely dominated by the contents of the packages it installs, which (because they do not come in the formats expected by standard package repositories) are often so large that caching like pex or pants does would much more quickly result in uncomfortable disk usage.
- spack specs allow very powerful queries, which makes it easier to implement e.g. "uninstall all versions of emacs without the tree-sitter library" (that looks like `spack uninstall 'emacs~tree-sitter'`) or "anything compiled by a version of clang less than or equal to X.Y.Z and any transitive dependees" (that looks like `spack uninstall --all '%clang@:X.Y.Z'`) by deferring to the clingo ASP logic solver (e.g. https://github.com/spack/spack/blob/936c6045fc0686e683c6b3da20967d2e30a7ec87/lib/spack/spack/solver/concretize.lp#L7).
So spack users generally have the ability to very finely tune the tool's disk usage to suit their own immediate needs, and pruning or even seeding a cache e.g. for export to an internal environment is considered a top-level feature. While `pex` (and especially `pex3`) also make the creation of python environments a top-level feature, we currently aren't able to apply the same selection logic to prune our cache directories.
Insight: select cache entries to evict using our existing platform/interpreter selection logic
Along those lines, to expand on the proposed `pex3 cache` command, we could introduce platform selection logic:
> pex3 cache evict -platform 'linux'
448M ~/.pex/stitched_dists/7a4763a35d0824ebb172b00f8d0241ff231404c4b7d97dd5ea870d5afca336a4/tensorflow_gpu-2.5.3-cp38-cp38-manylinux2010_x86_64.whl
...
2.5GB to be removed. Delete? [Y/n] y
2.5GB deleted.
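A minimal sketch of how platform selection might work for cached wheels, assuming the standard wheel filename layout (`name-version[-build]-python-abi-platform.whl`). The helper names are hypothetical, and the substring match is a deliberate simplification; Pex's real platform and interpreter selection logic is considerably richer than this.

```python
from typing import List


def wheel_platform_tags(wheel_name: str) -> List[str]:
    """Return the platform tags encoded in a wheel filename.

    A compressed tag set such as
    'manylinux_2_17_x86_64.manylinux2014_x86_64' yields one entry per tag.
    """
    suffix = ".whl"
    if not wheel_name.endswith(suffix):
        raise ValueError(f"not a wheel filename: {wheel_name!r}")
    parts = wheel_name[: -len(suffix)].split("-")
    # name, version, optional build tag, python tag, abi tag, platform tag
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename layout: {wheel_name!r}")
    return parts[-1].split(".")


def targets_platform(wheel_name: str, platform: str) -> bool:
    """True if any platform tag of the wheel mentions the given platform."""
    return any(platform in tag for tag in wheel_platform_tags(wheel_name))
```

Under this simplification, a pure-Python wheel tagged `any` never matches a specific platform, so `-platform 'linux'` would leave it in the cache.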
I would love any comments from anyone at all about how they currently manage the pex cache (or don't)!
A user request for cache clearing from pip: https://github.com/pypa/pip/issues/12176.
I think starting things off with 0 magic would be great. Simply supporting `pex3 cache purge` in a multi-process safe way would be great. That would be == to `rm -rf ~/.pex`, except multi-process parallel safe, which is a prerequisite in my opinion to any cache tooling. From there more commands could be added. Even just all or nothing manual ~`pex3 cache purge --type {downloads,pip,installed-wheels}` would likely be useful before even contemplating more complex things like tracking usage, recency, etc.
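For a sense of what "multi-process safe" could mean here, the following is a minimal POSIX-only sketch using an advisory `flock`. It is an assumption for illustration, not how Pex implements purging: the names `exclusive_lock` and `purge` are hypothetical, and it only helps if every cache reader cooperates by taking a shared lock (`fcntl.LOCK_SH`) on the same lock file, which must live outside the cache directory.

```python
import fcntl
import os
import shutil
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def exclusive_lock(lock_path: Path) -> Iterator[None]:
    """Hold an exclusive advisory lock on lock_path for the with-block."""
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until all other holders release
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock


def purge(cache_root: Path, lock_path: Path) -> None:
    """Remove cache_root while holding the exclusive lock."""
    with exclusive_lock(lock_path):
        if not cache_root.exists():
            return
        # Rename first so the cache disappears atomically, then delete the
        # renamed tree at leisure.
        doomed = cache_root.with_name(f"{cache_root.name}.purge.{os.getpid()}")
        os.rename(cache_root, doomed)
        shutil.rmtree(doomed)
```

The rename-then-delete step means a process that starts after the purge never observes a half-deleted directory tree.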
I am requesting a `pex3 cache add` command so the pex cache can be hydrated with external artifacts like npm. This would enable building PEXes in an offline environment without too much extra labour.
@zmanji that seems to make sense, although there is a detail to iron out: what does `cache add` for an sdist mean? Naively this builds and installs the sdist in `~/.pex/installed_wheels` for the current interpreter running Pex, but that clearly could be deemed confusing; so there is a bit of a can of worms. Maybe you can say `pex3 cache add --python <this one> <sdist>`?
All that said, I don't think this will help your offline goals. Pip does more work than you think it does even if you hand it all pinned deps, and Pex doesn't currently try to ameliorate that. I just recently added `tox -e<test env> -- --devpi` support to Pex's build environment and that truly allows for offline operation (not currently, I don't plumb `--offline` to devpi-server, but I could, and in-practice operation is ~offline with a warm cache). Devpi is a great tool for this, specialized to the job and usable in more contexts than just Pex. If you haven't taken a look, it might be worth your while.
The `pex3 cache {dir,info,purge}` command now exists. Although `pex3 cache purge` only allows purging either the whole Pex cache or individual verticals, it does so safely with a lock in place that prevents in-flight PEXes from observing partial cache entries. Follow-ups will now be able to address purging more detailed items like individual project dependencies and LRU style cutoffs when the filesystem supports atime, etc.
I am requesting a pex3 cache add command so the pex cache can be hydrated with external artifacts like npm. This would enable building PEXes in an offline environment without too much extra labor.
@zmanji I think `pex3 cache add/remove` for adding or removing certain project distributions makes sense. That said, they should probably be added as a pair, and remove requires dependency tracking since both unzipped_pexes and venvs caches can symlink into the wheel cache. I broke out #2528 for this.
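One way to picture the dependency tracking problem is a scan for symlinks that point into the wheel cache. This is a minimal sketch under assumptions, not Pex's implementation: `referenced_entries` is a hypothetical helper, and it treats each top-level directory of the wheel cache as one removable entry, which may not match the real cache layout.

```python
import os
from pathlib import Path
from typing import Set


def referenced_entries(wheel_cache: Path, *consumer_roots: Path) -> Set[Path]:
    """Return top-level wheel_cache entries still reachable via symlinks.

    Walks each consumer root (e.g. a venvs or unzipped_pexes cache) without
    following symlinks and records which wheel_cache entry each symlink
    resolves into.
    """
    wheel_cache = wheel_cache.resolve()
    referenced: Set[Path] = set()
    for root in consumer_roots:
        for dirpath, dirnames, filenames in os.walk(root):
            # Symlinks to directories are listed in dirnames but not descended.
            for name in dirnames + filenames:
                path = Path(dirpath) / name
                if not path.is_symlink():
                    continue
                try:
                    rel = path.resolve().relative_to(wheel_cache)
                except ValueError:
                    continue  # points somewhere outside the wheel cache
                if rel.parts:
                    referenced.add(wheel_cache / rel.parts[0])
    return referenced
```

A `remove` command could then refuse to delete (or cascade from) any entry that appears in the returned set.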
That said, I'm adding a new resolver type for #1907 in #2512 that allows fully offline PEX creation using `--pre-resolved-dists /my/wheelhouse`. The wheelhouse can contain sdists, but, if so, it must also contain any wheels needed to build those sdists (setuptools, hatch, poetry, pdm, etc). This behaves exactly like `--no-pypi --find-links /my/wheelhouse` except that Pip is not used at all and Pex resolves requirements solely from the distributions already downloaded in the directory. The only caveat is that those distributions must form a resolve solution; i.e. be the result of a prior `pip download -d /my/wheelhouse ...` or `pip wheel -w /my/wheelhouse ...`.