
proposal: `pex3 cache` introspection/gc command

Open cosmicexplorer opened this issue 1 year ago • 5 comments

Discussed in https://github.com/pantsbuild/pex/discussions/2200

Originally posted by cosmicexplorer August 1, 2023 forked from a response to https://github.com/pantsbuild/pex/pull/2175#issuecomment-1652279145:

There is an expense here in ~duplicating cached zips and Pants / Pex are already both notorious amongst users for excessive cache sizes. Without that, this feature definitely needs to be behind a flag (--i-opt-in-to-cache-doubling - clearly not spelled like that!). Now you already mentioned being behind a flag, so I think you're on board there.

Cache GC Policies

Generalizing this a bit, I recall that pantsd used to have a flag for how often it garbage collects the rust store--if there are concerns about the bloat of pex cache directories, are there any opportunities for pex itself to help the user automate the cache management outside of just rm -rf ~/.pex? What is currently the easiest way to implement e.g. LRU eviction? I guess I can do something like this?

> find ~/.pex -type f \( -atime '+30' -or \( -atime '+7' -size '+300M' \) \) -exec rm -f '{}' '+'

The above probably works, but I'm wondering if the dilemma about cache bloat that you describe is partially because the user isn't given enough tools to mediate it? Or am I misinterpreting you?
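For comparison, here is a minimal Python sketch of the same idle-time policy. It is only an illustration, not anything Pex ships: the function name `stale_files` and its parameters are hypothetical, and the thresholds only roughly mirror the `find` flags above. It only selects candidates and leaves deletion to the caller.

```python
import os
import stat
import time
from pathlib import Path
from typing import Iterator, Optional, Tuple

DAY = 24 * 60 * 60


def stale_files(
    root: Path,
    max_idle_days: float = 30,
    large_idle_days: float = 7,
    large_bytes: int = 300 * 1024 * 1024,
    now: Optional[float] = None,
) -> Iterator[Tuple[Path, int]]:
    """Yield (path, size) for regular files that match the eviction policy.

    A file matches if it has been idle longer than max_idle_days, or if it is
    larger than large_bytes and has been idle longer than large_idle_days.
    """
    now = time.time() if now is None else now
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                st = path.lstat()
            except FileNotFoundError:
                continue  # Another process removed the file while we walked.
            if not stat.S_ISREG(st.st_mode):
                continue  # Skip symlinks and special files.
            idle_days = (now - st.st_atime) / DAY
            if idle_days > max_idle_days or (
                idle_days > large_idle_days and st.st_size > large_bytes
            ):
                yield path, st.st_size
```

Like the `find` version, this depends on the filesystem recording access times. Mounts using `noatime` never update them, and `relatime` updates them only coarsely, so the results are approximate at best.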

Insight: evict cache entries based on usage frequency

In particular, one GC heuristic that pex (or pip) itself would be in the best place to record is not just how recently each cache entry was accessed, but how often. Something like this could be fun:

> pex3 cache evict -accessed '>30 days' -or \( -size '>300M' -accessed '<1 per 1 day' \)
448M    ~/.pex/stitched_dists/7a4763a35d0824ebb172b00f8d0241ff231404c4b7d97dd5ea870d5afca336a4/tensorflow_gpu-2.5.3-cp38-cp38-manylinux2010_x86_64.whl
...
2.5GB to be removed. Delete? [Y/n] y
2.5GB deleted.

Does that sound like a fruitful thing to investigate further? Or are there better ways to address the disk usage pressure?
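To make the frequency idea concrete, here is a rough sketch of what recording accesses and evaluating the policy above could look like. Everything here is hypothetical (`AccessLog`, `should_evict`, the 30-day rate window, and treating never-recorded entries as evictable are my assumptions), and it keeps the log in memory rather than deciding how it would be persisted.

```python
from collections import defaultdict
from typing import Dict, List, Optional

DAY = 24 * 60 * 60.0


class AccessLog:
    """Hypothetical in-memory record of when each cache entry was used."""

    def __init__(self) -> None:
        self._hits: Dict[str, List[float]] = defaultdict(list)

    def record(self, entry: str, when: float) -> None:
        self._hits[entry].append(when)

    def last_access(self, entry: str) -> Optional[float]:
        hits = self._hits.get(entry)
        return max(hits) if hits else None

    def rate_per_day(self, entry: str, now: float, window_days: float = 30) -> float:
        """Average accesses per day over the trailing window."""
        cutoff = now - window_days * DAY
        recent = [t for t in self._hits.get(entry, ()) if t >= cutoff]
        return len(recent) / window_days


def should_evict(
    log: AccessLog,
    entry: str,
    size_bytes: int,
    now: float,
    max_idle_days: float = 30,
    large_bytes: int = 300 * 1024 * 1024,
    min_rate_per_day: float = 1.0,
) -> bool:
    """Evict if idle too long, or if large and used less than once per day."""
    last = log.last_access(entry)
    # Assumption: an entry with no recorded access is treated as evictable.
    if last is None or (now - last) > max_idle_days * DAY:
        return True
    return size_bytes > large_bytes and log.rate_per_day(entry, now) < min_rate_per_day
```

The interesting design question is where `record` would be called from, since it adds a write to every cache hit.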

Prior Art

Examples of this from other tools:

pip example

One useful bit of prior art is the new pip cache subcommand within pip (it's on the main branch, not sure which version it first appeared in):

> PYTHONPATH="$(pwd)/dist/pip-23.3.dev0-py3-none-any.whl:${PYTHONPATH}" python3.12 -m pip cache list --help

Usage:   
  /home/cosmicexplorer/.pyenv/versions/3.12.0a7/bin/python3.12 -m pip cache dir
  /home/cosmicexplorer/.pyenv/versions/3.12.0a7/bin/python3.12 -m pip cache info
  /home/cosmicexplorer/.pyenv/versions/3.12.0a7/bin/python3.12 -m pip cache list [<pattern>] [--format=[human, abspath]]
  /home/cosmicexplorer/.pyenv/versions/3.12.0a7/bin/python3.12 -m pip cache remove <pattern>
  /home/cosmicexplorer/.pyenv/versions/3.12.0a7/bin/python3.12 -m pip cache purge
  

Description:
  Inspect and manage pip's wheel cache.
  
  Subcommands:
  
  - dir: Show the cache directory.
  - info: Show information about the cache.
  - list: List filenames of packages stored in the cache.
  - remove: Remove one or more package from the cache.
  - purge: Remove all items from the cache.
  
  ``<pattern>`` can be a glob expression or a package name.

Cache Options:
  --format <list_format>      Select the output format among: human (default) or abspath
# ...
> PYTHONPATH="$(pwd)/dist/pip-23.3.dev0-py3-none-any.whl:${PYTHONPATH}" python3.12 -m pip cache list 'wheel*'
Cache contents:

 - wheel-0.40.0-py3-none-any.whl (64 kB)
 - wheel-0.40.0-py3-none-any.whl (64 kB)
 - wheel-0.41.0-py3-none-any.whl (64 kB)
 - wheel-0.41.0-py3-none-any.whl (64 kB)

spack comparison

I know spack users also have the same issue, but it's less pressing because:

  1. spack's filesystem usage is largely dominated by the contents of the packages it installs, which (because they do not come in the formats expected by standard package repositories) are often so large that caching like pex or pants does would much more quickly result in uncomfortable disk usage.
  2. spack specs allow very powerful queries, which makes it easier to implement e.g. "uninstall all versions of emacs without the tree-sitter library" (that looks like spack uninstall 'emacs~tree-sitter') or "anything compiled by a version of clang less than or equal to X.Y.Z and any transitive dependees" (that looks like spack uninstall --all '%clang@:X.Y.Z') by deferring to the clingo ASP logic solver (e.g. https://github.com/spack/spack/blob/936c6045fc0686e683c6b3da20967d2e30a7ec87/lib/spack/spack/solver/concretize.lp#L7).

So spack users generally have the ability to very finely tune the tool's disk usage to suit their own immediate needs, and pruning or even seeding a cache e.g. for export to an internal environment is considered a top-level feature. While pex (and especially pex3) also make the creation of python environments a top-level feature, we currently aren't able to apply the same selection logic to prune our cache directories.

Insight: select cache entries to evict using our existing platform/interpreter selection logic

Along those lines, to expand on the proposed pex3 cache command, we could introduce platform selection logic:

> pex3 cache evict -platform 'linux'
448M    ~/.pex/stitched_dists/7a4763a35d0824ebb172b00f8d0241ff231404c4b7d97dd5ea870d5afca336a4/tensorflow_gpu-2.5.3-cp38-cp38-manylinux2010_x86_64.whl
...
2.5GB to be removed. Delete? [Y/n] y
2.5GB deleted.
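As a rough illustration of the platform filter, the sketch below reads the platform tags straight out of a wheel filename, which per the wheel naming convention ends in `{python tag}-{abi tag}-{platform tag}.whl`. This is deliberately not Pex's real platform/interpreter selection logic; the helper names and the crude substring match are assumptions made for the example.

```python
from typing import List


def wheel_platforms(filename: str) -> List[str]:
    """Return the platform tags encoded in a wheel filename.

    Compressed tag sets such as 'manylinux_2_17_x86_64.manylinux2014_x86_64'
    are split into their individual tags.
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    # name-version-python-abi-platform, with an optional build tag after version.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename layout: {filename!r}")
    return parts[-1].split(".")


def matches_platform(filename: str, platform: str) -> bool:
    """Crude match: True if any platform tag contains the given substring."""
    return any(platform in tag for tag in wheel_platforms(filename))
```

With this, `matches_platform("tensorflow_gpu-2.5.3-cp38-cp38-manylinux2010_x86_64.whl", "linux")` is true, while a pure-Python `py3-none-any` wheel does not match `linux` and would be kept.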

cosmicexplorer avatar Aug 01 '23 17:08 cosmicexplorer

I would love any comments from anyone at all about how they currently manage the pex cache (or don't)!

cosmicexplorer avatar Aug 01 '23 17:08 cosmicexplorer

A user request for cache clearing from pip: https://github.com/pypa/pip/issues/12176.

cosmicexplorer avatar Aug 01 '23 17:08 cosmicexplorer

I think starting things off with 0 magic would be great. Simply supporting pex3 cache purge in a multi-process safe way would be a good first step. That would be equivalent to rm -rf ~/.pex, except multi-process safe, which is a prerequisite in my opinion for any cache tooling. From there more commands could be added. Even just an all-or-nothing manual command, something like pex3 cache purge --type {downloads,pip,installed-wheels}, would likely be useful before even contemplating more complex things like tracking usage, recency, etc.
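For illustration only, one common way to get a multi-process safe purge is an advisory file lock plus an atomic rename. The sketch below is not how Pex implements anything; `cache_lock` and `purge` are hypothetical names, it uses POSIX-only `fcntl.flock`, and it assumes every cooperating process takes the same lock (shared) while it reads or writes the cache, with the lock file living outside the cache directory.

```python
import fcntl
import os
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def cache_lock(lock_path: Path, exclusive: bool) -> Iterator[None]:
    """Hold an advisory lock; readers and writers share, purgers exclude."""
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH)
        yield
    finally:
        os.close(fd)  # Closing the descriptor releases the lock.


def purge(cache_dir: Path, lock_path: Path) -> None:
    """Empty cache_dir while no cooperating process holds the shared lock."""
    with cache_lock(lock_path, exclusive=True):
        if not cache_dir.exists():
            return
        # Rename within the same filesystem is atomic, so nobody can observe
        # a half-deleted cache even if the slow rmtree is interrupted.
        doomed = Path(tempfile.mkdtemp(prefix=".purge-", dir=cache_dir.parent))
        os.rename(cache_dir, doomed / "cache")
        cache_dir.mkdir()
        shutil.rmtree(doomed)
```

A process using the cache would wrap its work in `with cache_lock(lock_path, exclusive=False):`, so a purge blocks until in-flight users finish.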

jsirois avatar Aug 03 '23 23:08 jsirois

I am requesting a pex3 cache add command so the pex cache can be hydrated with external artifacts like npm. This would enable building PEXes in an offline environment without too much extra labour.

zmanji avatar Aug 26 '23 03:08 zmanji

@zmanji that seems to make sense, although there is a detail to iron out: what does cache add for an sdist mean? Naively this builds and installs the sdist in ~/.pex/installed_wheels for the current interpreter running Pex, but that clearly could be deemed confusing; so there is a bit of a can of worms. Maybe you can say pex3 cache add --python <this one> <sdist>?

All that said, I don't think this will help your offline goals. Pip does more work than you think it does even if you hand it all pinned deps, and Pex doesn't currently try to ameliorate that. I just recently added tox -e<test env> -- --devpi support to Pex's build environment, and that can allow for truly offline operation (not currently: I don't plumb --offline to devpi-server, but I could, and in practice operation is ~offline with a warm cache). Devpi is a great tool for this, specialized to the job and usable in more contexts than just Pex. If you haven't taken a look, it might be worth your while.

jsirois avatar Aug 26 '23 19:08 jsirois

The pex3 cache {dir,info,purge} command now exists. Although pex3 cache purge only allows purging either the whole Pex cache or individual verticals, it does so safely with a lock in place that prevents in-flight PEXes from observing partial cache entries. Follow ups will now be able to address purging more detailed items like individual project dependencies and LRU style cutoffs when the filesystem supports atime, etc.

jsirois avatar Sep 04 '24 02:09 jsirois

I am requesting a pex3 cache add command so the pex cache can be hydrated with external artifacts like npm. This would enable building PEXes in an offline environment without too much extra labor.

@zmanji I think pex3 cache add/remove for adding or removing certain project distributions makes sense. That said, they should probably be added as a pair and remove requires dependency tracking since both unzipped_pexes and venvs caches can symlink into the wheel cache. I broke out #2528 for this.
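To show the kind of dependency tracking a safe remove would need, here is a small sketch that finds which wheel cache entries are still the target of symlinks from consumer directories. It is only a sketch under assumptions: `referenced_entries` is a hypothetical helper, and the layout (one top-level directory per wheel cache entry, consumers such as venvs linking into it) is simplified for the example rather than taken from Pex's actual cache structure.

```python
import os
from pathlib import Path
from typing import Iterable, Set


def referenced_entries(wheel_cache: Path, consumers: Iterable[Path]) -> Set[str]:
    """Return names of top-level wheel_cache entries that consumers link into."""
    wheel_cache = wheel_cache.resolve()
    referenced: Set[str] = set()
    for consumer in consumers:
        # os.walk does not descend into symlinked directories by default,
        # but it still lists them in dirnames, so both lists are checked.
        for dirpath, dirnames, filenames in os.walk(consumer):
            for name in dirnames + filenames:
                path = Path(dirpath) / name
                if not path.is_symlink():
                    continue
                target = path.resolve()
                try:
                    rel = target.relative_to(wheel_cache)
                except ValueError:
                    continue  # The link points somewhere outside the wheel cache.
                if rel.parts:
                    referenced.add(rel.parts[0])
    return referenced
```

Entries that are in the wheel cache but not in the returned set would be the only candidates a remove could delete without breaking an existing venv or unzipped PEX.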

That said, I'm adding a new resolver type for #1907 in #2512 that allows fully offline PEX creation using --pre-resolved-dists /my/wheelhouse. The wheelhouse can contain sdists, but, if so, it must also contain any wheels needed to build those sdists (setuptools, hatch, poetry, pdm, etc). This behaves exactly like --no-pypi --find-links /my/wheelhouse except that Pip is not used at all and Pex resolves requirements solely from the distributions already downloaded in the directory. The only caveat is that those distributions must form a resolve solution; i.e., they must be the result of a prior pip download -d /my/wheelhouse ... or pip wheel -w /my/wheelhouse ....

jsirois avatar Sep 13 '24 16:09 jsirois