
cache metadata lookups for sdists and lazy wheels

Open cosmicexplorer opened this issue 2 years ago • 18 comments

This PR is on top of #12186; see the +392/-154 diff against it at https://github.com/cosmicexplorer/pip/compare/metadata_only_resolve_no_whl...cosmicexplorer:pip:link-metadata-cache?expand=1.

  • Fixes #11165.

Background: Pip Bandwidth Usage Improvements

In 2016, @dstufft tweeted:

In the last 24 hours PyPI has served 120 million requests and used 13.7 TB of bandwidth.

Today, @sethmlarson tweeted:

urllib3 recently reached 10 billion total downloads!! 🎉🔟🥳

Since PEP 658 from @uranusjr was implemented via #11111 and pypi/warehouse#13649, pip has been making use of shallow metadata files to resolve against, meaning that instead of pulling down several very large wheels over the course of a single pip install invocation, only to throw away most of them, we now only download and prepare a single version of each requirement at the conclusion of each pip subcommand. With #12186, we can even avoid downloading any dists at all if we only want to generate an install report with pip install --report --dry-run. For find-links or other non-pypi indices which haven't implemented PEP 658, #12208 improves the fast-deps feature to enable metadata-only resolves against those too.

Proposal: Caching Metadata to Reduce Number of Requests

The above is all great, but despite reducing bandwidth usage, we haven't yet reduced the total number of requests. As described in #12184, I believe we can safely cache metadata lookups to both drastically improve pip's runtime and significantly reduce the number of requests against pypi.

This caching is made safe by extending @sbidoul's prior work (see e.g. #7333, #7296) to ensure pip._internal.cache.Cache incorporates all the information that may change the compatibility of a downloaded/prepared distribution resolved from a Link.

Result: 6.5x Speedup, Shaving Off 30 Seconds of Runtime

This change produces a 6.5x speedup against the example tested below, reducing the runtime of this pip install --report command from over 36 seconds down to just 5.7 seconds:

# this branch is #12186, which this PR is based off of
> git checkout metadata_only_resolve_no_whl
> python3.8 -m pip install --dry-run --ignore-installed --report test.json --use-feature=fast-deps 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3'
...
real    0m36.410s
user    0m15.706s
sys     0m13.377s
# switch to this PR
> git checkout link-metadata-cache
# enable --use-feature=metadata-cache
> python3.8 -m pip install --use-feature=metadata-cache --dry-run --ignore-installed --report test.json --use-feature=fast-deps 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3'
...
real    0m5.671s
user    0m4.429s
sys     0m0.123s

But possibly more importantly, it also drastically reduces the number of requests made against pypi (this is reflected in the much lower sys time in the second command output above), without introducing significant disk space usage (only ~60KB, because we only cache metadata for each dist, and we compress the serialized cache entries):

> du -b -s ~/.cache/pip/link-metadata/
60238   /home/cosmicexplorer/.cache/pip/link-metadata/
> find ~/.cache/pip/link-metadata/ -type f | parallel bzcat | jq
...
{
  "metadata": "Metadata-Version: 2.1\nName: h5py\nVersion: 3.1.0\nSummary: Read and write HDF5 files from Python\nHome-page: http://www.h5py.org\nAuthor: Andrew Collette\nAuthor-email: [email protected]\nMaintainer: Andrew Collette\nMaintainer-email: [email protected]\nLicense: BSD\nDownload-URL: https://pypi.python.org/pypi/h5py\nPlatform: UNKNOWN\nClassifier: Development Status :: 5 - Production/Stable\nClassifier: Intended Audience :: Developers\nClassifier: Intended Audience :: Information Technology\nClassifier: Intended Audience :: Science/Research\nClassifier: License :: OSI Approved :: BSD License\nClassifier: Programming Language :: Cython\nClassifier: Programming Language :: Python\nClassifier: Programming Language :: Python :: 3\nClassifier: Programming Language :: Python :: 3.6\nClassifier: Programming Language :: Python :: 3.7\nClassifier: Programming Language :: Python :: Implementation :: CPython\nClassifier: Topic :: Scientific/Engineering\nClassifier: Topic :: Database\nClassifier: Topic :: Software Development :: Libraries :: Python Modules\nClassifier: Operating System :: Unix\nClassifier: Operating System :: POSIX :: Linux\nClassifier: Operating System :: MacOS :: MacOS X\nClassifier: Operating System :: Microsoft :: Windows\nRequires-Python: >=3.6\nRequires-Dist: cached-property ; python_version < \"3.8\"\nRequires-Dist: numpy (>=1.12) ; python_version == \"3.6\"\nRequires-Dist: numpy (>=1.14.5) ; python_version == \"3.7\"\nRequires-Dist: numpy (>=1.17.5) ; python_version == \"3.8\"\nRequires-Dist: numpy (>=1.19.3) ; python_version >= \"3.9\"\n\n\nThe h5py package provides both a high- and low-level interface to the HDF5\nlibrary from Python. 
The low-level interface is intended to be a complete\nwrapping of the HDF5 API, while the high-level component supports  access to\nHDF5 files, datasets and groups using established Python and NumPy concepts.\n\nA strong emphasis on automatic conversion between Python (Numpy) datatypes and\ndata structures and their HDF5 equivalents vastly simplifies the process of\nreading and writing data from Python.\n\nSupports HDF5 versions 1.8.4 and higher.  On Windows, HDF5 is included with\nthe installer.\n\n\n",
  "filename": "h5py-3.1.0-cp38-cp38-manylinux1_x86_64.whl",
  "canonical_name": "h5py"
}
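The entries dumped above are bzip2-compressed JSON blobs (hence the `bzcat | jq` pipeline). A minimal standalone sketch of that serialization scheme, using only the three field names visible in the dump (the function names are hypothetical, not pip's):

```python
import bz2
import json


def serialize_entry(metadata: str, filename: str, canonical_name: str) -> bytes:
    """Build a metadata cache entry: a JSON object with the three fields
    observed in the dump above, bzip2-compressed to keep the cache small."""
    entry = {
        "metadata": metadata,
        "filename": filename,
        "canonical_name": canonical_name,
    }
    return bz2.compress(json.dumps(entry).encode("utf-8"))


def deserialize_entry(blob: bytes) -> dict:
    """Invert serialize_entry (the programmatic equivalent of `bzcat | jq`)."""
    return json.loads(bz2.decompress(blob).decode("utf-8"))
```

Because only the METADATA text is stored (never the dist contents), entries stay in the kilobyte range even for packages whose wheels are hundreds of megabytes.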

TODO

  • [ ] Merge #12186.

cosmicexplorer avatar Sep 03 '23 05:09 cosmicexplorer

@ewdurbin: regarding https://github.com/pypa/pip/pull/12257#issuecomment-1704241741, I don't think I'd tagged you in this change that will actually reduce requests against files.pythonhosted.org as well!

cosmicexplorer avatar Sep 03 '23 12:09 cosmicexplorer

@cosmicexplorer unfortunately I don't have enough time available at the moment to delve into your PRs which all sound very interesting.

This one caught my attention, though, as it is something I had also contemplated doing previously. So I gave it a quick try. And I have questions :)

  • When running pip install --cache-dir=/tmp/pipcache --no-binary=:all: --use-feature=metadata-cache git+https://github.com/pypa/pip-test-package it creates a link-metadata cache entry. Actually, it should not, since this link is not cacheable: the head of the branch can change. I tend to think a metadata caching mechanism should somehow re-use the same _should_cache() logic we have to decide whether locally built wheels should be cached or not.
  • When running pip install --cache-dir=/tmp/pipcache --use-feature=metadata-cache lxml which downloads and installs a wheel, it creates a link-metadata cache entry. Is that intended? Is it not sufficient to rely on the http cache to obtain the .metadata from the index, or, if not available, from the wheel?
  • When running pip install --cache-dir=/tmp/pipcache --use-feature=metadata-cache --no-binary=:all: lxml which downloads and builds an sdist, it also creates a metadata cache entry. One potential issue is that the cache entry file name seems to be created from a hash of the link only (IIUC). However, the metadata can be different depending on the platform on which it was generated. My intuition is that the compatibility tags we have in the wheels could play a role here, although I don't know how to obtain them without doing a full wheel build.

sbidoul avatar Sep 03 '23 13:09 sbidoul

@sbidoul: thank you so much for your thoughtful feedback!! ^_^

When running pip install --cache-dir=/tmp/pipcache --no-binary=:all: --use-feature=metadata-cache git+https://github.com/pypa/pip-test-package it creates a link-metadata cache entry

This is fixed in https://github.com/pypa/pip/pull/12256/commits/55f185aaaf82d3211e8aa71cf5cf99cf6ed64943, which avoids creating cache entries for git or file urls.

Is it not sufficient to rely on the http cache to obtain the .metadata from the index, or, if not available, from the wheel?

I am looking into this now. I agree that the http cache should retain the .metadata from the index, so I will make sure that is working. I also agree that it should be able to pull from the cached wheel, if available.

One potential issue is that the cache entry file name seems to be created from a hash of the link only (IIUC). However, the metadata can be different depending on the platform on which it has been generated. My intuition is that the compatibility tags we have in the wheels could play a role here, although I don't know how to obtain them without doing a full wheel build.

So you're absolutely right about this, but I was under the impression that I had solved this by directly making use of your work in pip._internal.cache.Cache._get_cache_path_parts() from ab0593659df4e51748636e2e0068224f83a9b1e6 to solve #7296, which incorporates the current interpreter name and version. As you said:

I tend to think a metadata caching mechanism should somehow re-use the same _should_cache() logic we have to decide whether locally built wheels should be cached or not.

Please, there's no rush to respond; I really appreciate your feedback.
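The key idea referenced above (that the cache path must incorporate anything that can change a prepared dist's compatibility, including the interpreter name and version) can be illustrated with a standalone sketch. This is not pip's exact `_get_cache_path_parts()` implementation; the key fields and sharding are illustrative:

```python
import hashlib


def cache_path_parts(link_url: str, interpreter_name: str, interpreter_version: str) -> list:
    """Hypothetical simplification of a compatibility-aware cache key:
    fold the interpreter name and version into the hashed key, so metadata
    generated under one interpreter never collides with another's."""
    key_parts = {
        "url": link_url,
        "interpreter_name": interpreter_name,
        "interpreter_version": interpreter_version,
    }
    # Serialize the key fields in a stable order before hashing, so the
    # same inputs always map to the same cache location.
    key = "#".join(f"{k}={v}" for k, v in sorted(key_parts.items()))
    hashed = hashlib.sha224(key.encode("utf-8")).hexdigest()
    # Shard the digest into nested directories so no single cache
    # directory grows too large.
    return [hashed[:2], hashed[2:4], hashed[4:6], hashed[6:]]
```

Any change to the environment key (e.g. a new interpreter version) lands in a different cache file, which is what makes caching sdist-built metadata safe across environments.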

cosmicexplorer avatar Sep 03 '23 13:09 cosmicexplorer

Is it not sufficient to rely on the http cache to obtain the .metadata from the index, or, if not available, from the wheel?

After a lengthy investigation, I've discovered that the http cache from CacheControlAdapter will still make GET requests against files.pythonhosted.org, even when the file is already in the http cache. The response from the server for such requests is ignored. So while debug output with -vvv shows e.g. The response is "fresh", returning cached response, we will still make a request against pypi for each .metadata or .whl. It is nontrivial to work around this.

In #12257 we were able to make some GET requests much faster by adding HTTP caching headers to make pypi return an empty 304 Not Modified, but there are two issues with applying that here:

  1. files.pythonhosted.org doesn't return a 0-length 304 Not Modified, it always returns a full-length 200 OK.
  2. unlike #12257, we are dealing with immutable links. We don't need to check whether the result has changed.

I think this PR is the cleanest way to avoid making those unused GET requests against files.pythonhosted.org by making use of the existing "metadata-only fetch" concept, without having to mess around with our CacheControlAdapter. What do you think @sbidoul?

cosmicexplorer avatar Sep 03 '23 22:09 cosmicexplorer

After a lengthy investigation, I've discovered that the http cache from CacheControlAdapter will still make GET requests against files.pythonhosted.org, even when the file is already in the http cache.

Is this by design, or is it a bug? If the latter, I think we'd be better off just getting the bug fixed…

pfmoore avatar Sep 04 '23 07:09 pfmoore

I think this PR is the cleanest way to avoid making those unused GET requests against files.pythonhosted.org by making use of the existing "metadata-only fetch" concept, without having to mess around with our CacheControlAdapter.

I perceive several aspects to this metadata caching topic:

  1. Caching metadata we computed locally is something I consider useful (#11165), because there are several scenarios where pip downloads an sdist or a VCS repo to compute the metadata and then discards it, if a wheel is not built for any reason (already installed, rejected version, ...). The decision to cache the metadata or not, and the way to structure the cache, should likely be as similar as possible to the cache of locally built wheels, for least-surprise behaviour.

  2. Caching .metadata files from indexes. I can't say I fully understand the subtleties of the interactions between the pip http cache and PyPI, but my intuition says these should be cached at the http level (like wheels), and that any potential improvements should be investigated in detail in the CacheControl area first, before leaking that caching complexity elsewhere in the code base.

  3. Caching metadata obtained from downloaded or cached wheels. For this I'd say we should investigate if there is a tangible performance benefit, because I'm under the impression that getting the metadata from inside the locally cached wheel is fast enough.

What do you think @sbidoul?

Easier said than done, I know :) And caching is hard. But I'd suggest addressing each aspect independently, as these are very different and independent problems.

sbidoul avatar Sep 04 '23 11:09 sbidoul

  3. Caching metadata obtained from downloaded or cached wheels. For this I'd say we should investigate if there is a tangible performance benefit, because I'm under the impression that getting the metadata from inside the locally cached wheel is fast enough.

The issue I was trying to explain earlier is that currently, we only enter built wheels into the wheel cache, not downloaded wheels. While wheels in the wheel cache are fast to extract metadata from, we would have to enter every downloaded wheel into the wheel cache as well in order to avoid making an additional HTTP request. That would mean we end up doubling the disk space used to cache each downloaded wheel.

  2. Caching .metadata files from indexes. I can't say I fully understand the subtleties of the interactions between the pip http cache and PyPI, but my intuition says these should be cached at the http level (like wheels), and that any potential improvements should be investigated in detail in the CacheControl area first, before leaking that caching complexity elsewhere in the code base.

Ok, I will look into adding a separate code path that checks the contents of our CacheControlAdapter and avoids re-sending an HTTP request when the target URL points to a .whl or .metadata file (something that won't change). Currently, this is the major reason this PR improves performance so drastically.
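The check described here can be sketched as a small predicate (the helper name is hypothetical): files.pythonhosted.org URLs for release artifacts embed the uploaded file's contents, so once a response for one is cached, it never needs revalidation against the server.

```python
from urllib.parse import urlparse

# Release files and their PEP 658 metadata sidecars never change after
# upload, so cached responses for them can be served without revalidation.
IMMUTABLE_SUFFIXES = (".whl", ".metadata")


def is_immutable_artifact_url(url: str) -> bool:
    """Return True if a cached response for this URL can be served
    without re-sending an HTTP request (hypothetical helper sketch)."""
    path = urlparse(url).path
    return path.endswith(IMMUTABLE_SUFFIXES)
```

A caching layer could consult this before invoking the usual freshness logic: if the URL is immutable and a cached body exists, skip the network entirely.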

  1. Caching metadata we computed locally is something I consider useful (https://github.com/pypa/pip/issues/11165), because there are several scenarios where pip downloads an sdist or a VCS repo to compute the metadata and then discards it, if a wheel is not built for any reason (already installed, rejected version, ...). The decision to cache the metadata or not, and the way to structure the cache, should likely be as similar as possible to the cache of locally built wheels, for least-surprise behaviour.

This PR implements this, and my argument is that it happens to be significantly less code to simply extend this to cases 2 and 3 that you mentioned above.

I would appreciate it if reviewers would help to merge #12186, as this PR is on top of that one. In the meantime I will try modifying our CacheControlAdapter to return cached HTTP responses in order to compare the complexity of the two approaches. I strongly prefer having exactly one layer of caching that pip understands and controls, and letting CacheControl manage the HTTP cache as an implementation detail, instead of having multiple separate ways or places that a Link's metadata might or might not be cached.

cosmicexplorer avatar Sep 07 '23 11:09 cosmicexplorer

Ok, I apologize, but I was mistaken earlier regarding the behavior of our SafeFileCache. I assumed that it must have been making network requests to be that slow, but in fact, our usage of CacheControl is working correctly, and when cached, we do not make any further network requests.

Cached Requests: .metadata

It turns out that .metadata files are quickly loaded from cache already. The below output demonstrates fetching a cached .metadata file for zipp and processing the metadata in less than 30 milliseconds:

> python3.8 -m pip -vvv install --dry-run --ignore-installed --report test.json 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3' 2>&1 \
    | ts -m '%H:%M:%.S'
# ...
13:50:58.449968 Given no hashes to check 80 links for project 'zipp': discarding no candidates
13:50:58.457427 Collecting zipp>=0.5 (from importlib-metadata>=4.4->markdown>=2.6.8->tensorboard~=2.5->tensorflow-gpu==2.5.3)
13:50:58.457609   Obtaining dependency information for zipp>=0.5 from https://files.pythonhosted.org/packages/8c/08/d3006317aefe25ea79d3b76c9650afabaf6d63d1c8443b236e7405447503/zipp-3.16.2-py3-none-any.whl.metadata
13:50:58.457866   Created temporary directory: /tmp/pip-unpack-bxqw_e6x
13:50:58.459034   Looking up "https://files.pythonhosted.org/packages/8c/08/d3006317aefe25ea79d3b76c9650afabaf6d63d1c8443b236e7405447503/zipp-3.16.2-py3-none-any.whl.metadata" in the cache
13:50:58.459579   Current age based on date: 23184
13:50:58.459719   Ignoring unknown cache-control directive: immutable
13:50:58.459846   Freshness lifetime from max-age: 365000000
13:50:58.459973   The response is "fresh", returning cached response
13:50:58.460118   365000000 > 23184
13:50:58.460403   Using cached zipp-3.16.2-py3-none-any.whl.metadata (3.7 kB)
13:50:58.476375 1 location(s) to search for versions of pyasn1:
# ...
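The freshness decision visible in this log ("Current age based on date", "Freshness lifetime from max-age", "365000000 > 23184") reduces to comparing the cached response's age against its max-age. A simplified standalone sketch, not CacheControl's actual implementation:

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_fresh(date_header: str, max_age: int, now: datetime = None) -> bool:
    """Mirror the `365000000 > 23184` check in the log above: a cached
    response is fresh while its age (now minus the Date header) is
    still under the max-age from Cache-Control."""
    if now is None:
        now = datetime.now(timezone.utc)
    current_age = (now - parsedate_to_datetime(date_header)).total_seconds()
    return max_age > current_age
```

With PyPI's effectively unlimited max-age on release files, a cached response is always "fresh", so the slow part is not revalidation but deserializing the cached body, as the next section shows.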

Cached Requests: Large Wheels

The issue is instead that: when a large .whl file is retrieved from the HTTP cache in pip._vendor.cachecontrol.controller.CacheController#cached_request(), deserializing the cached response: https://github.com/pypa/pip/blob/3e232ce9ab89e29cf5e6ebcb616fe6ef6d85946d/src/pip/_vendor/cachecontrol/controller.py#L155 takes almost a full second, and "downloading" the file takes another second, producing the following timestamped debug output:

> python3.8 -m pip -vvv install --dry-run --ignore-installed --report test.json 'numpy>=1.19.5' 'keras==2.4.3' 'mtcnn' 'pillow>=7.0.0' 'bleach>=2.1.0' 'tensorflow-gpu==2.5.3' 2>&1 \
    | ts -m '%H:%M:%.S'
# ...
13:50:45.355115 Given no hashes to check 1 links for project 'tensorflow-gpu': discarding no candidates
13:50:45.361000 Collecting tensorflow-gpu==2.5.3
13:50:45.361295   Created temporary directory: /tmp/pip-unpack-f0uss145
13:50:45.362444   Looking up "https://files.pythonhosted.org/packages/80/4d/3a008dc31225768318e7ba0f7f95aa4677b0936805be40e37036b7755d62/tensorflow_gpu-2.5.3-cp38-cp38-manylinux2010_x86_64.whl" in the cache
13:50:46.085530   Current age based on date: 23233
13:50:46.085740   Ignoring unknown cache-control directive: immutable
13:50:46.085941   Freshness lifetime from max-age: 365000000
13:50:46.086098   The response is "fresh", returning cached response
13:50:46.086246   365000000 > 23233
13:50:46.098546   Using cached tensorflow_gpu-2.5.3-cp38-cp38-manylinux2010_x86_64.whl (460.4 MB)
13:50:46.982757 Given no hashes to check 65 links for project 'numpy': discarding no candidates
# ...

Extracting the metadata from the cached tensorflow-gpu wheel covers the timestamps 13:50:45.35 to 13:50:47, or about 1.7 seconds. It also downloads the (very large) wheel contents into a temporary directory, which then has to be deleted when pip exits.

So while the .metadata files are already processed very quickly, we are quite wasteful regarding cached wheel files without a PEP 658 .metadata entry: if the first tensorflow-gpu wheel had not matched our tags, we would have ended up copying out multiple gigabyte-sized wheels into temporary dirs during the pip command, which would all get deleted at the end of the command.
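For contrast, a wheel is just a zip archive, so its METADATA file can be read directly without unpacking the potentially huge package contents into a temporary directory. A minimal stdlib-only sketch of that cheaper path:

```python
import zipfile


def read_wheel_metadata(wheel_path: str) -> str:
    """Read only the *.dist-info/METADATA member out of a wheel (a zip
    archive), instead of extracting the entire archive to disk."""
    with zipfile.ZipFile(wheel_path) as whl:
        for name in whl.namelist():
            if name.endswith(".dist-info/METADATA"):
                return whl.read(name).decode("utf-8")
    raise ValueError(f"no METADATA member found in {wheel_path}")
```

Reading a few kilobytes out of the zip central directory is fast regardless of the wheel's total size, which is the same observation that motivates the lazy-wheel approach for remote wheels.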

cosmicexplorer avatar Sep 07 '23 18:09 cosmicexplorer

So to address @sbidoul's comment https://github.com/pypa/pip/pull/12256#issuecomment-1705097388 after this research:

  1. Caching metadata we computed locally (#11165): I will modify this change to only cache metadata from these dists.
  2. Caching .metadata files from indexes: This works already (see above comment).
  3. Caching metadata obtained from downloaded or cached wheels: This will be moved into a separate PR.
    • This is valuable for the same reasons as #12208: to cover wheels without PEP 658 metadata, which include large ones like tensorflow-gpu.

@sbidoul @uranusjr @pfmoore: Does this approach seem appropriate?

cosmicexplorer avatar Sep 07 '23 18:09 cosmicexplorer

(I still think having the single LinkMetadataCache which is used uniformly for all types of Links will be much easier to debug and maintain than having this complex logic that only caches specific types of Links.)

cosmicexplorer avatar Sep 07 '23 18:09 cosmicexplorer

The above sounds reasonable to me. I also wonder if it’s possible to cache the (new) fast-dep responses; that would be an alternative solution to the large-wheel problem.

uranusjr avatar Sep 12 '23 07:09 uranusjr

I also wonder if it’s possible to cache the (new) fast-dep responses; that would be an alternative solution to the large-wheel problem.

I totally hadn't considered this: you've just figured out the perfect compromise! It is absolutely correct that the only dists which can't be quickly fetched from the HTTP cache are large wheels without PEP 658 metadata, which is also precisely the use case for fast-deps/LazyWheelOverHTTP from #12208!

@sbidoul: I will modify this PR so that we only populate cache entries in LinkMetadataCache for:

  1. cacheable sdists (i.e. sdists which return True for the existing should_cache() method).
  2. dists generated from lazy wheel metadata (i.e. wheels without PEP 658 metadata)
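That two-case policy can be condensed into a small predicate. This is a standalone sketch: the function name and boolean inputs are hypothetical stand-ins for pip's Link objects and the existing should_cache() check, not the PR's actual code.

```python
def should_cache_link_metadata(
    is_wheel: bool,
    has_pep658_metadata: bool,
    should_cache: bool,
) -> bool:
    """Cache metadata only for:
    1. cacheable sdists (should_cache True, mirroring the wheel cache), and
    2. lazy-wheel dists (wheels lacking PEP 658 metadata);
    everything else is already served efficiently by the HTTP cache."""
    if is_wheel:
        # PEP 658 .metadata responses are cheap to re-fetch from the
        # CacheControl HTTP cache, so only lazy wheels need this cache.
        return not has_pep658_metadata
    # sdists: reuse the same cacheability decision as locally built wheels.
    return should_cache
```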

cosmicexplorer avatar Sep 14 '23 16:09 cosmicexplorer

Ok, the above has been implemented: we now only cache metadata for cacheable sdists and lazy wheel dists, and otherwise rely on the CacheControl HTTP cache for PEP 658 metadata. I also created a new exception class so any errors with this new caching are made extremely obvious: https://github.com/pypa/pip/blob/227d8e8dd2f103728519b3aa779a0020690f83f2/src/pip/_internal/exceptions.py#L256-L257

cosmicexplorer avatar Sep 14 '23 22:09 cosmicexplorer

@sbidoul: before I ask you to review this in depth, does @uranusjr's proposal (which I've implemented here) to use this metadata cache for both lazy wheels and cacheable sdists (sdists which return True from should_cache()) seem like an appropriate compromise? Otherwise the PEP 658 metadata is indeed already efficiently retrieved from the CacheControl HTTP cache, just as you suspected.

cosmicexplorer avatar Sep 18 '23 01:09 cosmicexplorer

@cosmicexplorer I still have some open questions after a bit of further thinking.

My first one is if there is still plan for PyPI to backfill PEP 658 metadata?

Assuming that is the case, will fast-deps be something that we still need? I've not followed all the discussions about it, but I kind of remember talks about dropping it. Was it because PEP 658 makes it unnecessary?

Assuming that we conclude that fast-deps is going to be abandoned, then there are only ~3~ 2 cases left:

  • local metadata preparations for source trees and sdists: a metadata cache is useful for that, similar to the cache of locally built wheels
  • .metadata cache: the HTTP cache is adequate as you concluded
  • ~metadata from downloaded wheels: I wonder if optimizations to the http cache such as https://github.com/pypa/pip/pull/11143 would help~

[edit] if we have .metadata for all wheels in the index, item 3 does not apply anymore

sbidoul avatar Sep 20 '23 20:09 sbidoul

My first one is if there is still plan for PyPI to backfill PEP 658 metadata?

I believe so yes.

dstufft avatar Sep 21 '23 02:09 dstufft

@cosmicexplorer I still have some open questions after a bit of further thinking.

My first one is if there is still plan for PyPI to backfill PEP 658 metadata?

Backfilling is being tracked at https://github.com/pypi/warehouse/issues/8254.

Assuming that is the case, will fast-deps be something that we still need? I've not followed all the discussions about it, but I kind of remember talks about dropping it. Was it because PEP 658 makes it unnecessary?

@cosmicexplorer I don't suppose you saw this question / have a chance to take a look at it? :-)

edmorley avatar Dec 17 '23 12:12 edmorley

@edmorley @sbidoul:

Assuming that is the case, will fast-deps be something that we still need? I've not followed all the discussions about it, but I kind of remember talks about dropping it. Was it because PEP 658 makes it unnecessary?

Please see #12208, which lays out the case for the continued use of the fast-deps feature (and also dramatically improves the implementation). Note that @radoering has also recently reviewed that code and even adapted it for poetry already (see his many very helpful comments in that PR), which demonstrates another vote of confidence for the fast-deps feature as perfected in #12208. The fast-deps implementation complexity is strictly limited to lazy_wheel.py with extremely thorough documentation in order to minimize the maintenance cost of this functionality. Also note that pypi itself does not need to do anything at all to support fast-deps, as this implementation is extremely robust and handles a variety of behaviors.

In a followup PR, I plan to simply remove the fast-deps feature flag and turn the behavior on by default, as one among multiple metadata-fetching methods including PEP 658 and cached sdists.

cosmicexplorer avatar Jan 11 '24 06:01 cosmicexplorer