List HTTP caches as well in `pip cache list`

Open jacekt opened this issue 4 years ago • 9 comments

Description

By default, pip stores downloads as HTTP responses in its HTTP cache instead of as whl files (at least on Windows). `pip cache list` doesn't display those files.

Expected behavior

`pip cache list` should display files from the HTTP cache (with proper *.whl names, not hashes).

pip version

pip 21.2.4 from c:\program files\python38\lib\site-packages\pip (python 3.8)

Python version

Python 3.8.6 (tags/v3.8.6:db45529, Sep 23 2020, 15:52:53) [MSC v.1927 64 bit (AMD64)]

OS

Win7 (64 bit)

How to Reproduce

> pip cache info
> pip cache list

Output

> pip cache info
Package index page cache location: c:\users\user\appdata\local\pip\cache\http
Package index page cache size: 731.3 MB
Number of HTTP files: 359
Wheels location: c:\users\user\appdata\local\pip\cache\wheels
Wheels size: 0 bytes
Number of wheels: 0

> pip cache list
Nothing cached.

jacekt avatar Sep 11 '21 08:09 jacekt

I'd like to work on this. It would be much appreciated if you could let me know of anything I should know before working on this (this will be my first contribution to pip).

itaisteinherz avatar Jan 30 '22 15:01 itaisteinherz

One important thing to know is that the HTTP cache is managed by a 3rd party library that we vendor, and the layout is not documented (it's an internal detail of the library). So it's not immediately clear how much useful information can be listed here. For example, I'm not sure we can identify how many cached responses correspond to "HTTP files".

Whoever works on this should probably do some investigation to understand (a) what information people want to get from such a report, and (b) is it possible to get that information from cachecontrol in a supported way.

pfmoore avatar Jan 30 '22 17:01 pfmoore

@pfmoore Thanks for replying!

Regarding (a) - @jacekt initially wrote that he would expect to see the wheel names cached in the form of HTTP responses, consistent with the format of cached wheel files. That makes sense to me as well.

Regarding (b) - I've been looking into this for a bit, and I was able to create an initial PoC that uses the Serializer class of cachecontrol to load the cached responses. I came to the conclusion that I could probably write a somewhat reliable working version of this (iterate over all cached HTTP response files, deserialize each one, load its body, check whether it's a wheel, and if so parse it to get its name). However, that would probably mean tying the cache command's implementation to cachecontrol's internals. Hence, the answer to the question seems to be: not really.

itaisteinherz avatar Jan 30 '22 18:01 itaisteinherz

@pfmoore wrote:

> One important thing to know is that the HTTP cache is managed by a 3rd party library that we vendor, and the layout is not documented (it's an internal detail of the library). So it's not immediately clear how much useful information can be listed here. For example, I'm not sure we can identify how many cached responses correspond to "HTTP files".
>
> Whoever works on this should probably do some investigation to understand (a) what information people want to get from such a report, and (b) is it possible to get that information from cachecontrol in a supported way.

Wow, I'm so excited to accidentally learn that the "3rd party library" pip uses for caching can actually do Redis caching.

I wonder what it would cost to add Redis caching options to pip. It would make some CI/CD scenarios SOOO MUCH simpler!

As for cachecontrol internals, one can easily find out that FileCache DOES NOT store the originally requested URLs in the cache:

https://github.com/pypa/pip/blob/07a360dfe8fcad8c34d7bb70c77362cc3ec8a374/src/pip/_vendor/cachecontrol/controller.py#L258-L276

While Serializer.dumps receives request ... https://github.com/pypa/pip/blob/main/src/pip/_vendor/cachecontrol/serialize.py#L28

... the only way it is used is to save a few request headers https://github.com/pypa/pip/blob/main/src/pip/_vendor/cachecontrol/serialize.py#L65

and the cache key (URL) is destroyed by hashing inside FileCache https://github.com/pypa/pip/blob/07a360dfe8fcad8c34d7bb70c77362cc3ec8a374/src/pip/_vendor/cachecontrol/caches/file_cache.py#L106-L111

So you can't recover the URL from an existing file-based cache, but you could try to hook the cache_set method and provide out-of-band storage for request metadata https://github.com/pypa/pip/blob/07a360dfe8fcad8c34d7bb70c77362cc3ec8a374/src/pip/_vendor/cachecontrol/controller.py#L265

OR one could try to build on RedisCache, as it does not destroy the original URLs when caching and uses them directly as keys, so I guess they can be enumerated with Redis's KEYS * command: https://github.com/pypa/pip/blob/07a360dfe8fcad8c34d7bb70c77362cc3ec8a374/src/pip/_vendor/cachecontrol/caches/redis_cache.py#L19-L26

OR hope that pypi.org always returns x-pypi-* headers (O_o !!!)
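To make the first option (out-of-band storage written whenever the cache is written) more concrete, here is a minimal sketch. It does not use cachecontrol at all: `DictCache` and `UrlRecordingCache` are hypothetical names, the get/set/delete interface is an assumption modelled loosely on a generic cache backend, and the JSON sidecar file is just one possible place to keep the mapping.

```python
import hashlib
import json
from pathlib import Path
from typing import Dict, Optional


class DictCache:
    """Hypothetical in-memory backend standing in for a real cache."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def set(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


class UrlRecordingCache:
    """Wrap a backend and remember which URL each hashed key came from."""

    def __init__(self, backend: DictCache, index_path: Path) -> None:
        self._backend = backend
        self._index_path = index_path
        # Load an existing sidecar index if there is one.
        if index_path.exists():
            self._index: Dict[str, str] = json.loads(index_path.read_text())
        else:
            self._index = {}

    @staticmethod
    def _digest(key: str) -> str:
        # Same hashing scheme as discussed in this thread (SHA-224 of the key).
        return hashlib.sha224(key.encode()).hexdigest()

    def _save(self) -> None:
        self._index_path.write_text(json.dumps(self._index, indent=2))

    def get(self, key: str) -> Optional[bytes]:
        return self._backend.get(key)

    def set(self, key: str, value: bytes) -> None:
        self._backend.set(key, value)
        self._index[self._digest(key)] = key  # out-of-band record of the URL
        self._save()

    def delete(self, key: str) -> None:
        self._backend.delete(key)
        self._index.pop(self._digest(key), None)
        self._save()
```

A listing command could then read the sidecar file instead of trying to reverse the hashes. Entries written before the wrapper was installed would still be anonymous.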

Example of response headers parsed from some FileCache contents:

{'Connection': 'keep-alive', 'Content-Length': '62843', 'Last-Modified': 'Wed, 29 Jun 2022 15:13:41 GMT', 'ETag': '"d18f682863389367f878339e288817f2"', 'x-goog-generation': '1656515621879725', 'x-goog-metageneration': '1', 'x-goog-stored-content-encoding': 'identity', 'x-goog-stored-content-length': '62843', 'Content-Type': 'application/octet-stream', 'x-goog-hash': 'md5=0Y9oKGM4k2f4eDOeKIgX8g==', 'Server': 'UploadServer', 'Cache-Control': 'max-age=365000000, immutable, public', 'Accept-Ranges': 'bytes', 'Date': 'Wed, 11 Jan 2023 17:00:42 GMT', 'Age': '5552080', 'X-Served-By': 'cache-bfi-krnt7300111-BFI, cache-hhn-etou8220055-HHN', 'X-Cache': 'HIT, HIT', 'X-Cache-Hits': '125336, 3029', 'X-Timer': 'S1673456442.418579,VS0,VE0', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'X-Frame-Options': 'deny', 'X-XSS-Protection': '1; mode=block', 'X-Content-Type-Options': 'nosniff', 'X-Robots-Header': 'noindex', 'Access-Control-Allow-Methods': 'GET, OPTIONS', 'Access-Control-Allow-Headers': 'Range', 'Access-Control-Allow-Origin': '*', 'x-pypi-file-python-version': 'py3', 'x-pypi-file-version': '2.28.1', 'x-pypi-file-package-type': 'bdist_wheel', 'x-pypi-file-project': 'requests'}

Here you go, this is the cached response for requests-2.28.1 :-) x-pypi-file-package-type: 'bdist_wheel' x-pypi-file-project: 'requests' x-pypi-file-version: '2.28.1'
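As a rough illustration of the header-based option, the sketch below turns an already-parsed header mapping into a short description. The function name `describe_cached_file` is made up for this example, and it assumes the x-pypi-file-* headers are present, which other indexes or mirrors may not guarantee.

```python
from typing import Mapping, Optional


def describe_cached_file(headers: Mapping[str, str]) -> Optional[str]:
    """Return 'project version (package type)' if x-pypi-file-* headers exist."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    project = lowered.get("x-pypi-file-project")
    version = lowered.get("x-pypi-file-version")
    if project is None or version is None:
        # Not a PyPI file response, or the index does not send these headers.
        return None
    package_type = lowered.get("x-pypi-file-package-type", "unknown")
    return f"{project} {version} ({package_type})"


# A subset of the headers quoted above.
headers = {
    "Content-Type": "application/octet-stream",
    "x-pypi-file-python-version": "py3",
    "x-pypi-file-version": "2.28.1",
    "x-pypi-file-package-type": "bdist_wheel",
    "x-pypi-file-project": "requests",
}
print(describe_cached_file(headers))  # requests 2.28.1 (bdist_wheel)
```

Note that these headers give the project and version but not the full wheel filename, so platform and ABI tags would have to come from somewhere else.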

mshonichev avatar Jan 11 '23 16:01 mshonichev

One option could be to externally create a map from sha224(url) to the url. I just purged my local cache, and then I installed click: https://files.pythonhosted.org/packages/c2/f1/df59e28c642d583f7dacffb1e0965d0e00b218e0186d7858ac5233dce840/click-8.1.3-py3-none-any.whl. If I take the SHA-224 of that url I get ce0b863dd6f9fe26fb2bdcdb08a17f0fb0c044bac2cc256d212517bd. I see this value in my local cache:

$ pwd
/home/stian/.cache/pip/http
$ find . -type f
./c/e/0/b/8/ce0b863dd6f9fe26fb2bdcdb08a17f0fb0c044bac2cc256d212517bd
./8/a/c/4/d/8ac4d14dc45e27d21da49fb515570b6f875b78707de9b08ce1088d1b

So if I were to iterate through all .whl urls at https://files.pythonhosted.org/packages, hash each of them, and then create this map, then I could look up things by hashes and be able to list all my cached (downloaded) wheels.
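The lookup-table idea can be sketched as follows. This assumes the cache key is exactly the download URL and that files are stored under the first five hex digits of the SHA-224 digest, as the `find` output above suggests. The helper names `cache_relpath` and `build_index` are made up for this example.

```python
import hashlib
from pathlib import PurePosixPath
from typing import Dict, Iterable


def cache_relpath(url: str) -> str:
    """Relative path a URL would map to, e.g. 'c/e/0/b/8/ce0b…'."""
    digest = hashlib.sha224(url.encode()).hexdigest()
    # One directory level per leading hex digit, then the full digest as filename.
    return str(PurePosixPath(*digest[:5], digest))


def build_index(urls: Iterable[str]) -> Dict[str, str]:
    """Map each expected cache path back to the URL that produces it."""
    return {cache_relpath(url): url for url in urls}


url = (
    "https://files.pythonhosted.org/packages/c2/f1/"
    "df59e28c642d583f7dacffb1e0965d0e00b218e0186d7858ac5233dce840/"
    "click-8.1.3-py3-none-any.whl"
)
index = build_index([url])
for relpath, source in index.items():
    print(relpath, "<-", source.rsplit("/", 1)[-1])
```

If the assumptions hold, the printed path should match the `./c/e/0/b/8/...` entry in the listing above. The expensive part is not the hashing but obtaining the list of candidate URLs in the first place.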


Just to absolutely confirm that my local ./c/e/0/b/8/ce0b863dd6f9fe26fb2bdcdb08a17f0fb0c044bac2cc256d212517bd was the wheel, I unzipped it:

$ unzip ./c/e/0/b/8/ce0b863dd6f9fe26fb2bdcdb08a17f0fb0c044bac2cc256d212517bd -d ~/
Archive:  ./c/e/0/b/8/ce0b863dd6f9fe26fb2bdcdb08a17f0fb0c044bac2cc256d212517bd
warning [./c/e/0/b/8/ce0b863dd6f9fe26fb2bdcdb08a17f0fb0c044bac2cc256d212517bd]:  26 extra bytes at beginning or within zipfile
  (attempting to process anyway)
  inflating: /home/stian/click/__init__.py
  inflating: /home/stian/click/_compat.py
  inflating: /home/stian/click/_termui_impl.py
  inflating: /home/stian/click/_textwrap.py
  inflating: /home/stian/click/_winconsole.py
  inflating: /home/stian/click/core.py
  inflating: /home/stian/click/decorators.py
  inflating: /home/stian/click/exceptions.py
  inflating: /home/stian/click/formatting.py
  inflating: /home/stian/click/globals.py
  inflating: /home/stian/click/parser.py
  inflating: /home/stian/click/py.typed
  inflating: /home/stian/click/shell_completion.py
  inflating: /home/stian/click/termui.py
  inflating: /home/stian/click/testing.py
  inflating: /home/stian/click/types.py
  inflating: /home/stian/click/utils.py
  inflating: /home/stian/click-8.1.3.dist-info/LICENSE.rst
  inflating: /home/stian/click-8.1.3.dist-info/METADATA
  inflating: /home/stian/click-8.1.3.dist-info/WHEEL
  inflating: /home/stian/click-8.1.3.dist-info/top_level.txt
  inflating: /home/stian/click-8.1.3.dist-info/RECORD
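The manual `unzip` check above could be automated along these lines. This is only a sketch: it assumes each cached entry embeds the wheel bytes intact (with some extra bytes around them, as the `unzip` warning indicates), it relies on Python's `zipfile` tolerating such prepended data, and `find_cached_wheels` is a made-up name. It recovers only the `name-version` part from the `.dist-info` directory, not the full wheel filename.

```python
import zipfile
from pathlib import Path
from typing import Iterator, Tuple


def find_cached_wheels(cache_dir: Path) -> Iterator[Tuple[Path, str]]:
    """Yield (cache file, 'name-version') for entries that look like wheels."""
    for path in sorted(cache_dir.rglob("*")):
        if not path.is_file():
            continue
        try:
            with zipfile.ZipFile(path) as archive:
                names = archive.namelist()
        except (zipfile.BadZipFile, OSError):
            continue  # Not a zip archive (e.g. a cached index page).
        for name in names:
            top, _, rest = name.partition("/")
            # A wheel contains '<name>-<version>.dist-info/WHEEL'.
            if top.endswith(".dist-info") and rest == "WHEEL":
                yield path, top[: -len(".dist-info")]
                break


if __name__ == "__main__":
    # Assumed default location on Linux; adjust for other platforms.
    for cache_file, dist in find_cached_wheels(Path.home() / ".cache/pip/http"):
        print(dist, cache_file)
```

Opening every cache file is slow on a large cache, and the approach depends on the undocumented on-disk layout, so it is more of a diagnostic aid than a basis for `pip cache list`.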

stianlagstad avatar Jan 24 '23 12:01 stianlagstad

So what is stopping this feature from being implemented? The discussion seems to make it clear that the bug can be fixed, or the feature implemented.

a-sajjad72 avatar Jun 10 '24 20:06 a-sajjad72

@a-sajjad72 Someone to do the work, basically. If you're interested in creating a PR for this, you're welcome to do so.

pfmoore avatar Jun 10 '24 21:06 pfmoore

I have never understood: why cache HTTP data in the first place, rather than the actual downloaded data?

Anyway, here is a brief demonstration of some more consequences:

#!/bin/bash
pip cache purge -q
python -m venv test
source test/bin/activate
pip install package-installation-test -q
echo 'after initial install, the package is cached:'
pip uninstall -y -q package-installation-test
pip install package-installation-test | grep cached
echo 'but the cached package is not shown in the cache list:'
pip cache list
echo 'and cannot be explicitly removed:'
pip cache remove package-installation-test
echo 'installing it requires an internet connection:'
nmcli networking off
pip uninstall -y -q package-installation-test
pip install package-installation-test 2>&1 | grep ERROR
nmcli networking on
sleep 5 # wait for Internet connection to be re-established.
echo 'purging cache does purge it:'
pip cache purge -q
pip install package-installation-test | grep cached # n.b. no output
deactivate
rm -r test/

zahlman avatar Mar 27 '25 03:03 zahlman

It's easier, @zahlman. We delegate the caching to an HTTP library instead of writing our own bespoke caching layer specific to source distributions and wheels. Could we change that? Sure, but it's not something that we're looking into given the additional complexity.

ichard26 avatar Mar 31 '25 02:03 ichard26