List HTTP caches as well in `pip cache list`
Description
By default, pip stores HTTP responses in its cache rather than .whl files (at least on Windows). `pip cache list` doesn't display those cached responses.
Expected behavior
`pip cache list` should display files from the HTTP cache (with proper *.whl names, not hashes).
pip version
pip 21.2.4 from c:\program files\python38\lib\site-packages\pip (python 3.8)
Python version
Python 3.8.6 (tags/v3.8.6:db45529, Sep 23 2020, 15:52:53) [MSC v.1927 64 bit (AMD64)]
OS
Win7 (64 bit)
How to Reproduce
> pip cache info
> pip cache list
Output
> pip cache info
Package index page cache location: c:\users\user\appdata\local\pip\cache\http
Package index page cache size: 731.3 MB
Number of HTTP files: 359
Wheels location: c:\users\user\appdata\local\pip\cache\wheels
Wheels size: 0 bytes
Number of wheels: 0
> pip cache list
Nothing cached.
Code of Conduct
- [X] I agree to follow the PSF Code of Conduct.
I'd like to work on this. It would be much appreciated if you could let me know of anything I should know before working on this (this will be my first contribution to pip).
One important thing to know is that the HTTP cache is managed by a 3rd party library that we vendor, and the layout is not documented (it's an internal detail of the library). So it's not immediately clear how much useful information can be listed here. For example, I'm not sure we can identify how many cached responses correspond to "HTTP files".
Whoever works on this should probably do some investigation to understand (a) what information people want to get from such a report, and (b) is it possible to get that information from cachecontrol in a supported way.
@pfmoore Thanks for replying!
Regarding (a) - @jacekt initially wrote that he would expect to see the wheel names cached in the form of HTTP responses, consistent with the format of cached wheel files. That makes sense to me as well.
Regarding (b) - I've been looking into this for a bit, and I was able to create an initial PoC that uses the Serializer class of cachecontrol to load the cached responses. I came to the conclusion that I could probably write a somewhat reliable working version of this (iterate over all cached HTTP response files, deserialize each, load its body, check whether it's a wheel, and if so parse it to get its name). However, that would probably mean coupling the cache command's implementation to cachecontrol's internals.
Hence, it seems like the answer to the question would be - not really.
@pfmoore Wow, I'm so excited to accidentally learn that the "3rd party library" used for caching in pip can actually do Redis caching.
I wonder what it would cost to add Redis caching options to pip? It would make some CI/CD scenarios SOOO MUCH simpler!
As for cachecontrol internals, one can easily find out that FileCache DOES NOT store the originally requested URLs in the cache:
https://github.com/pypa/pip/blob/07a360dfe8fcad8c34d7bb70c77362cc3ec8a374/src/pip/_vendor/cachecontrol/controller.py#L258-L276
While Serializer.dumps receives request ...
https://github.com/pypa/pip/blob/main/src/pip/_vendor/cachecontrol/serialize.py#L28
... the only way it is used is to save a few request headers https://github.com/pypa/pip/blob/main/src/pip/_vendor/cachecontrol/serialize.py#L65
and the cache key (the URL) is destroyed by hashing inside FileCache
https://github.com/pypa/pip/blob/07a360dfe8fcad8c34d7bb70c77362cc3ec8a374/src/pip/_vendor/cachecontrol/caches/file_cache.py#L106-L111
So you can't recover the URL from an existing file-based cache, but you could try to hook the cache_set method and provide out-of-band storage for the request metadata
https://github.com/pypa/pip/blob/07a360dfe8fcad8c34d7bb70c77362cc3ec8a374/src/pip/_vendor/cachecontrol/controller.py#L265
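To make the out-of-band idea a bit more concrete, here is a rough sketch. Note that it wraps the cache object rather than hooking cache_set on the controller, which is a swapped-in variation of the idea above. `UrlIndexingCache`, its `lookup` helper, and the JSON index file are all hypothetical names invented for illustration; the only assumption about the wrapped cache is that it exposes `get`, `set`, and `delete`, and that keys are hashed with SHA-224 as FileCache does.

```python
import hashlib
import json
from pathlib import Path
from typing import Any, Dict, Optional


class UrlIndexingCache:
    """Hypothetical wrapper: delegates to a real cache and records digest -> URL."""

    def __init__(self, inner: Any, index_path: str) -> None:
        self._inner = inner
        self._index_path = Path(index_path)

    @staticmethod
    def _digest(key: str) -> str:
        # Assumption: same hashing scheme as the file-based cache (SHA-224 of the key).
        return hashlib.sha224(key.encode("utf-8")).hexdigest()

    def _load(self) -> Dict[str, str]:
        if self._index_path.exists():
            return json.loads(self._index_path.read_text(encoding="utf-8"))
        return {}

    def _save(self, index: Dict[str, str]) -> None:
        self._index_path.write_text(json.dumps(index, indent=2), encoding="utf-8")

    def get(self, key: str) -> Any:
        return self._inner.get(key)

    def set(self, key: str, value: Any, *args: Any, **kwargs: Any) -> None:
        # Store the entry as usual, then remember which URL produced the digest.
        self._inner.set(key, value, *args, **kwargs)
        index = self._load()
        index[self._digest(key)] = key
        self._save(index)

    def delete(self, key: str) -> None:
        self._inner.delete(key)
        index = self._load()
        if index.pop(self._digest(key), None) is not None:
            self._save(index)

    def lookup(self, digest: str) -> Optional[str]:
        """Return the URL recorded for a digest, or None if it was never indexed."""
        return self._load().get(digest)
```

Rewriting the whole index on every `set` is obviously not efficient; it only keeps the sketch short. Entries cached before such a wrapper was in place would still be unidentifiable.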
OR one could try to explore RedisCache, as it does not destroy the original URLs when caching and uses them directly as keys, so I guess they can be enumerated with Redis' keys * command:
https://github.com/pypa/pip/blob/07a360dfe8fcad8c34d7bb70c77362cc3ec8a374/src/pip/_vendor/cachecontrol/caches/redis_cache.py#L19-L26
OR hope that pypi.org always returns x-pypi-* headers (O_o !!!)
Example of response headers parsed from a FileCache entry:
{'Connection': 'keep-alive', 'Content-Length': '62843', 'Last-Modified': 'Wed, 29 Jun 2022 15:13:41 GMT', 'ETag': '"d18f682863389367f878339e288817f2"', 'x-goog-generation': '1656515621879725', 'x-goog-metageneration': '1', 'x-goog-stored-content-encoding': 'identity', 'x-goog-stored-content-length': '62843', 'Content-Type': 'application/octet-stream', 'x-goog-hash': 'md5=0Y9oKGM4k2f4eDOeKIgX8g==', 'Server': 'UploadServer', 'Cache-Control': 'max-age=365000000, immutable, public', 'Accept-Ranges': 'bytes', 'Date': 'Wed, 11 Jan 2023 17:00:42 GMT', 'Age': '5552080', 'X-Served-By': 'cache-bfi-krnt7300111-BFI, cache-hhn-etou8220055-HHN', 'X-Cache': 'HIT, HIT', 'X-Cache-Hits': '125336, 3029', 'X-Timer': 'S1673456442.418579,VS0,VE0', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'X-Frame-Options': 'deny', 'X-XSS-Protection': '1; mode=block', 'X-Content-Type-Options': 'nosniff', 'X-Robots-Header': 'noindex', 'Access-Control-Allow-Methods': 'GET, OPTIONS', 'Access-Control-Allow-Headers': 'Range', 'Access-Control-Allow-Origin': '*', 'x-pypi-file-python-version': 'py3', 'x-pypi-file-version': '2.28.1', 'x-pypi-file-package-type': 'bdist_wheel', 'x-pypi-file-project': 'requests'}
Here you go, this is the cached response for requests-2.28.1 :-)
x-pypi-file-package-type: 'bdist_wheel'
x-pypi-file-project: 'requests'
x-pypi-file-version: '2.28.1'
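As a small illustration of the header-based option, the following sketch turns a dict of response headers (like the one above) into a human-readable label. `describe_pypi_response` is a made-up helper name, and the approach only works if the index actually sent the x-pypi-* headers, which is not guaranteed for other indexes or mirrors. It also cannot rebuild the full wheel filename, since the platform and ABI tags are not in these headers.

```python
from typing import Mapping, Optional


def describe_pypi_response(headers: Mapping[str, str]) -> Optional[str]:
    """Return "project version (type)" from x-pypi-* headers, or None if absent."""
    # Header names are case-insensitive, so normalise before looking them up.
    lowered = {key.lower(): value for key, value in headers.items()}
    project = lowered.get("x-pypi-file-project")
    version = lowered.get("x-pypi-file-version")
    if not project or not version:
        return None
    package_type = lowered.get("x-pypi-file-package-type", "unknown")
    return f"{project} {version} ({package_type})"


if __name__ == "__main__":
    sample = {
        "Content-Type": "application/octet-stream",
        "x-pypi-file-package-type": "bdist_wheel",
        "x-pypi-file-project": "requests",
        "x-pypi-file-version": "2.28.1",
    }
    print(describe_pypi_response(sample))
```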
One option could be to externally create a map from sha224(url) to the url. I just purged my local cache, and then I installed click: https://files.pythonhosted.org/packages/c2/f1/df59e28c642d583f7dacffb1e0965d0e00b218e0186d7858ac5233dce840/click-8.1.3-py3-none-any.whl. If I take the sha224 of that url I get ce0b863dd6f9fe26fb2bdcdb08a17f0fb0c044bac2cc256d212517bd. I see this value in my local cache:
$ pwd
/home/stian/.cache/pip/http
$ find . -type f
./c/e/0/b/8/ce0b863dd6f9fe26fb2bdcdb08a17f0fb0c044bac2cc256d212517bd
./8/a/c/4/d/8ac4d14dc45e27d21da49fb515570b6f875b78707de9b08ce1088d1b
So if I were to iterate through all .whl urls at https://files.pythonhosted.org/packages, hash each of them, and then create this map, then I could look up things by hashes and be able to list all my cached (downloaded) wheels.
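A minimal sketch of that mapping, assuming the cache key is the URL exactly as requested and that the on-disk layout matches the find output above (the first five hex digits of the SHA-224 digest as nested directories, then the full digest as the file name). `cache_relpath` and `build_reverse_map` are hypothetical helper names. Enumerating every URL on files.pythonhosted.org may well be impractical, so in practice the input would more likely be a list of URLs you already know about.

```python
import hashlib
from pathlib import PurePosixPath
from typing import Dict, Iterable


def cache_relpath(url: str) -> str:
    """Relative path under the http cache dir where this URL would be stored."""
    digest = hashlib.sha224(url.encode("utf-8")).hexdigest()
    # e.g. "abcde..." -> "a/b/c/d/e/abcde..."
    return str(PurePosixPath(*digest[:5], digest))


def build_reverse_map(urls: Iterable[str]) -> Dict[str, str]:
    """Map each expected cache path back to the URL that produced it."""
    return {cache_relpath(url): url for url in urls}


if __name__ == "__main__":
    url = (
        "https://files.pythonhosted.org/packages/c2/f1/"
        "df59e28c642d583f7dacffb1e0965d0e00b218e0186d7858ac5233dce840/"
        "click-8.1.3-py3-none-any.whl"
    )
    for path, source in build_reverse_map([url]).items():
        print(path, "->", source)
```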
Just to absolutely confirm that my local ./c/e/0/b/8/ce0b863dd6f9fe26fb2bdcdb08a17f0fb0c044bac2cc256d212517bd was the wheel, I unzipped it:
$ unzip ./c/e/0/b/8/ce0b863dd6f9fe26fb2bdcdb08a17f0fb0c044bac2cc256d212517bd -d ~/
Archive: ./c/e/0/b/8/ce0b863dd6f9fe26fb2bdcdb08a17f0fb0c044bac2cc256d212517bd
warning [./c/e/0/b/8/ce0b863dd6f9fe26fb2bdcdb08a17f0fb0c044bac2cc256d212517bd]: 26 extra bytes at beginning or within zipfile
(attempting to process anyway)
inflating: /home/stian/click/__init__.py
inflating: /home/stian/click/_compat.py
inflating: /home/stian/click/_termui_impl.py
inflating: /home/stian/click/_textwrap.py
inflating: /home/stian/click/_winconsole.py
inflating: /home/stian/click/core.py
inflating: /home/stian/click/decorators.py
inflating: /home/stian/click/exceptions.py
inflating: /home/stian/click/formatting.py
inflating: /home/stian/click/globals.py
inflating: /home/stian/click/parser.py
inflating: /home/stian/click/py.typed
inflating: /home/stian/click/shell_completion.py
inflating: /home/stian/click/termui.py
inflating: /home/stian/click/testing.py
inflating: /home/stian/click/types.py
inflating: /home/stian/click/utils.py
inflating: /home/stian/click-8.1.3.dist-info/LICENSE.rst
inflating: /home/stian/click-8.1.3.dist-info/METADATA
inflating: /home/stian/click-8.1.3.dist-info/WHEEL
inflating: /home/stian/click-8.1.3.dist-info/top_level.txt
inflating: /home/stian/click-8.1.3.dist-info/RECORD
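The same check can be scripted. Below is a rough sketch that walks the http cache directory and reports entries that look like wheels, using only the standard library's zipfile module (not cachecontrol's Serializer). It assumes the wheel bytes are embedded in the cache file itself, as in the unzip run above; other pip versions may lay the cache out differently, in which case this would find nothing. `iter_cached_wheels` is a hypothetical name.

```python
import zipfile
from pathlib import Path
from typing import Iterator, Tuple


def iter_cached_wheels(http_cache_dir: str) -> Iterator[Tuple[Path, str]]:
    """Yield (cache_file, "name-version") for cache entries that contain a wheel."""
    for path in sorted(Path(http_cache_dir).rglob("*")):
        # zipfile locates the archive from its end, so extra leading bytes
        # (like the 26 reported by unzip) do not prevent reading it.
        if not path.is_file() or not zipfile.is_zipfile(path):
            continue
        try:
            with zipfile.ZipFile(path) as archive:
                names = archive.namelist()
        except zipfile.BadZipFile:
            continue
        for name in names:
            # A wheel ships a top-level "<name>-<version>.dist-info/WHEEL" file.
            parent, _, leaf = name.rpartition("/")
            if leaf == "WHEEL" and parent.endswith(".dist-info") and "/" not in parent:
                yield path, parent[: -len(".dist-info")]
                break


if __name__ == "__main__":
    import sys

    for cache_file, label in iter_cached_wheels(sys.argv[1]):
        print(label, cache_file)
```

This only recovers the project name and version from the .dist-info directory, not the full wheel filename with its tags; those would have to be read from the WHEEL file's Tag entries.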
So what is stopping this feature from being implemented? The discussion seems to make it clear that the bug can be solved, or the feature implemented.
@a-sajjad72 Someone to do the work, basically. If you're interested in creating a PR for this, you're welcome to do so.
I have never understood: why, in the first place, cache HTTP data, rather than the actual downloaded data?
Anyway, here is a brief demonstration of some more consequences:
#!/bin/bash
pip cache purge -q
python -m venv test
source test/bin/activate
pip install package-installation-test -q
echo 'after initial install, the package is cached:'
pip uninstall -y -q package-installation-test
pip install package-installation-test | grep cached
echo 'but the cached package is not shown in the cache list:'
pip cache list
echo 'and cannot be explicitly removed:'
pip cache remove package-installation-test
echo 'installing it requires an internet connection:'
nmcli networking off
pip uninstall -y -q package-installation-test
pip install package-installation-test 2>&1 | grep ERROR
nmcli networking on
sleep 5 # wait for Internet connection to be re-established.
echo 'purging cache does purge it:'
pip cache purge -q
pip install package-installation-test | grep cached # n.b. no output
deactivate
rm -r test/
It's easier, @zahlman. We delegate the caching to an HTTP library instead of writing our own bespoke caching layer specific to source distributions and wheels. Could we change that? Sure, but it's not something that we're looking into, given the additional complexity.