pip Add support for listing HTTP cached packages in pip cache list

This PR adds support for listing HTTP-cached packages in pip cache list and introduces new flags to control output, addressing issue #10460.

Problem

Currently, pip cache list only shows locally built wheels stored in the wheels/ cache directory, but ignores HTTP cached packages stored in the http-v2/ or http directory. This leads to confusing behavior where users see "No locally built wheels cached" even when pip has cached wheel files from PyPI downloads.

$ pip cache info
Package index page cache size: 89 MB
Number of HTTP files: 815
Locally built wheels size: 8.9 MB
Number of locally built wheels: 35

$ pip cache list
No locally built wheels cached.  # Misleading - there are cached packages!

Solution

This PR extends pip cache list to extract package information from cached file content by inspecting the package structure offline, and adds flags to control what is listed:

Parse CacheControl metadata using cachecontrol.Serializer to locate cached response details.
Inspect cached body files to derive artifact filenames and sizes:
- Wheel files: Read .dist-info/WHEEL metadata from the ZIP; when needed, construct a filename using the first Tag: entry; fall back to {name}-{version}.whl.
- Tarballs: Read the tar structure to derive {name}-{version}.tar.gz from the root directory name.
Add CLI flags:
- --http: list only HTTP cache files.
- --all: list both locally built wheels and HTTP cache files in a unified list, suffixing HTTP entries with [HTTP cached].
Operates completely offline (no network), and is index-agnostic (not PyPI-specific).

CLI changes

Users can control which cache types to list:

pip cache list            # List locally built wheels (default; unchanged)
pip cache list --http     # List only HTTP cache files
pip cache list --all      # Unified list; HTTP entries marked with [HTTP cached]

Examples

Human-readable output shows filenames and sizes. When --http is used alone, entries are listed under an HTTP section; with --all, entries are unified and HTTP items are suffixed.

$ pip cache list --http
HTTP cache files:
 - certifi-2025.1.31.tar.gz (167 kB)
 - Django-2.1.15-py3-none-any.whl (7.3 MB)

$ pip cache list --all
 - certifi-2025.1.31.tar.gz (167 kB) [HTTP cached]
 - Django-2.1.15-py3-none-any.whl (7.3 MB) [HTTP cached]
 - setuptools-68.0.0-py3-none-any.whl (804 kB)

Implementation Details

The implementation extracts filenames by reading cached package structures:

Wheel Files

Opens the cached wheel file as a ZIP archive
Locates the .dist-info directory to get package name and version
Reads the WHEEL metadata file and uses the first Tag: value when constructing a filename if needed
Constructs a filename when needed: {name}-{version}-{tag}.whl; falls back to {name}-{version}.whl
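
As a rough illustration of the wheel steps above, here is a minimal sketch. It is not the code in this PR: the helper name wheel_filename_from_body is hypothetical, and it assumes the .dist-info directory sits at the archive root and is named {name}-{version}.dist-info. Only the WHEEL member is read, so the archive is never fully extracted.

```python
import io
import zipfile
from typing import BinaryIO, Optional


def wheel_filename_from_body(body: BinaryIO) -> Optional[str]:
    """Guess a wheel filename from ZIP contents (hypothetical helper)."""
    with zipfile.ZipFile(body) as zf:
        # Look for "<name>-<version>.dist-info/WHEEL" at the archive root.
        wheel_paths = [
            n for n in zf.namelist()
            if n.count("/") == 1 and n.endswith(".dist-info/WHEEL")
        ]
        if not wheel_paths:
            return None
        dist_info = wheel_paths[0].split("/")[0]
        stem = dist_info[: -len(".dist-info")]  # "{name}-{version}"
        # Use the first "Tag:" line of the WHEEL file, if there is one.
        for line in zf.read(wheel_paths[0]).decode("utf-8").splitlines():
            if line.startswith("Tag:"):
                tag = line.split(":", 1)[1].strip()
                return f"{stem}-{tag}.whl"
        # No tag found: fall back to "{name}-{version}.whl".
        return f"{stem}.whl"
```

The result is a best guess at the original filename, not a guarantee that the index served the file under that name.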

Tarball Files

Opens the cached tarball as a tar archive
Reads the root directory name which follows the format: {name}-{version}/
Constructs filename: {name}-{version}.tar.gz
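
The tarball steps could look roughly like the sketch below. The helper name sdist_filename_from_body is hypothetical, and the sketch assumes the first archive member lives under the {name}-{version}/ root directory. Only the first member header is read, so large archives are not decompressed in full.

```python
import io
import tarfile
from typing import BinaryIO, Optional


def sdist_filename_from_body(body: BinaryIO) -> Optional[str]:
    """Guess an sdist filename from the tar root directory (hypothetical)."""
    # "r:*" lets tarfile detect the compression (gzip, bz2, xz, or none).
    with tarfile.open(fileobj=body, mode="r:*") as tf:
        first = tf.next()  # reads only the first member header
        if first is None:
            return None
        name = first.name
        if name.startswith("./"):
            name = name[2:]
        root = name.split("/")[0]
        # Expect a "{name}-{version}" root; otherwise give up.
        if "-" not in root:
            return None
        return f"{root}.tar.gz"
```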

File Sizes

Uses the .body file size (actual package) instead of metadata file size
Provides accurate size information for all cached packages

Exclusions

Files without identifiable package names (HTML index pages, etc.) are automatically excluded
Only package files (wheels and tarballs) are displayed
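
A simplified sketch of how the size and exclusion rules could fit together, assuming each cached body is a plain file on disk. The helper name describe_body is hypothetical. Being a ZIP or tar archive only makes a file a candidate; the structure checks described above still decide whether it is listed.

```python
import tarfile
import zipfile
from pathlib import Path
from typing import Optional, Tuple


def describe_body(path: Path) -> Optional[Tuple[str, int]]:
    """Return (archive kind, size in bytes), or None for non-archives."""
    # The size comes from the body file itself, not from the metadata file.
    size = path.stat().st_size
    if zipfile.is_zipfile(path):
        return ("zip", size)  # wheel candidate
    if tarfile.is_tarfile(path):
        return ("tar", size)  # sdist candidate
    # HTML index pages, JSON responses, and so on are excluded.
    return None
```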

Additional notes:

Operates fully offline; no network requests or server-specific headers
Compatible with any package index

Testing

Comprehensive test coverage includes:

Wheel body content extraction with tags
Tarball body content extraction
Files without extractable names are excluded
Body files are skipped correctly
Corrupted files handled gracefully

All pip cache list tests pass and verify the offline body-content inspection approach.

Feedback & Discussion

All suggestions, reviews, and discussions are welcome. If there are any concerns about naming, option design, or consistency with the existing pip CLI and API, I am happy to refactor or adjust the implementation. The goal is to make this feature both intuitive for end users and maintainable for contributors going forward.

Related Issues

Closes #10460.

This PR directly addresses the problem reported in #10460. If there are other related issues that overlap with this functionality, please feel free to reference them here so they can be resolved by this change as well.

Sep 20 '25 19:09 a-sajjad72

Hi @a-sajjad72, thanks for submitting a PR to pip, please be aware all pip maintainers are currently supporting pip on a volunteer basis and therefore it may be some time before someone can review.

That said I have an early comment:

This PR extends pip cache list to parse HTTP cached responses and extract package information from PyPI's x-pypi-file-* headers, as suggested in the issue discussion.

Pip will not accept a PyPI-specific implementation. Because it's not a Python packaging standard, it won't work on arbitrary indexes, and there is no guarantee PyPI will continue to support it in the future.

Sep 20 '25 20:09 notatallshaw

Hi @notatallshaw , thanks for the early feedback.

I understand your concern about the current approach being considered “PyPI-specific” because it relies on the x-pypi-file-* headers. My intention wasn’t to hard‑code behavior for PyPI, but I see now that depending on those headers for core functionality effectively ties the feature to PyPI.

I do have an alternative, more index-agnostic idea in mind that would not depend on those headers as the primary source. I suggest we let an initial review happen first (so I know if there are any broader objections), and then we can discuss whether shifting to that alternative approach is the right next step.

If you prefer, I can outline that alternative sooner. Just let me know.

Thanks again for the clarification and your time. Let me know how you would like to proceed.

Sep 21 '25 23:09 a-sajjad72

For myself, I won't be reviewing this PR while it is tied to PyPI specific features, as I would not accept it, and I don't know how much of a change is required to make it index agnostic. Though I won't speak for other maintainers.

Sep 22 '25 00:09 notatallshaw

Thanks again. I’ll convert this PR to Draft and refactor it to be index-agnostic before asking for further review.

Planned minimal first step:

Generic wheel detection (ZIP magic + dist-info/METADATA Name and Version).
Skip non-artifact / HTML / JSON responses.
Use any x-pypi-file-* headers only as optional enrichment (never required).
Placeholder entry if name/version can’t be inferred (or simply skip if you prefer—let me know).
Keep HTTP listing behind the existing --cache-type flag initially.
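
To make the first bullet concrete, a minimal sketch of what I mean by generic wheel detection. The helper name name_version_from_body is hypothetical and this is not final code. It checks the ZIP magic bytes, then reads Name and Version from the dist-info/METADATA file, whose headers are in email header format.

```python
import zipfile
from email.parser import HeaderParser
from typing import BinaryIO, Optional, Tuple

ZIP_MAGIC = b"PK\x03\x04"  # local file header signature of a non-empty ZIP


def name_version_from_body(body: BinaryIO) -> Optional[Tuple[str, str]]:
    """Return (name, version) from dist-info/METADATA, or None."""
    if body.read(4) != ZIP_MAGIC:
        return None  # HTML, JSON, tarballs, and other non-ZIP responses
    body.seek(0)
    with zipfile.ZipFile(body) as zf:
        for member in zf.namelist():
            if member.endswith(".dist-info/METADATA"):
                text = zf.read(member).decode("utf-8")
                msg = HeaderParser().parsestr(text)
                name, version = msg["Name"], msg["Version"]
                if name and version:
                    return name, version
    return None
```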

If any maintainer would prefer an even smaller scope (e.g. wheels only, no placeholders), please let me know; otherwise I’ll proceed on this basis and update the PR description with a concise design note.

Sep 22 '25 00:09 a-sajjad72

If any maintainer would prefer an even smaller scope

I would advise that the scope be kept as small as possible while still providing a helpful user experience, to be more likely to be accepted. For example, I do not think there should be any use of PyPI only features, even as optional enrichment.

I'm sorry I can't contribute more to a design discussion right now; I don't have much experience with the design of the cache. That is part of why a smaller scope will make it easier for a maintainer to start a review.

Sep 22 '25 00:09 notatallshaw

I agree with everything @notatallshaw said. Furthermore, I’d like some discussion of the correctness of the whole approach. The HTTP cache is just that - a cache of HTTP requests, not a cache of downloaded files. The cache includes simple index responses and possibly other information pip has requested - presenting it as just holding wheels is misleading. Also, an index has no obligation to provide any information that a downloaded file comes from a wheel - so we know that accurate data is impossible to achieve, the best we can do is provide a guess. That guess will be accurate in many cases, but we should present it clearly as a guess, and not tempt people to rely on it.

Finally, I’m concerned about the cost of this. Wheels can be big. Have you done any testing of performance, on a large HTTP cache, with some big wheels (multiple copies of PyTorch would be a good start!) in it?

Sep 22 '25 07:09 pfmoore

Thanks @pfmoore for providing your insights on this.

The HTTP cache is just that - a cache of HTTP requests, not a cache of downloaded files. The cache includes simple index responses and possibly other information pip has requested - presenting it as just holding wheels is misleading.

Yes, I fully agree that the HTTP cache is just saved HTTP responses, and the cached wheels we are interested in are only one kind of response in it.

Also, an index has no obligation to provide any information that a downloaded file comes from a wheel - so we know that accurate data is impossible to achieve, the best we can do is provide a guess. That guess will be accurate in many cases, but we should present it clearly as a guess, and not tempt people to rely on it.

When I started working on this, I found that some of the cache directories contain .body responses, many of which are valid archive files. In my testing, I found a total of 361 .body responses in the cache, of which 134 were valid archive files. These files include both sdists and bdists.

Finally, I’m concerned about the cost of this. Wheels can be big. Have you done any testing of performance, on a large HTTP cache, with some big wheels (multiple copies of PyTorch would be a good start!) in it?

Yes, I tested it, and it (the PyPI-specific implementation) takes approximately the same time as pip cache info, maybe slightly more, but not by much.

What will the revised approach be?

The core of the revised approach is to identify packages from the .body responses in the HTTP cache, which I've found are often cached wheels (bdist) and source distributions (sdist).

More reliable metadata: Instead of parsing the METADATA file, I will extract the package name and version from the normalized .dist-info or .egg-info directory name. This is far more robust as it relies on a consistent, mandatory packaging standard.
Support for sdists: The revised approach will handle both binary distribution (bdist) and source distribution (sdist) formats. This provides a more comprehensive view of the cached packages.
Performance: I have verified that this method is very efficient, as it avoids extracting the full archive, even for very large files.
Unknown files: Any .body responses that are not valid wheel or sdist archives will be ignored, so the output will only contain reliably identified packages.

This approach offers a practical and significantly more reliable way to list cached packages without making incorrect assumptions about the cache's contents.

Please let me know if this sounds reasonable, and I will start working on it and update the PR description.

Sep 28 '25 20:09 a-sajjad72

@pfmoore The PR description has been updated to reflect the revised HTTP cache listing implementation. Please take a look when you have time, and let me know if anything needs to be changed.

Oct 02 '25 23:10 a-sajjad72

Hello, I am fully aware that pip is currently maintained on a volunteer basis, but it has been a while and I haven't received any follow-up from the pip team. For this reason, I am pinging some of the team members participating in this PR: @notatallshaw @pfmoore

Nov 01 '25 13:11 a-sajjad72

Hi @a-sajjad72, unfortunately I'm not sure I'm going to have a chance to do a detailed design review or suggest a clear direction forward any time soon. I can put this on my review list, but I likely won't get to this until early 2026.

That said I did quickly skim over the PR in case you do want to update it or another maintainer gets a chance to do a more detailed review:

The naming of the two new modes you've added isn't quite right. As Paul points out, there can be more than packages in the HTTP cache, so pip cache list --http and pip cache list --all do not imply that they are looking at packages only. Perhaps --http-packages and --all-packages?
I'm not sure the right design is to have two additional modes, but I don't have a better suggestion right now.
Do not import inside functions; put all imports at the top level. While there are use cases for importing inside functions, this is not the style inside pip, and it can introduce security concerns for us if that function ends up being used as part of the install report, so it's best just to avoid it altogether.
Avoid giant try/except blocks and avoid broad exception catching.

This is an anti-pattern:

try:
    ...
    # Lots of code
    ...
except OSError:
    ...

And this is an anti-pattern:

try:
    ...
except Exception:
    pass

Ideally exceptions should be specific and targeted to a small piece of code, so the control flow is clearly readable.

If exceptions really are needed around a large part of the code (unusual and must be justified), then move the code into its own function or method and put the exception handling around the function or method call.

Likewise, if a broad Exception is required, it must also be justified by adding a comment. The only common scenario in which I would normally accept this is directly calling unknown code, which is not a scenario that happens for pip outside invoking a build backend.
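
For illustration only, the preferred shape might look something like the sketch below. The function names are hypothetical and not taken from this PR. The try block wraps a single small helper and catches only the errors expected when opening an untrusted cache file.

```python
import zipfile
from pathlib import Path
from typing import Iterable, List, Optional


def read_zip_names(path: Path) -> Optional[List[str]]:
    """Return member names, or None if the file is not a readable ZIP."""
    try:
        with zipfile.ZipFile(path) as zf:
            return zf.namelist()
    except (zipfile.BadZipFile, OSError):
        # Narrow on purpose: a corrupt archive or an unreadable file.
        return None


def list_readable_zips(paths: Iterable[Path]) -> List[Path]:
    """Keep only the paths that open as ZIP archives."""
    # No try/except here: the control flow stays easy to follow.
    return [p for p in paths if read_zip_names(p) is not None]
```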

Nov 01 '25 14:11 notatallshaw

I agree with the comments @notatallshaw made, for what it's worth. But I also don't have the free time right now to review this. I'm sorry if it's frustrating for you, but to be perfectly honest, a month with no reviews really isn't that long for a PR in this repository. We have much more impactful pieces of work that have been stalled for many months, even years, due to lack of maintainer resource. Yes, this is far from ideal, but it's a reality that we simply have to deal with.

Nov 01 '25 17:11 pfmoore

Hi @a-sajjad72, unfortunately I'm not sure I'm going to have a chance to do a detailed design review or suggest a clear direction forward any time soon. I can put this on my review list, but I likely won't get to this until early 2026.

No problem at all, I'm not in a rush. I was just checking in to follow up on my changes.

The naming of the two new modes you've added isn't quite right. As Paul points out, there can be more than packages in the HTTP cache, so pip cache list --http and pip cache list --all do not imply that they are looking at packages only. Perhaps --http-packages and --all-packages?

Yes, you are right, and I already agreed with that earlier. However, this time, I was focusing on keeping the flag names short for convenience and didn't pay attention to that detail. Including packages in the flag names is a good idea to clearly indicate what they are referring to.

I'm not sure the right design is to have two additional modes, but I don't have a better suggestion right now

Do not import inside functions; put all imports at the top level. While there are use cases for importing inside functions, this is not the style inside pip, and it can introduce security concerns for us if that function ends up being used as part of the install report, so it's best just to avoid it altogether.

Avoid giant try/except blocks and avoid broad exception catching. ...

Thank you for the quick overview of the PR changes. I'll make sure to avoid these issues and adhere to pip's style guide.

Nov 06 '25 00:11 a-sajjad72

I'm sorry if it's frustrating for you, but to be perfectly honest, a month with no reviews really isn't that long for a PR in this repository.

No worries at all, I was just following up to see if there’s been any progress on my changes.

We have much more impactful pieces of work that have been stalled for many months, even years, due to lack of maintainer resource. Yes, this is far from ideal, but it's a reality that we simply have to deal with.

I’m personally interested in contributing as a maintainer to support the project and enhance my own learning. If you could guide me through the roadmap, it would help me get up to speed and become an effective part of the pip team.

If you’re comfortable sharing the roadmap here, please do so; otherwise, feel free to reach me at [email protected] or [email protected]

Nov 06 '25 00:11 a-sajjad72