Deprecate non-PEP 625 sdist support?
What's the problem this feature will solve?
Hello pip maintainers!
TL;DR: I'd like to propose removing support for non-PEP 625 source distributions, as part of aligning pip (and other standard packaging infrastructure) with the packaging PEPs/living PyPA standards.
This would be a nontrivial change, and so I imagine it would require a full (and long) deprecation cycle. I've put some notes below about how I think this could be done gracefully, to avoid breaking the majority of packages currently on PyPI (which potentially include a long tail of non-PEP 625-conformant sdists).
For additional context:
- PyPI has fully supported PEP 625 as of this past October: https://github.com/pypi/warehouse/issues/12245
Describe the solution you'd like
I'd like pip to (eventually) reject sdists that don't conform to PEP 625. That means:
- No more support for non-standard sdist archive forms (zips, `.tar.xz`, etc.)
- Enforcing that the `{distribution}` is encoded per PEP 625 (which stipulates normalization like in the wheel spec)
- Enforcing that the `{version}` is a valid and normalized PEP 440 version
These requirements are tracked in the living version of PEP 625 as well:
https://packaging.python.org/en/latest/specifications/source-distribution-format/#source-distribution-file-format
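To make the proposed enforcement concrete, here's a minimal sketch of a strict filename check, assuming the escaped-name and normalized-version rules from the living spec (the helper name is hypothetical, not anything pip currently has):

```python
import re

from packaging.version import InvalidVersion, Version

# Escaped form of a normalized project name per the living sdist spec:
# normalization collapses runs of "-_." to "-" and lowercases; escaping
# then replaces "-" with "_", leaving alphanumeric runs joined by "_".
_ESCAPED_NAME = re.compile(r"[a-z0-9]+(?:_[a-z0-9]+)*")


def is_pep625_filename(filename: str) -> bool:
    """Hypothetical strict check: True only for spec-conformant sdist names."""
    if not filename.endswith(".tar.gz"):
        return False  # the spec mandates gzip-compressed tarballs only
    stem = filename[: -len(".tar.gz")]
    # A normalized PEP 440 version never contains "-", so the last "-"
    # unambiguously separates {distribution} from {version}.
    name, sep, version = stem.rpartition("-")
    if not sep or _ESCAPED_NAME.fullmatch(name) is None:
        return False
    try:
        return str(Version(version)) == version  # must already be normalized
    except InvalidVersion:
        return False


assert is_pep625_filename("flit_core-3.12.0.tar.gz")
assert not is_pep625_filename("Flit.Core-3.12.0.tar.gz")  # unescaped name
assert not is_pep625_filename("python-apt_3.0.0.tar.xz")  # wrong format entirely
```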
One significant hangup with this is the long tail of sdists on PyPI that aren't necessarily conformant: a significant number of users install via sdists, and a lot of sdists either predate PEP 625 or were uploaded to PyPI before strict enforcement.
Given that, I think a staggered deprecation cycle probably makes sense here. At a high level (these specific timeframes may not make sense):
- Begin emitting deprecation warnings for all sdists that don't match PEP 625
- 12+ months: begin rejecting sdists that aren't valid `.tar.gz` tarballs
- 24+ months: begin rejecting sdists that don't have valid `{distribution}-{version}` schemes
The rationale for the above is that non-`.tar.gz` sdists are probably much less common than ones with non-normalized names/versions, so it's probably fine to reject them sooner. I intend to run some statistics on a public dump of PyPI to confirm that.
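As a sketch of what that first step might look like (generic `warnings` usage for illustration; pip has its own deprecation machinery that would presumably be used instead, and `is_pep625_filename` is the hypothetical checker from above):

```python
import warnings


def warn_if_nonconformant(filename: str) -> None:
    # Phase 1 of the proposed cycle: accept the sdist, but warn loudly.
    if not is_pep625_filename(filename):
        warnings.warn(
            f"{filename!r} is not a PEP 625 conformant sdist filename; "
            "support for non-conformant sdists is deprecated and will "
            "eventually be removed.",
            DeprecationWarning,
            stacklevel=2,
        )
```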
Separately, it probably makes sense to have some kind of escape hatch for this behavior, e.g. for integrators who have archives that are sdist-shaped but can't be given PEP 625 filenames for whatever reason. Something like `pip install --allow-invalid-sdist-name ...` perhaps (that's probably not a good name, but just to show the idea).
Alternative Solutions
One alternative solution is to do nothing 🙂 -- on a basic level non-conforming sdists aren't a "big deal," since an sdist is just an archive with a build script. However, insofar as conformance with the PEPs/living PyPA standards is a long term goal for Python packaging, I think aiming to conform with this one is a good idea!
Another alternative solution is to only go partway here: instead of hard-rejecting invalid sdists, pip could choose to emit a warning instead (and never hard-reject). The upside to this is stability; the downside is that users will probably mostly ignore the warning.
Additional context
This piqued my interest because of https://github.com/astral-sh/uv/issues/16911 -- uv (like pip) currently supports `.tar.xz` as well as xz-in-ZIP, despite neither being a standard-conformant distribution representation (for either sdists or wheels).
Code of Conduct
- [x] I agree to follow the PSF Code of Conduct.
CC @zanieb and @konstin 🙂
There's also an argument beyond conformance: removing support for formats like xz reduces supply chain attack surface (which is the main reason we care in uv, and is noted in the linked uv issue).
Yeah, thanks for calling that out! I accidentally buried that in the lede, but I think there's a significant win to be had in reducing the number of implicitly valid archive format/compressor combinations.
In general terms, I'm very much in favour of this. But I think the biggest problem is the long tail of non-standard sdists. And it's not just on PyPI - we have no way of knowing how many in-house private packages exist which don't use the standard naming.
One thing which would be very interesting is an analysis of how many packages exist on PyPI where the current version still uses a non-standard name. Presumably at some point if we're talking about installers not supporting non-standard names, we'd want to start deleting them from PyPI - and if we did that, how many packages would be completely removed as a result?
One thing which would be very interesting is an analysis of how many packages exist on PyPI where the current version still uses a non-standard name.
Yeah, I think this would be a leading indicator for whether this'll actually be workable. I'm going to try and pull that from https://github.com/sethmlarson/pypi-data today 🙂
Edit: I realized pypi-data doesn't have this, since it doesn't track sdists separately.
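For anyone who wants to poke at this themselves, a rough sketch of the per-project check against PyPI's JSON simple API (PEP 691); `is_pep625_filename` is the hypothetical checker from earlier, and treating every non-`.whl` file as an sdist candidate is a simplifying assumption:

```python
import requests  # third-party: pip install requests


def nonconformant_sdists(project: str) -> list[str]:
    """Return the project's sdist-shaped filenames that fail PEP 625."""
    resp = requests.get(
        f"https://pypi.org/simple/{project}/",
        headers={"Accept": "application/vnd.pypi.simple.v1+json"},
        timeout=30,
    )
    resp.raise_for_status()
    return [
        f["filename"]
        for f in resp.json()["files"]
        if not f["filename"].endswith(".whl")
        and not is_pep625_filename(f["filename"])
    ]
```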
Presumably at some point if we're talking about installers not supporting non-standard names, we'd want to start deleting them from PyPI - and if we did that, how many packages would be completely removed as a result?
I think we'd probably want to treat this similarly to .eggs and other deprecations, where PyPI wouldn't delete anything but newer clients would just stop supporting them (and would warn appropriately when the user tries to install directly). That way legacy users can still continue to use them, but the ecosystem as a whole can move forwards.
Removing support for xz-compressed tarballs (and possibly bz2-compressed ones) would break workflows for people using packages from linux distributions or who need distro+version specific packages.
For my case in particular, we currently need access to the python-apt source packages, which are distributed by the Debian and Ubuntu teams as xz-compressed tarballs. (Likely there are similar packages that affect people on Red Hat or SuSE based distributions too.) Given that these distributions are supported well beyond the lifecycle of a Python version, even if the relevant source packages were swapped to use gzip today, there would be a very long tail where people developing on/for still-supported distros would break when using a newer pip (or uv), likely leading to people sticking with an older pip even for virtual environments.
Removing support for xz-compressed tarballs (and possibly bz2-compressed ones) would break workflows for people using packages from linux distributions or who need distro+version specific packages.
Could you say a bit more about the workflow here? My understanding is that distributions generally provide Python packages using their native packaging format. In other words, most users are supposed to do something like `apt install -y python-apt` to install this package, rather than fetch it directly from an archive.
I'm not the decision maker here, but I think there are a handful of mitigating considerations that will make this less troublesome for distributions:
- Per above, this won't affect distribution level packages; it's only for Python packaging, which has its own standards/metadata/etc;
- If this happens, the deprecation period probably will be extensive (it took 5 years to get from PEP 625's acceptance to adoption by PyPI; I hope this wouldn't be that long but such timelines are not unprecedented);
- On top of the above, I think the proposed escape hatch (some kind of flag or environment variable to allow non-PEP 625 sdists) would enable the use case laid out;
- Finally, distributions with LTS policies could always flip that default or carry patches for the version of pip they distribute. This definitely wouldn't be ideal, but I think it'd be similar to other cases where Python packaging has undergone standards changes that distributions may not want to pursue on the same timeline.
Removing support for xz-compressed tarballs (and possibly bz2-compressed ones) would break workflows for people using packages from linux distributions or who need distro+version specific packages.
If these types of package are needed as sdists (as opposed to simple source trees), then someone needs to propose that those formats get added to the packaging standards. At the moment, there is no requirement for an installer to support those formats. I don't expect `.xz` support to be that controversial (I'd imagine `.bz2` support might get more pushback), so there's no reason to believe that standardising is "too hard".
Expecting the packaging ecosystem to continue indefinitely supporting formats that aren't defined by standards isn't really a reasonable position to take, IMO.
For my case in particular, we currently need access to the python-apt source packages
The `.tar.xz` files in that directory don't look remotely like sdists to me - they have the wrong normalisation and filename structure. They are almost certainly totally fine as archived source trees (so you can `pip install https://ftp.debian.org/debian/pool/main/p/python-apt/python-apt_3.0.0.tar.xz`) but expecting them to work as sdists (where you do `pip install python-apt` and expect pip to find that file) is unreasonable, IMO[^1].
[^1]: A standards-conforming installer would see that file as version `apt_3.0.0` of package `python`, and reject it as the version doesn't conform to the version number standard.
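To spell the footnote out (an illustrative REPL session; this mirrors the split-on-last-dash rule, since normalized versions can't contain `-`):

```python
>>> "python-apt_3.0.0".rpartition("-")  # last "-" separates name from version
('python', '-', 'apt_3.0.0')
>>> from packaging.version import Version
>>> Version("apt_3.0.0")
Traceback (most recent call last):
  ...
packaging.version.InvalidVersion: Invalid version: 'apt_3.0.0'
```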
w.r.t. filename parsing and restrictions, I would like to see that code come from `packaging`, but I believe there are still open questions, or at least work still to do:

https://github.com/pypa/packaging/issues/527
https://github.com/pypa/packaging/issues/873
w.r.t. filename parsing and restrictions, I would like to see that code come from `packaging`, but I believe there are still open questions, or at least work still to do: https://github.com/pypa/packaging/issues/527
Hoist by my own petard... I forgot I filed that 4 years ago. I have some thoughts on disambiguating that API that I'll follow up on there.
FYI, we currently only use `parse_sdist_filename` from `packaging` in one location in pip: when we scan a local file directory for distributions, we filter to only filenames that are valid wheel or sdist filenames. I somewhat naively implemented this without checking on the history or outstanding questions about the function.
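For reference, the current behaviour (as of recent `packaging` releases; worth double-checking against the issues linked above) normalises rather than validates, which is exactly the ambiguity those issues discuss:

```python
>>> from packaging.utils import parse_sdist_filename
>>> parse_sdist_filename("Foo.Bar-1.0.tar.gz")  # non-normalized name accepted
('foo-bar', <Version('1.0')>)
>>> parse_sdist_filename("foo_bar-1.0.zip")  # .zip accepted as well
('foo-bar', <Version('1.0')>)
```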
we have no way of knowing how many in-house private packages exist which don't use the standard naming
Isn't this what a warning can be used to find out?
While I'd be surprised if anyone bothered using non-defaults for in-house things, the only feedback mechanism we've got there is making the code encourage people to speak up when they do. That way you can understand whether they can simply recompress all of their in-house things in the supported format, and whether it's important to support that or not.
Could you say a bit more about the workflow here? My understanding is that distributions generally provide Python packages using their native packaging format. In other words, most users are supposed to do something like `apt install -y python-apt` to install this package, rather than fetch it directly from an archive.
That's great for the global install, but it doesn't work for virtual environments or when using different versions of Python. I work on packages that depend on python-apt, which is tightly coupled to the version of apt on the system. My tests run on a matrix of distro/Python versions, so I need to get, for example, python-apt 2.4.0 for Ubuntu 22.04, 3.0.0 for Debian 13, etc. Because it's a set of C bindings, I then also need to build it for each version of Python on each of those platforms. A big chunk of why they don't release on PyPI is the tight integration with the specific apt version. As a result, I have extra dependency groups that define the specific versions of python-apt based on the release codename. For me the workaround would be to extract these sdists and then create a new tarball — that's not a huge burden on me, but I don't know how feasible that is for other developers' configurations.
On top of the above, I think the proposed escape hatch (some kind of flag or environment variable to allow non-PEP 625 sdists) would enable the use case laid out;
Absolutely, as long as it comes with not actually dropping support for other compression formats.
It wouldn't be nearly this easy for uv, but for pip itself I think the answer is actually pretty simple: with that flag disabled, `untar_file` uses mode `r:gz`. With the flag enabled, if the file doesn't match the filename format, it warns and then uses `r:*`. If it breaks at that point, we're on our own, and it depends on whether that compression method is supported by our Python executable.
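Something like this, as a minimal standalone sketch of that logic (not pip's actual `untar_file`, which does a lot more; the flag name stands in for whatever escape hatch gets chosen):

```python
import tarfile
import warnings


def open_sdist(path: str, allow_nonstandard: bool = False) -> tarfile.TarFile:
    if allow_nonstandard and not path.endswith(".tar.gz"):
        warnings.warn(f"{path} is not a PEP 625 .tar.gz sdist", stacklevel=2)
        # "r:*" asks tarfile to autodetect the compression; whether xz/bz2
        # then work depends on what this Python build was compiled with.
        return tarfile.open(path, "r:*")
    # Strict default: gzip-compressed tarballs only.
    return tarfile.open(path, "r:gz")
```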
I work on packages that depend on python-apt, which is tightly coupled to the version of apt on the system.
So, to be clear, you do something like
`pip install --index-url https://some.index.server/simple python-apt==1.0.0`
where there's no wheel for that version of python-apt, and pip picks up the sdist?
A big chunk of why they don't release on pypi is because of the tight integration to the specific apt version.
I guess I don't understand the constraints here: why can't they publish on PyPI, and expect people to use the right version for their version of apt (or encode that as discrete packages per apt version)? To my understanding, that's a somewhat common thing for integrations to do.
Absolutely, as long as it comes with not actually dropping support for other compression formats.
Well, I think there are two things here, and we should be clear about which one affects you:
- If you're installing that archive "as if" it's an sdist, then I'd expect an escape hatch here to eventually be deprecated (or for the unintended behavior to be standardized; either seems fine!)
- If you're installing that archive in the manner @pfmoore mentioned (i.e. treating it as a source tree), then this is in the territory of unstandardized tool-specific behavior. I'm not proposing that pip remove its source tree handling behavior at all, so in some sense this might not affect you at all 🙂
Regarding (2) though — if pip supports unarchiving arbitrary source trees, then we're in a tough spot from a perspective of reducing supply chain surface area.
Regarding (2) though — if pip supports unarchiving arbitrary source trees, then we're in a tough spot from a perspective of reducing supply chain surface area.
Yeah, true... I take that back then, sorry for the confusion. I guess from my perspective the ideal behavior here would be:
- pip (and uv, etc.) all use the same (PEP 625) definition of what an sdist is, and reject sdist-shaped things that aren't actually PEP 625 compliant. This would take a deprecation period per above, but eventually would mean hard-rejecting non-`.tar.gz` (or other standardized) sdists.
- For arbitrary source trees, pip (and uv, etc.) would ideally not directly support arbitrary archive formats, and maybe only support the subset that sdists are also standardized to support.
Hang on, I think we're getting out of the realm of standards at this point.
There's a formal standard on what an sdist is - it's a `.tar.gz` archive with a particular naming convention, and it's used when tools want to find a distribution for package foo, version X.Y. The naming convention is necessary there in order to unambiguously map foo-X.Y to a filename. That's fine, and I'm perfectly OK with pip (and other tools) enforcing the standards. In fact, given that tools which produce sdists (i.e., build backends) also follow those standards, users would have to take special action[^1] to get an sdist which didn't follow the conventions.
Outside of sdists, tools also support installing packages from source. This means things like `pip install .` or `uv pip install .`. The source location being installed from is often the CWD (`.`), but may also be a local directory. And pip goes further than that (and I assume uv pip does too, if the original objective of matching pip's behaviour remains a policy) by allowing such source locations to take other forms - notably VCS URLs, or archives of a source directory. What archive formats pip allows is a tool choice - just like what VCS systems we support is. The key thing here is that pip never looks for source archive files based on package name/version - all pip ever does is install from a URL that is explicitly specified by the user.
Asking pip to remove support for particular source archive formats is not what this issue is about, and if someone wants to request that, it should be a separate issue. But I'll say right now that we'd be very reluctant to do so, unless there were clear and significant security risks that pip is currently exposed to. And the reason we'd be reluctant is precisely because our users can legitimately rely on our existing support[^2], and withdrawing it would be a potentially very disruptive regression, which we'd need to justify.
What I'm not clear about (and why this discussion started in the first place) is whether the python-apt files @lengau is referring to are sdists or source archives, in the senses I describe above. It's frustratingly difficult to get a clear answer to that question, because the difference between the two is subtle, and all but incomprehensible to anyone who is not a packaging specialist 🙁 So I have some sympathy with @lengau regarding his concerns that we might be removing something he relies on - but it's not yet clear whether that is the case, and we need a better understanding of the workflows he is concerned about in order to establish that.
@lengau - could you possibly link to some documentation of exactly how users are expected to get python-apt? Assuming that documentation is clear enough for a non-specialist to follow, I should be able to establish from it whether the files are being treated as sdists or as source archives.
[^1]: Or deliberately use older versions of their build backends which didn't support the sdist standard.

[^2]: I checked the documentation, and we say you can install from "Local or remote source archives". I couldn't find any explicit discussion of what archive formats are supported, so in theory we could drop support for particular formats if we wanted to, but in practice, doing so without a very good reason would not be reasonable.
I agree this issue is intended to focus on source distributions exclusively. I'm just also noting that the motivation of reducing supply chain attack surface may not be applicable to pip (and, consequently, perhaps uv) since xz support (just as an example) would need to remain for source trees. I think the other motivations are still compelling.
I don't mind moving to a separate issue to discuss the idea of reducing the number of supported archive formats. I think we will be pretty motivated in uv to drop xz support unless there are compelling use cases (e.g., perhaps these python-apt archives).