packaging
packaging copied to clipboard
Ambiguous version parsing in `parse_sdist_filename`
Hi there! First of all, thanks for continuing to maintain this package -- it's extremely useful 🙂
I'm one of the maintainers of pip-audit
, and we had a user report some strange dependency resolution behavior: https://github.com/trailofbits/pip-audit/issues/248
We were able to root-cause the bug down to a release of cffi
(1.0.2-2
) that uses the implicit post releases syntax for specifying the post-release number, rather than the canonicalized postN
format. This release of cffi
is published on PyPI here, without canonicalization, so it's likely that it was uploaded before PyPI began normalizing versions.
Because 1.0.2-2
contains a dash, the following body of packaging.utils.parse_sdist_filename
contains an incorrect assumption and parses the source distribution name incorrectly:
def parse_sdist_filename(filename: str) -> Tuple[NormalizedName, Version]:
if filename.endswith(".tar.gz"):
file_stem = filename[: -len(".tar.gz")]
elif filename.endswith(".zip"):
file_stem = filename[: -len(".zip")]
else:
raise InvalidSdistFilename(
f"Invalid sdist filename (extension must be '.tar.gz' or '.zip'):"
f" {filename}"
)
# We are requiring a PEP 440 version, which cannot contain dashes,
# so we split on the last dash.
name_part, sep, version_part = file_stem.rpartition("-")
if not sep:
raise InvalidSdistFilename(f"Invalid sdist filename: {filename}")
name = canonicalize_name(name_part)
version = Version(version_part)
return (name, version)
yielding:
>>> from packaging.utils import parse_sdist_filename
>>> parse_sdist_filename("cffi-1.0.2-2.tar.gz")
('cffi-1-0-2', <Version('2')>)
whereas we expected:
>>> from packaging.utils import parse_sdist_filename
>>> parse_sdist_filename("cffi-1.0.2-2.tar.gz")
('cffi', <Version('1.0.2.post2')>)
TL;DR: parse_sdist_filename
shouldn't rely on the last dash as a separator between the distribution name and the version, since PEP 440 allows dashes in non-normalized versions. Parsing this correctly poses a bit of a challenge, since distribution names can also contain dashes and numbers and might even contain them in pathological ways, such as:
# package foo3, version 1.0.0.post1
foo3-1.0.0-1.tar.gz
# package foo-3, version 1.0.0.post1
foo-3-1.0.0-1.tar.gz
# package 3_3, version 1.0.0.post1
# i'm not sure this one is valid, but i can't find a countervailing spec in any of the packaging PEPs
3-3-1.0.0-1.tar.gz
(Another possible solution here is to normalize all of the current distribution releases hosted on PyPI, and emphasize that parse_sdist_filename
only supports normalized PEP 440 versions. But I'm not sure about the blast radius/impact of doing so -- it would mean changing a lot of the index filenames, if not their content-addressed distribution filenames on the CDN.)
Here's the valid distribution regexp in PEP 508:
^([A-Z0-9]|[A-Z0-9][A-Z0-9._-]*[A-Z0-9])$
So 3-3
and 3_3
are valid distribution names, making 3-3-1.0.0-1.tar.gz
and 3_3-1.0.0-1.tar.gz
valid pre-normalized sdist filenames.
I'm not sure I would consider this to be a bug in packaging
- we are assuming the filename has a normalized PEP 440 version, which technically cffi-1.0.2-2.tar.gz
does, and have no way to determine that this is an implicit post release and not a project named cffi-1.0.2
with a version 2
.
One thing we could do here is to allow the caller to optionally provide the distribution name they were expecting to be present, so this function could either use this as a hint during parsing, or just raise an exception when what it found isn't what was expected. I'm not sure how useful this would be broadly, though.
(Another possible solution here is to normalize all of the current distribution releases hosted on PyPI, and emphasize that parse_sdist_filename only supports normalized PEP 440 versions. But I'm not sure about the blast radius/impact of doing so -- it would mean changing a lot of the index filenames, if not their content-addressed distribution filenames on the CDN.)
Fairly unlikely we would do this, IMO.
One thing we could do here is to allow the caller to optionally provide the distribution name they were expecting to be present, so this function could either use this as a hint during parsing, or just raise an exception when what it found isn't what was expected. I'm not sure how useful this would be broadly, though.
This seems like a reasonable workaround to me, but +1 on the general usefulness concern (it's definitely useful for our needs on pip-audit
, but I'm not sure about anyone else's needs).
I have opened https://github.com/pypa/packaging.python.org/pull/1066 to strengthen the spec around sdist file names. That implicitly updates our docs since we reference it for what the function can parse.
For some additional information. pip is able to install the package because the Simple API also provides the project name in the URL (for cffi, the endpoint is https://pypi.org/simple/cffi/
), so pip is able to “know” that the project name is cffi
instead of cffi-1-0-2
and use that information to assist parsing. Perhaps parse_sdist_filename
could do the same by accepting an optional second argument? (And if packaging
does not, pip-audit
can still add additional logic around the function to handle edge cases for compatibility.)
We should probably loop some of the cffi maintainers to inform them about this.
pip-audit
can still add additional logic around the function to handle edge cases for compatibility.)
Yep, that's what we ended up doing: https://github.com/trailofbits/pip-audit/pull/249
That suggestion otherwise sounds similar to what @di suggested in https://github.com/pypa/packaging/issues/527#issuecomment-1080880967, which sounds good to me! I'm happy to make those changes in a PR if other consumers would find them useful :slightly_smiling_face:
Updated the title to emphasize that this is an ambiguity, not a spec violation per se.
Related: the ambiguous parse here could also probably be avoided by enforcing the current guidance on sdist filenames, which suggests that distribution names be normalized from -
to _
. I opened https://github.com/pypa/packaging.python.org/issues/1077 as part of that, since PyPI and other hosts don't currently perform that normalization.
Well… https://peps.python.org/pep-0625/
Well… https://peps.python.org/pep-0625/
Yep! I was just reading that PEP 🙂. Right now it says that distributions are normalized according to 503, which defers to 426, which means that they can still contain -
. Unless I'm misunderstanding, we'd probably need a slightly stronger normalization to avoid the ambiguous parse in this case.
(But also NB that it's not currently an issue, since all recent sdists are being published with normalized versions. This would strictly be a "cherry on top" in terms of reducing ambiguity.)
You missed a part:
The name of an sdist should be {distribution}-{version}.sdist.
- distribution is the name of the distribution as defined in PEP 345, and normalised according to PEP 503, e.g. 'pip', 'flit-core'.
- version is the version of the distribution as defined in PEP 440, e.g. 20.2.
Each component is escaped according to the same rules as PEP 427.
Specifically:
Each component is escaped according to the same rules as PEP 427.
So the names get normalized as per PEP 503, which will convert any runs of non alphanum characters into a single -
, then PEP 427 escaping -
into _
.
TBH, PEP 503 normalization really only means the name gets lowercased, since PEP 427 implements the exact same (effectively anyways) normalization as PEP 503 does, just uses _
instead of -
, other than PEP 503 adds lower casing as well.
Whoops, I also missed 427.
Just to clarify:
So the names get normalized as per PEP 503, which will convert any runs of non alphanum characters into a single
-
, then PEP 427 escaping-
into_
.
This only applies to wheels, not sdists, correct? Experimentally PyPI does not currently normalize -
into _
for sdist distribution names. Example: https://pypi.org/simple/pip-audit/
(That's where the ambiguity currently appears to me. The current PyPA spec says that that normalization does occur for sdists, whereas it doesn't appear to.)
PEP 427 only applies to wheels, and PEP 625 says “we’ll use the same rules for sdist”.
That's this language, right?
Each component is escaped according to the same rules as PEP 427.
If so, that would mean a format like this, correct?
foo_bar-1.2.3.sdist
If so, that all makes sense so me, and that would be a good improvement 🙂
(This is a significant tangent at this point, but I wonder if we couldn't avoid all of this ad-hoc parsing entirely by updating PEP 503 to include more data-...
attributes. This could be done both both sdists and wheels: have attributes like data-distribution=...
, data-version=...
, and so forth.)
Hopefully all these would go away when we finally standardise the JSON API. Regarding the PEP… happy to hear it all make sense for you, unfortunately not many people are entirely in agreement with the proposal (read the Discourse thread linked in the PEP).
Would a utils.escape_name()
function be useful in this case? I assume it would just be utils.canonicalize_name(name).replace("-", "_")
, but at least that's easier than saying "do PEP 503 with PEP 427". 😉
Would a
utils.escape_name()
function be useful in this case? I assume it would just beutils.canonicalize_name(name).replace("-", "_")
, but at least that's easier than saying "do PEP 503 with PEP 427". 😉
I think it'd be useful for 625, although the current sdist filename format is probably foregone (since PyPI and other hosts don't do that escaping and lots of packages are already hosted with -
in the filename).
the current sdist filename format is probably foregone (since PyPI and other hosts don't do that escaping and lots of packages are already hosted with
-
in the filename).
For old sdists, yes, but there's nothing saying that naming can't be enforced going forward to slowly update the community through attrition.
I opened https://github.com/pypa/packaging/issues/542 for escape_name()
.
I would have to think about it some, but there's a high chance that we could just force normalize sdist names on upload in Warehouse. There's a lower chance, but still reasonable, that we could rename existing sdists to be normalized. You could open a Warehouse ticket about it to figure it out.
Yep, I can open that ticket in a bit.
FWIW, I was reading the write up about this, and FWIW I do in fact think that this is a bug with the current parse_sdist_filename
.
Unfortunately parse_sdist_filename
has made the assumption that you can parse a sdist filename without any additional context, which absent PEP 625 is not true. Without something like PEP 625, we have never required sdists normalize or escape any part of their name.
The sdist page on packaging.python.org suggests that canonicalizing the name and version is a de facto standard, which does not appear to actually be true? Looking through a number of projects, I don't see hardly any canonicalizing really happening.
While I agree that we should be, at a minimum, escaping -
to _
in sdists, there is no actual standard or practice currently that requires it. I believe that this util function has gone unnoticed because I don't think that many projects utilize post releases, and pip itself doesn't use this function so it's not had enough use to have someone trigger the edge case until now.
Unfortunately I don't have a great answer here, parse_sdist_filename
inherently can't work with all currently valid sdist names unless you break compatability to require passing in the additional context of the package name in question. So without breaking compatability, it is a fundamentally broken API. Ideally I guess PEP 625 gets accepted and then at least this function could be documented as only support PEP 625 compatabile file names.
Having parse_sdist_filename
exist in this library is kind of weird in general, since the libraries stated goals are:
This library provides utilities that implement the interoperability specifications which have clearly one correct behaviour (eg: PEP 440) or benefit greatly from having a single shared implementation (eg: PEP 425).
Which doesn't actually exist for sdist filenames.
Unfortunately
parse_sdist_filename
has made the assumption that you can parse a sdist filename without any additional context, which absent PEP 625 is not true.
I think you're right, so we should probably update the docs to say that it implements PEP 625 even though it's not accepted.
Having
parse_sdist_filename
exist in this library is kind of weird in general
I think when we accepted the function we didn't realize how many nasty edge cases that are all technically legal we had missed, or that they are in any way widespread. Maybe this is more motivation to get PEP 625 accepted. 😁
Yea, it happens. packaging is full of them :P (though increasingly less of them!)
I think documenting that it implements PEP 625, possibly with warning that PEP 625 isn't accepted or enforced anywhere yet, so that there are edge cases where it will not correctly parse otherwise currently valid names to warn people is probably a reasonable option.
It's also probably the best option for what to do, since it's somewhat silly to just egregiously break compatibility for an edge case, particularly one that I think is somewhat hard to trigger in practice using modern tooling (IIRC setuptools normalizes the version by default anyways, and the cffi version that actually broke this, isn't a valid package for other reasons).
Oh, I guess to technically support PEP 625, it should be implemented to also support the .sdist
extension that PEP 625 implements, which is probably fine?
Or maybe just document it requires the naming rules from PEP 625, but not the extension.
Or maybe just document it requires the naming rules from PEP 625, but not the extension.
Probably this since adding the file extension is easy enough later in case that doesn't survive the PEP acceptance process.
FYI https://peps.python.org/pep-0625/ has been accepted!