Bug: `parse_wheel_filename` permits non-normalized versions.
https://packaging.python.org/en/latest/specifications/binary-distribution-format/ says:
Version numbers should be normalised according to the Version specifier specification.
Currently, parse_wheel_filename will raise InvalidWheelFilename for some invalid filenames, but https://packaging.python.org/en/latest/specifications/version-specifiers/#normalization has a long list of version normalizations that parse_wheel_filename does not enforce:
>>> from packaging.utils import parse_wheel_filename
>>> parse_wheel_filename('foo-01.0.0-py3-none-any.whl')
('foo', <Version('1.0.0')>, (), frozenset({<py3-none-any @ 140365619033920>}))
Does packaging have any sort of deprecation process for this sort of change? i.e. a public function that throws an error where it didn't used to.
Pip is migrating to use parse_wheel_filename as the primary way to parse wheel filenames in 25.1 and I'm concerned on impact to users.
I'm still learning how pip migrates to new packaging versions, I guess we could patch packaging to throw a warning instead of an exception for an appropriate deprecation period.
@notatallshaw You may be interested in this branch I am working on to provide a better API here around creating/parsing filenames, including "strict" and "non-strict" parsing: https://github.com/pypa/packaging/compare/main...di:packaging:filename-api
From a cursory glance, it appears that most build backends are producing wheels with normalized filenames, so I wouldn't expect this to be too disruptive, but I could also be convinced to leave parse_wheel_filename alone until we have a better way to introduce this change.
Does packaging have any sort of deprecation process for this sort of change?
We can raise a warning for a while if an analysis of PyPI suggests it would be disruptive.
From a cursory glance, it appears that most build backends are producing wheels with normalized filenames, so I wouldn't expect this to be too disruptive, but I could also be convinced to leave parse_wheel_filename alone until we have a better way to introduce this change.
I don't want to hold up packaging if this is a clear win for most use cases. But a bit of background:
Currently pip is using a custom regex to parse wheel filenames: https://github.com/pypa/pip/blob/25.0.1/src/pip/_internal/models/wheel.py#L21
There was a use case pip explicitly supported where _ was allowed in the version part, so we entered a depreciation cycle. So far I've seen two reports of users hitting this (https://github.com/pypa/pip/issues/12938#issuecomment-2469681012 and https://github.com/quic/aimet/issues/3743)
These two use cases suggest the two scenarios are users hand rolling their own wheel filenames on private indexes / filesystems and perhaps an edge case in scikit-build-core, again for a project not uploaded to PyPI.
In general when pip is consuming wheel filenames I would rather it be non-strict in cases where the filename can be unambiguously normalized. But I also suppose we could lengthen the depreciation cycle and flag all issues to users (outside the explicitly supported _ in the version part). I would need to do some investigation on what the difference is between this additional strictness here and pip's current regex.
We can raise a warning for a while if an analysis of PyPI suggests it would be disruptive.
Taking a look at all filenames uploaded in the last month:
COPY (
SELECT
fe.time AS timestamp,
fe.additional->>'filename' AS filename
FROM
file_events fe
WHERE
fe.time >= NOW() - INTERVAL '1 month'
AND fe.tag = 'file:add'
AND fe.additional->>'filename' LIKE '%.whl'
ORDER BY fe.time
) TO STDOUT WITH (FORMAT CSV, HEADER);
Analyzing with:
import csv
from packaging.utils import InvalidWheelFilename, parse_wheel_filename
total_filenames = 0
exception_count = 0
with open("out.csv") as file:
reader = csv.DictReader(file) # Assumes CSV has a header
for row in reader:
total_filenames += 1
filename = row["filename"]
try:
parse_wheel_filename(filename)
except InvalidWheelFilename:
exception_count += 1
percentage_exceptions = (exception_count / total_filenames) * 100
print(f"Total filenames processed: {total_filenames}")
print(f"Total exceptions: {exception_count}")
print(f"Percentage of exceptions: {percentage_exceptions:.2f}%")
With this branch, we get:
Total filenames processed: 210589
Total exceptions: 24
Percentage of exceptions: 0.01%
The invalid filenames in question are:
anki-25.01b1-cp39-abi3-macosx_12_0_arm64.whl
anki-25.01b1-cp39-abi3-macosx_12_0_x86_64.whl
anki-25.01b1-cp39-abi3-manylinux_2_35_aarch64.whl
anki-25.01b1-cp39-abi3-manylinux_2_35_x86_64.whl
anki-25.01b1-cp39-abi3-win_amd64.whl
anki-25.01rc1-cp39-abi3-macosx_12_0_arm64.whl
anki-25.01rc1-cp39-abi3-macosx_12_0_x86_64.whl
anki-25.01rc1-cp39-abi3-manylinux_2_35_aarch64.whl
anki-25.01rc1-cp39-abi3-manylinux_2_35_x86_64.whl
anki-25.01rc1-cp39-abi3-win_amd64.whl
anki-25.02-cp39-abi3-macosx_12_0_arm64.whl
anki-25.02-cp39-abi3-macosx_12_0_x86_64.whl
anki-25.02-cp39-abi3-manylinux_2_35_aarch64.whl
anki-25.02-cp39-abi3-manylinux_2_35_x86_64.whl
anki-25.02-cp39-abi3-win_amd64.whl
anki-25.02rc1-cp39-abi3-macosx_12_0_arm64.whl
anki-25.02rc1-cp39-abi3-macosx_12_0_x86_64.whl
anki-25.02rc1-cp39-abi3-manylinux_2_35_aarch64.whl
anki-25.02rc1-cp39-abi3-manylinux_2_35_x86_64.whl
anki-25.02rc1-cp39-abi3-win_amd64.whl
aqt-25.01b1-py3-none-any.whl
aqt-25.01rc1-py3-none-any.whl
aqt-25.02-py3-none-any.whl
aqt-25.02rc1-py3-none-any.whl
So it's really only two projects that are producing wheels with non-normalized versions, not 24 projects.
I can expand the analysis to a larger timeframe if anyone would like?
Also, seems reasonable to add a strict flag to parse_wheel_filename that disables this check, which pip could optionally set to handle their own deprecation period.
We can raise a warning for a while if an analysis of PyPI suggests it would be disruptive.
So it's really only two projects that are producing wheels with non-normalized versions
While helpful information, my experience supporting pip is that it's users who are not uploading to PyPI that are caught by changes to more strictly comply with the spec.
Also, seems reasonable to add a
strictflag toparse_wheel_filenamethat disables this check, whichpipcould optionally set to handle their own deprecation period.
I think this would make it easier, and I definitely like the look of the filename API you linked earlier.
All that said, please don't consider it a show stopper, looking at all this code has made me think I need to re-review this and check additional scenarios before migrating to parse_wheel_filename, as I said we can always lengthen the depreciation cycle or patch packaging.
All that said, please don't consider it a show stopper, looking at all this code has made me think I need to re-review this and check additional scenarios before migrating to
parse_wheel_filename, as I said we can always lengthen the depreciation cycle or patch packaging.
If the migration hasn't started yet, I would recommend waiting until we can get a better API for parsing filenames into this library. Note that parse_wheel_filename also doesn't enforce that the project name is normalized (a much more common issue) and that it's partner parse_sdist_filename has a bunch of issues: https://github.com/pypa/packaging/issues/527.
If the migration hasn't started yet, I would recommend waiting until we can get a better API for parsing filenames into this library. Note that
parse_wheel_filenamealso doesn't enforce that the project name is normalized
We started the depreciation in pip 24.3 due to be complete in pip 25.1 (the next release), I'll discuss with the other pip maintainers, thanks.
and that it's partner
parse_sdist_filenamehas a bunch of issues
I've not looked at sdists in general yet, that seemed like a much bigger can of worms. The only place I think we're using parse_sdist_filename right now is to filter filenames when a user points pip to a file directory to find packages, I added that before I understood that sdist and wheel names were tricky, fortunately that was done in pip 24.0 (a year ago) and we've not seen any complaints.
If the migration hasn't started yet, I would recommend waiting until we can get a better API for parsing filenames into this library.
What API are you thinking of?
What API are you thinking of?
The one I mentioned here:
@notatallshaw You may be interested in this branch I am working on to provide a better API here around creating/parsing filenames, including "strict" and "non-strict" parsing: https://github.com/pypa/packaging/compare/main...di:packaging:filename-api
I'm also OK waiting on that (or something like that) to land instead of merging #874, we can add this check to PyPI in the interim instead.
What API are you thinking of?
The one I mentioned here:
FYI that won't work as written as it doesn't handle compressed wheel tags (it's why https://packaging.pypa.io/en/stable/utils.html#packaging.utils.parse_wheel_filename returns a set of tags instead of a single one).
FYI I changed the approach to migrate to this function in pip: https://github.com/pypa/pip/pull/13229
parse_wheel_filename will be the default way to parse wheel filenames, but if parsing fails it falls back to the old parsing logic during the deprecation period.
Rereading the spec I do see that it says names and versions of binary distributions should be normalized. Because pip now has a fallback we can extend the deprecation period if this function becomes strict about this if this isn't released before the next pip release.
What API are you thinking of?
The one I mentioned here:
FYI that won't work as written as it doesn't handle compressed wheel tags (it's why https://packaging.pypa.io/en/stable/utils.html#packaging.utils.parse_wheel_filename returns a set of tags instead of a single one).
Ack, will take that into account, it's still a WIP as I'm hoping to integrate it into PyPI first as a proof-of concept, but I'm currently blocked on other things that haven't landed yet.
I see @di's PR for setuptools to produce normalized wheel file names just released: https://github.com/pypa/setuptools/pull/4766
I'd assumed that build tooling had largely migrated to using normalized names, this broke a bunch of tests in pip's CI, I'm sure in many private repos they will not be strictly following the spec on name normalization.
It's up to packaging what to do with this function, but from pip's perspective this makes me nervous about being strict about name normalization. If packaging becomes strict here, with no non-strict option, I think pip should hold off on migrating for several years.
I agree. I'm closing https://github.com/pypa/packaging/pull/874 for now in favor of a new API here that supports both strict and non-strict parsing for a gentler migration.
Just a quick update to this, as I had conversations about this change in other settings, and it was pointed out to me the spec does call out tools that consume wheels:
Tools consuming wheels must be prepared to accept
.(FULL STOP) and uppercase letters, however, as these were allowed by an earlier version of this specification.
So any update or replacement to parse_wheel_filename should ideally have both the standard behavior and this behavior.
@di did you still have plans for that "new API here that supports both strict and non-strict parsing for a gentler migration."?
I am still planning on working on this, but found a few instances where PyPI was permitting non-standard filenames and have been distracted fixing those. Happy to collaborate though if this is on your radar as well.
No plans on my end; just going through old issues to see which ones to close or keep open.