packaging icon indicating copy to clipboard operation
packaging copied to clipboard

migrating from distlib.wheel: `packaging.utils.parse_wheel_filename` allows illegal platform tags

Open asottile-sentry opened this issue 2 years ago • 6 comments

I'm looking to migrate a caller from utilizing distlib.wheel to instead use packaging.utils.parse_wheel_filename and my tests encountered an edge case which isn't handled by the packaging version

here's the distlib code: https://github.com/pypa/distlib/blob/91aa92e64ee5d5005901907eaa340a72f18d7212/distlib/wheel.py#L83

packaging's parse_tag doesn't validate the interpreter in any way -- whereas distlib required a particular structure

the test case in question: https://github.com/chriskuehl/dumb-pypi/blob/b3fc62c6519bdb84960d2b491b40ad065fb79a47/tests/main_test.py#L61

playlyfe-0.1.1-2.7.6-none-any.whl should not parse as a valid wheel, however packaging allows it (and parses as three separate tag-triplets: 2-none-any, 7-none-any, 6-none-any)

I propose adding a little bit of validation to parse_tag to check that the tags are supported interpreter versions

I believe this involves something like:

_INTERPRETER_RE = re.compile(fr'^(?:{"|".join(sorted(INTERPRETER_SHORT_NAMES.values()))}\d+')

and then utilizing that regex to validate the names (and throwing an appropriate exception) -- I think a new InvalidTag (extending ValueError) or somesuch exception would need to be created as well (similar to the ones in packaging.utils for the wheel / stdist filename)

wanted to propose this first before diving in and implementing it to save time -- thoughts?

asottile-sentry avatar Jul 18 '22 17:07 asottile-sentry

Missing a trailing ')' in your re, I assume after the closing "}".

_INTERPRETER_RE = re.compile(fr'^(?:{"|".join(sorted(INTERPRETER_SHORT_NAMES.values()))})\d+')

If you are building an re of values from some other source, sorting by descending length will be more likely to give you a unique match, in the case where some short value is a prefix of some longer value.

...sorted(INTERPRETER_SHORT_NAMES.values(), key=len, reverse=True)...

Though probably not an issue in this small group of values (after looking them up in the code).

ptmcg avatar Jul 18 '22 18:07 ptmcg

Unfortunately the best we could do based on my reading of the spec (obviously let me know if I missed something) is test whether the name is normalized as required by https://packaging.python.org/en/latest/specifications/binary-distribution-format/#escaping-and-unicode . Otherwise the interpreter short names are not guaranteed to be the only valid interpreter names, just convenient ones (e.g. Pyston is not covered by that list). We can't make any assumptions as to what interpreter name is considered valid, and so 2.7.6 as a compressed tag of silly interpreter names is still valid.

I can totally see having a higher-level function in another library that requires expected interpreter tags, but as such a low-level library I don't think we can be that pragmatic about what is usually expected and have to stick to the spec in this instance.

brettcannon avatar Jul 18 '22 23:07 brettcannon

PEP 425 adds more color: https://peps.python.org/pep-0425/#python-tag -- that it should be implementation name and then version (or version with an underscore) -- so I believe \w+[0-9][0-9_]* would be a required pattern for a tag

asottile-sentry avatar Jul 19 '22 13:07 asottile-sentry

@brettcannon should this be reconsidered given above?

asottile avatar Jul 21 '22 16:07 asottile

I'm okay with re-opening it, but whether this is a feature request or a bug I'm still torn about. For instance, while PEP 425 says "The version is py_version_nodot," that's entirely CPython-specific as to what that means. As such, do we get to say that it must follow https://peps.python.org/pep-0440/#version-scheme ? What about trailing underscores like your regex allows?

As well, "Other Python implementations should use sys.implementation.name" puts no actual restriction on the format of the name, i.e. there's nothing stopping it from starting with a number. Since the Python interpreter tag is not designed to be parsed but to be an opaque string for code to compare against while still being human-readable makes all of this rather murky.

brettcannon avatar Jul 21 '22 20:07 brettcannon

"Other Python implementations should use sys.implementation.name" puts no actual restriction on the format of the name

It actually does! From https://peps.python.org/pep-0421/:

name A lower-case identifier representing the implementation. Examples include ‘pypy’, ‘jython’, ‘ironpython’, and ‘cpython’.

Based on the discussion so far, I reckon requiring it to either be a f"{shortname}{numbers}" for a shortname we recognise, or an identifier is a reasonable restriction.

pradyunsg avatar Jul 22 '22 17:07 pradyunsg