
`Version` and `Specifier` accept (erroneously) some non-ASCII letters in the *local version* segment

Open zuo opened this issue 3 years ago • 2 comments

Reproducing the behavior concerning packaging.version.Version:

Python 3.9.7 (default, Oct  4 2021, 18:09:29) 
[...]
>>> import packaging.version
>>> packaging.version.Version('1.2+\u0130\u0131\u017f\u212a')
<Version('1.2+i̇ıſk')>

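The cause is that packaging.version.VERSION_PATTERN uses a-z character ranges together with re.IGNORECASE and re.UNICODE (implicit for str patterns in Python 3.x); see the 2nd paragraph of this fragment: https://docs.python.org/3/library/re.html#re.IGNORECASE. With that combination, [a-z] matches not only the 52 ASCII letters but also four non-ASCII ones: 'İ' (U+0130), 'ı' (U+0131), 'ſ' (U+017F) and 'K' (U+212A).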
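A minimal sketch of the effect, using only the standard re module. The one-class pattern below is a toy stand-in, not the real VERSION_PATTERN; the names SPECIAL, unicode_ci and ascii_ci are made up for illustration.

```python
import re

# The four non-ASCII letters that [a-z] matches under re.IGNORECASE
# when the pattern is a str (Unicode) pattern.
SPECIAL = "\u0130\u0131\u017f\u212a"  # U+0130, U+0131, U+017F, U+212A

# Toy patterns (not packaging's VERSION_PATTERN).
unicode_ci = re.compile(r"[a-z]+", re.IGNORECASE)
ascii_ci = re.compile(r"[a-z]+", re.IGNORECASE | re.ASCII)

# Unicode-aware case-insensitive matching accepts all four characters.
matches_unicode = unicode_ci.fullmatch(SPECIAL) is not None

# With re.ASCII, case-insensitive matching is limited to ASCII letters.
matches_ascii = ascii_ci.fullmatch(SPECIAL) is not None

print(matches_unicode, matches_ascii)
```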

It can be fixed in one of the following two ways:

  • either by adding re.ASCII to the flags (but then both occurrences of \s* in the actual regex will match ASCII whitespace characters only!);
  • or by removing re.IGNORECASE from the flags, replacing (in VERSION_PATTERN) both occurrences of a-z with A-Za-z, and adding suitable upper-case alternatives in the pre_l, post_l and dev_l regex groups, e.g., [aA][lL][pP][hH][aA] in place of alpha (quite cumbersome...).
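The trade-off between the two options can be sketched with toy patterns (assumptions for illustration only; they are much simpler than the real VERSION_PATTERN, and the names opt1, opt2 and alpha_any_case are hypothetical):

```python
import re

EM_SPACE = "\u2003"   # a non-ASCII whitespace character
KELVIN = "\u212a"     # KELVIN SIGN, one of the four problematic letters

# Option 1: keep re.IGNORECASE and add re.ASCII.
# Side effect: \s now matches ASCII whitespace only.
opt1 = re.compile(r"\s*[a-z]+\s*", re.IGNORECASE | re.ASCII)

# Option 2: drop re.IGNORECASE and spell out both cases explicitly.
# \s keeps its Unicode meaning.
opt2 = re.compile(r"\s*[A-Za-z]+\s*")

# Keywords such as "alpha" then need per-letter alternatives.
alpha_any_case = re.compile(r"[aA][lL][pP][hH][aA]")

# Both options reject the non-ASCII letter ...
opt1_rejects_kelvin = opt1.fullmatch(KELVIN) is None
opt2_rejects_kelvin = opt2.fullmatch(KELVIN) is None

# ... but only option 2 still tolerates surrounding Unicode whitespace.
opt1_accepts_em_space = opt1.fullmatch("abc" + EM_SPACE) is not None
opt2_accepts_em_space = opt2.fullmatch("abc" + EM_SPACE) is not None
```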

zuo avatar Oct 22 '21 19:10 zuo

The whitespace issue can probably be worked around by stripping surrounding whitespace separately, before the actual parsing.

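import re

class Version:
    # re.ASCII keeps [a-z] + re.IGNORECASE from matching non-ASCII letters.
    _regex = re.compile(VERSION_PATTERN, re.VERBOSE | re.IGNORECASE | re.ASCII)

    def __init__(self, version: str) -> None:
        # str.strip() also removes non-ASCII whitespace, which the
        # ASCII-only \s* in the pattern would no longer match.
        match = self._regex.match(version.strip())
        # ... The rest is the same ...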
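A self-contained sketch of the same idea, assuming a heavily simplified toy pattern in place of VERSION_PATTERN (TOY_PATTERN and parse_toy_version are hypothetical names, not part of packaging):

```python
import re

# Toy stand-in for VERSION_PATTERN: release digits plus an optional
# local version segment. Not the real packaging pattern.
TOY_PATTERN = r"[0-9]+(?:\.[0-9]+)*(?:\+[a-z0-9]+(?:[-_.][a-z0-9]+)*)?"
_toy_regex = re.compile(TOY_PATTERN, re.IGNORECASE | re.ASCII)


def parse_toy_version(version: str) -> str:
    """Strip Unicode whitespace first, then match with ASCII-only rules."""
    match = _toy_regex.fullmatch(version.strip())
    if match is None:
        raise ValueError(f"Invalid version: {version!r}")
    return match.group(0)


# Surrounding non-ASCII whitespace (EM SPACE) is still tolerated ...
parsed = parse_toy_version("\u20031.2+ABC\u2003")

# ... while non-ASCII letters in the local segment are rejected.
try:
    parse_toy_version("1.2+\u0130\u0131\u017f\u212a")
    rejected = False
except ValueError:
    rejected = True
```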

uranusjr avatar Oct 22 '21 19:10 uranusjr

Reproducing the behavior concerning packaging.specifiers.Specifier:

Python 3.9.7 (default, Oct  4 2021, 18:09:29) 
[...]
>>> import packaging.specifiers
>>> packaging.specifiers.Specifier('==1.2+\u0130\u0131\u017f\u212a')
<Specifier('==1.2+İıſK')>

The cause is the same as the aforementioned Version-related one, except that here it relates to (non-public) Specifier._regex_str and Specifier._regex regular expression definitions.

In these regexes, \s occurs in various places of the effective pattern (not only at its start and end). So the solution based on re.ASCII combined with .strip()-based value preparation (proposed above by @uranusjr for Version) cannot be applied here without restricting whitespace matching to ASCII-only characters, and that restriction would also affect the negated character range used for the === operator. I suppose that would be too disruptive.

Instead of that, an additional check can be performed when the operator is not '===' -- something along the lines of:

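# str.split() with no arguments splits on any Unicode whitespace.
without_whitespace = ''.join(spec.split())
if not without_whitespace.isascii():
    # The message text here is only illustrative.
    raise InvalidSpecifier(f"Invalid specifier: '{spec}'")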
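A runnable sketch of that check in isolation. The exception class and both function names are hypothetical stand-ins, and detecting === with startswith is a simplifying assumption; in packaging the operator would come from the regex match.

```python
class InvalidSpecifierStandIn(ValueError):
    """Hypothetical stand-in for packaging.specifiers.InvalidSpecifier."""


def check_ascii_outside_whitespace(spec: str) -> None:
    # Arbitrary equality (===) may compare against arbitrary strings,
    # so this extra check is skipped for it (simplified detection).
    if spec.strip().startswith("==="):
        return
    # str.split() with no arguments splits on any Unicode whitespace,
    # so Unicode whitespace itself never triggers the error.
    without_whitespace = "".join(spec.split())
    if not without_whitespace.isascii():
        raise InvalidSpecifierStandIn(f"Invalid specifier: {spec!r}")


def is_rejected(spec: str) -> bool:
    """Return True if the extra check raises for this specifier string."""
    try:
        check_ascii_outside_whitespace(spec)
    except InvalidSpecifierStandIn:
        return True
    return False
```

With this sketch, '==1.2+\u0130\u0131\u017f\u212a' is rejected, while a specifier that merely has non-ASCII whitespace around it, or one that uses ===, passes through to the regular regex matching.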

zuo avatar Oct 23 '21 23:10 zuo