Handle req file decode failures on locale encoding
For the case where:
- a requirements file is encoded as UTF-8, and
- some bytes in the file are incompatible with the system locale
In this case, fallback to decoding as UTF-8 as a last resort (rather than crashing on the UnicodeDecodeError). This behaviour was added when parsing the request file, rather than in auto_decode as it didn't seem to belong in a generic util (though that util looks to only be ever called when parsing requirements files anyway).
Perhaps we should just go straight to UTF-8 without querying the system locale (unless there is a PEP-263 style comment), per the docs[1]:
Requirements files are utf-8 encoding by default
But to avoid a breaking change just warn if decoding with this locale fails then fallback to UTF-8
[1] https://pip.pypa.io/en/stable/reference/requirements-file-format/#encoding
Fixes: #12771
Hmm, since the documentation says it's utf-8 unless there is a PEP-263 style comment, shouldn't we rather decode as utf8 is there is no such comment, and if that fails, fallback to the current locale.getpreferredencoding(False) or sys.getdefaultencoding() with a deprecation warning recommending to add the encoding comment?
That way we have a (more or less) non-breaking path to compliance with the docs?
Also, I'd put all that in auto_decode, with a docstring comment that the function is meant for requirements.txt decoding as per the docs.
Hmm, since the documentation says it's utf-8 unless there is a PEP-263 style comment, shouldn't we rather decode as utf8 is there is no such comment, and if that fails, fallback to the current
locale.getpreferredencoding(False) or sys.getdefaultencoding()with a deprecation warning recommending to add the encoding comment?That way we have a (more or less) non-breaking path to compliance with the docs?
Also, I'd put all that in auto_decode, with a docstring comment that the function is meant for requirements.txt decoding as per the docs.
This sounds reasonable, though I think this change would need to be made in auto_deocde rather than where I made it. But now I am wondering if auto_decode needs to live where it does: it's only ever used for decoding requirements files, so maybe it can just live in req_file
Sounds good. I've removed from the 24.3 milestone. Feel free to ping me when you get back to this.
Sounds good. I've removed from the 24.3 milestone. Feel free to ping me when you get back to this.
:+1: I've updated the change and title+description. It was basically a re-do so I just stomped my previous commits.
per the description: I found and fixed another bug while testing this: requirements files starting with a UTF-32 LE BOM would always be decoded as UTF-16 LE
Oh and sorry for taking so long to re-review your PR. It fell off our radar, and then we haven't had the time to review it honestly.
Latest change was just the result of me rebasing on main and squashing all the fixup commits together
There's a test failure https://github.com/pypa/pip/actions/runs/12528829883/job/34943820232?pr=12795, error:
FAILED tests/functional/test_install.py::test_vcs_url_urlquote_normalization - pip._internal.exceptions.InstallationSubprocessError: bzr checkout --lightweight --quiet http://bazaar.launchpad.net/%7Edjango-wikiapp/django-wikiapp/release-0.1 /tmp/pytest-of-runner/pytest-1/popen-gw1/test_vcs_url_urlquote_normaliz0/cache/release-0.1 exited with 3
It looks like bzr checkout failed with an error? Maybe some intermittent/network issue: it ran ok for all other Python versions+OS (I can't see the button to re-run jobs, I assume due to lack of permissions)
I hit "rerun failed jobs". Let's see if that fixes it.
@pfmoore do you have any objections with the current requirements file specification. I've copied it below for posterity.
The default encoding for requirement files is UTF-8 unless a different encoding is specified using a PEP 263 style comment (e.g. # -*- coding: <encoding name> -*-).
[!warning] pip will fallback to the locale defined encoding if
UTF-8decoding fails. This is a quirk of pip's parser. This behaviour is deprecated and should not be relied upon.
In addition, would you like to formally deprecate the fallback to the locale/system encoding? I'm not particularly interested in hard deprecating the fallback and dealing with the churn, especially since it'll be probably be encountered by inexperienced users who are simply using their OS text editor or whatever.
Honestly, I don't care. As long as we don't break what the current documentation says:
Requirements files are utf-8 encoding by default and also support PEP 263 style comments to change the encoding (i.e. # -- coding:
- -).
then we can do whatever we like. I personally would rather we didn't document any more of the current behaviour, as that just constrains us in the future. And I'm not particularly keen on adding a warning that essentially says "we currently do X, but you shouldn't rely on that" when people shouldn't have been relying on it anyway, as it was undocumented. But it's not something I'm going to spend any time debating - if you want to add it, fair enough.
This looks like this is good to go. Any particular reason you postponed it to 25.1 @ichard26 ?
Nothing in particular. I just wasn't sure if everyone was on board with the changes and I didn't want to merge something that was still under active discussion :)
I'll take a look tonight and merge if you're happy with this.
I personally would rather we didn't document any more of the current behaviour, as that just constrains us in the future.
I don't think that's true in this case? I sympathise that having anything in the documentation lends it credibility and visibility that an undocumented implementation-only detail does not have, but in this case, it's documented explicitly as deprecated behaviour that should not be relied upon. For better or worse, while we are moving towards formal specifications, pip's behaviour is still a de facto standard, so it's important to acknowledge the parts where we diverge from the specification IMO.
I haven't spent the past many years writing, reviewing, or implementing standards though, so I'm happy to defer to your judgement. I'll remove that warning in the documentation.
And I'm not particularly keen on adding a warning that essentially says "we currently do X, but you shouldn't rely on that" when people shouldn't have been relying on it anyway, as it was undocumented.
Well, originally the suggestion was to issue a deprecation warning and remove the fallback outright. I guess we're on the same page it's not really worth it. Similar to the documentation callout, I believe it's valuable to warn on standards violations to nudge towards standard compliance (aka use a PEP 263 comment).