pip Handle req file decode failures on locale encoding

For the case where:

a requirements file is encoded as UTF-8, and
some bytes in the file are incompatible with the system locale

In this case, fallback to decoding as UTF-8 as a last resort (rather than crashing on the UnicodeDecodeError). This behaviour was added when parsing the request file, rather than in auto_decode as it didn't seem to belong in a generic util (though that util looks to only be ever called when parsing requirements files anyway).

Perhaps we should just go straight to UTF-8 without querying the system locale (unless there is a PEP-263 style comment), per the docs[1]:

Requirements files are utf-8 encoding by default

But to avoid a breaking change just warn if decoding with this locale fails then fallback to UTF-8

[1] https://pip.pypa.io/en/stable/reference/requirements-file-format/#encoding

Fixes: #12771

Jun 25 '24 17:06 matthewhughes934

Hmm, since the documentation says it's utf-8 unless there is a PEP-263 style comment, shouldn't we rather decode as utf8 is there is no such comment, and if that fails, fallback to the current locale.getpreferredencoding(False) or sys.getdefaultencoding() with a deprecation warning recommending to add the encoding comment?

That way we have a (more or less) non-breaking path to compliance with the docs?

Also, I'd put all that in auto_decode, with a docstring comment that the function is meant for requirements.txt decoding as per the docs.

Oct 13 '24 11:10 sbidoul

Hmm, since the documentation says it's utf-8 unless there is a PEP-263 style comment, shouldn't we rather decode as utf8 is there is no such comment, and if that fails, fallback to the current locale.getpreferredencoding(False) or sys.getdefaultencoding() with a deprecation warning recommending to add the encoding comment?

That way we have a (more or less) non-breaking path to compliance with the docs?

Also, I'd put all that in auto_decode, with a docstring comment that the function is meant for requirements.txt decoding as per the docs.

This sounds reasonable, though I think this change would need to be made in auto_deocde rather than where I made it. But now I am wondering if auto_decode needs to live where it does: it's only ever used for decoding requirements files, so maybe it can just live in req_file

Oct 14 '24 18:10 matthewhughes934

Sounds good. I've removed from the 24.3 milestone. Feel free to ping me when you get back to this.

Oct 20 '24 16:10 sbidoul

Sounds good. I've removed from the 24.3 milestone. Feel free to ping me when you get back to this.

:+1: I've updated the change and title+description. It was basically a re-do so I just stomped my previous commits.

per the description: I found and fixed another bug while testing this: requirements files starting with a UTF-32 LE BOM would always be decoded as UTF-16 LE

Oct 22 '24 20:10 matthewhughes934

Oh and sorry for taking so long to re-review your PR. It fell off our radar, and then we haven't had the time to review it honestly.

Dec 26 '24 00:12 ichard26

Latest change was just the result of me rebasing on main and squashing all the fixup commits together

Dec 28 '24 18:12 matthewhughes934

There's a test failure https://github.com/pypa/pip/actions/runs/12528829883/job/34943820232?pr=12795, error:

FAILED tests/functional/test_install.py::test_vcs_url_urlquote_normalization - pip._internal.exceptions.InstallationSubprocessError: bzr checkout --lightweight --quiet http://bazaar.launchpad.net/%7Edjango-wikiapp/django-wikiapp/release-0.1 /tmp/pytest-of-runner/pytest-1/popen-gw1/test_vcs_url_urlquote_normaliz0/cache/release-0.1 exited with 3

It looks like bzr checkout failed with an error? Maybe some intermittent/network issue: it ran ok for all other Python versions+OS (I can't see the button to re-run jobs, I assume due to lack of permissions)

Dec 28 '24 19:12 matthewhughes934

I hit "rerun failed jobs". Let's see if that fixes it.

Dec 28 '24 20:12 pfmoore

@pfmoore do you have any objections with the current requirements file specification. I've copied it below for posterity.

The default encoding for requirement files is UTF-8 unless a different encoding is specified using a PEP 263 style comment (e.g. # -*- coding: <encoding name> -*-).

[!warning] pip will fallback to the locale defined encoding if UTF-8 decoding fails. This is a quirk of pip's parser. This behaviour is deprecated and should not be relied upon.

In addition, would you like to formally deprecate the fallback to the locale/system encoding? I'm not particularly interested in hard deprecating the fallback and dealing with the churn, especially since it'll be probably be encountered by inexperienced users who are simply using their OS text editor or whatever.

Jan 23 '25 18:01 ichard26

Honestly, I don't care. As long as we don't break what the current documentation says:

Requirements files are utf-8 encoding by default and also support PEP 263 style comments to change the encoding (i.e. # -- coding: --).

then we can do whatever we like. I personally would rather we didn't document any more of the current behaviour, as that just constrains us in the future. And I'm not particularly keen on adding a warning that essentially says "we currently do X, but you shouldn't rely on that" when people shouldn't have been relying on it anyway, as it was undocumented. But it's not something I'm going to spend any time debating - if you want to add it, fair enough.

Jan 23 '25 21:01 pfmoore

This looks like this is good to go. Any particular reason you postponed it to 25.1 @ichard26 ?

Jan 24 '25 18:01 sbidoul

Nothing in particular. I just wasn't sure if everyone was on board with the changes and I didn't want to merge something that was still under active discussion :)

I'll take a look tonight and merge if you're happy with this.

Jan 24 '25 18:01 ichard26

I personally would rather we didn't document any more of the current behaviour, as that just constrains us in the future.

I don't think that's true in this case? I sympathise that having anything in the documentation lends it credibility and visibility that an undocumented implementation-only detail does not have, but in this case, it's documented explicitly as deprecated behaviour that should not be relied upon. For better or worse, while we are moving towards formal specifications, pip's behaviour is still a de facto standard, so it's important to acknowledge the parts where we diverge from the specification IMO.

I haven't spent the past many years writing, reviewing, or implementing standards though, so I'm happy to defer to your judgement. I'll remove that warning in the documentation.

And I'm not particularly keen on adding a warning that essentially says "we currently do X, but you shouldn't rely on that" when people shouldn't have been relying on it anyway, as it was undocumented.

Well, originally the suggestion was to issue a deprecation warning and remove the fallback outright. I guess we're on the same page it's not really worth it. Similar to the documentation callout, I believe it's valuable to warn on standards violations to nudge towards standard compliance (aka use a PEP 263 comment).

Jan 24 '25 22:01 ichard26