requests When Location decoding fails, fall back to original

Issue #3888 correctly identified Location headers as usually containing UTF-8 codepoints (when not correctly URL encoded), but this is not always the case. For example the URL http://www.finanzen.net/suchergebnis.asp?strSuchString=DE0005933931 redirects to b'/etf/ishares_core_dax\xae_ucits_etf_de', containing the Latin-1 byte for the ® character.

If UTF-8 decoding fails, it is better to fall back to the original.

This issue was found via https://stackoverflow.com/questions/47113376/python-3-x-requests-redirect-with-unicode-character

Nov 04 '17 18:11 mjpieters

Crumbs, tests fail on 2.x because it encodes a bytestring (latin-1 encoded), while Python 3 handles a Unicode value. Returning a native latin-1 string should work there.

Nov 04 '17 18:11 mjpieters

Nope, to_native_string() returns a str on Python 2. Suggestions to produce consistent output on 2.x and 3.x appreciated; just returning location.decode('latin1') doesn't work either.

Nov 04 '17 18:11 mjpieters

And another thought: Python 3 ends up with UTF8 bytes in the URL-encoded redirection URL regardless of what encoding the server used in the Location header. Surprisingly, this specific server doesn't appear to care (both variants end accepted and return the same response), but for other servers this may necessarily be the same. Most will expect the exact same byte sequence to be used for the next location. How should requests handle those?

Nov 04 '17 21:11 mjpieters

All of this is distressingly difficult for us to handle appropriately. The biggest issue is that we do not control header decoding on Python 3 (as noted in the code comments above the change you made), so things get tricky fast.

The core issue though is that we cannot "retain the original": we need to transition the string to a native form. Have you tried using to_native_string(resp.headers['location'], 'latin1') to see if that resolves the test failure?

Nov 05 '17 09:11 Lukasa

Have you tried using to_native_string(resp.headers['location'], 'latin1') to see if that resolves the test failure?

I did, it doesn't, because in Python 2 you'd get a bytestring still. That is then urlencoded to a different representation from the Python 3 Unicode string path.

Nov 05 '17 16:11 mjpieters

@mjpieters What is the different urlencoding output in each case?

Nov 05 '17 16:11 Lukasa

For the Latin1 å character, Python 2 outputs %E5, Python 3 %C3%A5, so the Latin-1 and UTF-8 bytes URL-encoded.

You can reproduce these in Python 3 with:

>>> from urllib.parse import quote
>>> quote('å', encoding='utf8')
'%C3%A5'
>>> quote('å', encoding='latin')
'%E5'

Nov 07 '17 14:11 mjpieters

So, just to be clear: when given a byte string in Python 2, the quote library just quotes its bytes directly. When given a unicode string on Python 3, the quote library encodes it and then quotes the bytes?

Nov 07 '17 20:11 Lukasa

Exactly. And you can tell quote() what encoding to use too; the default is UTF-8. So if we can store the encoding for the location header (UTF-8, or if that fails, the fallback to Latin-1) we could use that information to re-encode to the same.

Nov 09 '17 15:11 mjpieters

That sounds like it'd be the best approach, if we can swing it.

Nov 10 '17 21:11 Lukasa

Any update on this?

Edit: I see there's https://github.com/psf/requests/pull/4933 as well.

Jan 19 '20 02:01 imtbl