When Location decoding fails, fall back to original
Issue #3888 correctly identified Location headers as usually containing UTF-8
codepoints (when not correctly URL encoded), but this is not always the case.
For example the URL
http://www.finanzen.net/suchergebnis.asp?strSuchString=DE0005933931 redirects
to b'/etf/ishares_core_dax\xae_ucits_etf_de', containing the Latin-1 byte for
the ® character.
If UTF-8 decoding fails, it is better to fall back to the original.
This issue was found via https://stackoverflow.com/questions/47113376/python-3-x-requests-redirect-with-unicode-character
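The fallback described above can be sketched as follows. This is only an illustration operating on the raw header bytes, not the code in this PR; `decode_location` is a hypothetical helper name, not part of requests.

```python
def decode_location(raw: bytes) -> str:
    """Decode a raw Location header value (hypothetical helper).

    Try UTF-8 first; if the bytes are not valid UTF-8, fall back to
    Latin-1, which maps every byte to a code point and so never fails.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


# The redirect target from the example above: 0xAE is not valid UTF-8
# on its own, so the Latin-1 fallback is used.
location = decode_location(b"/etf/ishares_core_dax\xae_ucits_etf_de")
```

Note that a Latin-1 byte sequence that happens to also be valid UTF-8 would still be decoded as UTF-8 by this sketch.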
Crumbs, tests fail on 2.x because it encodes a bytestring (latin-1 encoded), while Python 3 handles a Unicode value. Returning a native latin-1 string should work there.
Nope, to_native_string() returns a str on Python 2. Suggestions to produce consistent output on 2.x and 3.x appreciated; just returning location.decode('latin1') doesn't work either.
And another thought: Python 3 ends up with UTF-8 bytes in the URL-encoded redirection URL regardless of what encoding the server used in the Location header. Surprisingly, this specific server doesn't appear to care (both variants end up accepted and return the same response), but for other servers this may not necessarily be the case. Most will expect the exact same byte sequence to be used for the next location. How should requests handle those?
All of this is distressingly difficult for us to handle appropriately. The biggest issue is that we do not control header decoding on Python 3 (as noted in the code comments above the change you made), so things get tricky fast.
The core issue though is that we cannot "retain the original": we need to transition the string to a native form. Have you tried using to_native_string(resp.headers['location'], 'latin1') to see if that resolves the test failure?
I did, it doesn't, because in Python 2 you'd get a bytestring still. That is then urlencoded to a different representation from the Python 3 Unicode string path.
@mjpieters What is the different urlencoding output in each case?
For the Latin-1 å character, Python 2 outputs %E5 and Python 3 outputs %C3%A5, i.e. the URL-encoded Latin-1 and UTF-8 bytes respectively.
You can reproduce these in Python 3 with:
>>> from urllib.parse import quote
>>> quote('å', encoding='utf8')
'%C3%A5'
>>> quote('å', encoding='latin')
'%E5'
So, just to be clear: when given a byte string in Python 2, the quote library just quotes its bytes directly. When given a unicode string on Python 3, the quote library encodes it and then quotes the bytes?
Exactly. And you can tell quote() what encoding to use too; the default is UTF-8. So if we can store the encoding for the location header (UTF-8, or if that fails, the fallback to Latin-1) we could use that information to re-encode to the same.
That sounds like it'd be the best approach, if we can swing it.
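A rough sketch of the approach discussed above: remember which encoding successfully decoded the header, then pass it to `quote()` so the percent-encoded bytes match what the server sent. The helper names `decode_with_encoding` and `requote_location` are made up for illustration and are not requests APIs; the sketch also assumes the raw header bytes are available, which (as noted earlier in the thread) is not directly the case on Python 3.

```python
from urllib.parse import quote


def decode_with_encoding(raw: bytes):
    """Return (text, encoding), trying UTF-8 first and then Latin-1."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Latin-1 can decode any byte sequence, so this cannot fail.
        return raw.decode("latin-1"), "latin-1"


def requote_location(raw: bytes) -> str:
    """Percent-encode a Location path using the encoding it was decoded with."""
    text, encoding = decode_with_encoding(raw)
    # quote() encodes the str with the given encoding before quoting,
    # so the original byte sequence is preserved in the output.
    return quote(text, safe="/", encoding=encoding)


latin1_path = requote_location(b"/\xe5")              # Latin-1 bytes for å
utf8_path = requote_location("/å".encode("utf-8"))    # UTF-8 bytes for å
```

With this, the Latin-1 input should come out as `/%E5` and the UTF-8 input as `/%C3%A5`, matching the two `quote()` results shown earlier.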
Any update on this?
Edit: I see there's https://github.com/psf/requests/pull/4933 as well.