requests
requests can't properly handle redirects if the response body is encoded in something other than 'utf8'
As the title says: the response body is encoded in iso-8859-2, and the Location header happens to contain a non-ASCII character, which results in a UnicodeDecodeError being thrown.
Expected Result
Flawless execution of the code.
Actual Result
UnicodeDecodeError
Reproduction Steps
import requests
requests.get("http://www.biblia.deon.pl/ksiega.php?id=3")
System Information
$ python -m requests.help
{
  "chardet": {
    "version": "3.0.4"
  },
  "cryptography": {
    "version": "2.3"
  },
  "idna": {
    "version": "2.7"
  },
  "implementation": {
    "name": "CPython",
    "version": "2.7.15+"
  },
  "platform": {
    "release": "4.18.0-13-generic",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "1010100f",
    "version": "18.0.0"
  },
  "requests": {
    "version": "2.19.0"
  },
  "system_ssl": {
    "version": "1010100f"
  },
  "urllib3": {
    "version": "1.23"
  },
  "using_pyopenssl": true
}
Hi @loocash, would you mind providing the stacktrace so we can see where exactly this is failing?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/requests/api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 520, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 652, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 652, in <listcomp>
    history = [resp for resp in gen] if allow_redirects else []
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 141, in resolve_redirects
    url = self.get_redirect_target(resp)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 116, in get_redirect_target
    return to_native_string(location, 'utf8')
  File "/usr/lib/python3/dist-packages/requests/_internal_utils.py", line 25, in to_native_string
    out = string.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 19: invalid start byte
The encoding of the response body is irrelevant here. The Location header should be strictly ASCII encoded. (See e.g. https://stackoverflow.com/questions/7654207/what-charset-should-be-used-for-a-location-header-in-a-301-response.)
Requests (reasonably enough) decodes it as utf8, since utf8 is ASCII-compatible and ends up being more robust in practice.
In short: The http://www.biblia.deon.pl/ksiega.php?id=3 address is serving an invalid HTTP response.
$ curl -v http://www.biblia.deon.pl/ksiega.php?id=3
* Trying 104.25.144.117...
* TCP_NODELAY set
* Connected to www.biblia.deon.pl (104.25.144.117) port 80 (#0)
> GET /ksiega.php?id=3 HTTP/1.1
> Host: www.biblia.deon.pl
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 301 Moved Permanently
< Date: Tue, 08 Jan 2019 14:25:32 GMT
< Content-Type: text/html
< Transfer-Encoding: chunked
< Connection: keep-alive
< Set-Cookie: __cfduid=d73c8f399ac453a2e4fe967faaa1251c81546957532; expires=Wed, 08-Jan-20 14:25:32 GMT; path=/; domain=.deon.pl; HttpOnly
< Location: otworz.php?skrot=Kp? 1
< J-Cache: HIT
< Server: cloudflare
< CF-RAY: 495f558234763572-LHR
(As an aside, the response also doesn't include 'iso-8859-2' in the Content-Type header, so there's really no way to determine the intended encoding of the byte sequence.)
Requests could decode the header with errors="ignore" or something like that, in order to be more robust against malformed headers, but it'd just be masking the issue that the response header is malformed.
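To make the trade-off concrete, here is a minimal sketch of how the standard `bytes.decode` error handlers treat a header like the one above. The byte string is an assumption reconstructed from the traceback (byte 0xb3 at position 19) and the curl output; it is not a capture of the real response, and `decode_location` is a hypothetical helper, not part of Requests.

```python
# Assumed raw Location value: the traceback reports byte 0xb3 at position 19.
raw_location = b"otworz.php?skrot=Kp\xb3 1"


def decode_location(raw, errors="strict"):
    """Hypothetical helper: decode header bytes as UTF-8 with a given error handler."""
    return raw.decode("utf-8", errors=errors)


try:
    decode_location(raw_location)
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc  # same kind of error as reported in this issue

ignored = decode_location(raw_location, errors="ignore")    # drops the bad byte
replaced = decode_location(raw_location, errors="replace")  # substitutes U+FFFD
as_latin2 = raw_location.decode("iso-8859-2")                # one plausible intended reading

print(strict_error)
print(repr(ignored), repr(replaced), repr(as_latin2))
```

Note that `ignore` yields a URL that differs from whatever the server intended, which is the masking concern raised above.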
@tomchristie Thank you for the answer. Technically speaking it might not be a bug, but I still maintain that handling this case gracefully is the behaviour one expects from a library that advertises itself as "HTTP for Humans".
The following Python 3 code works as expected:
import urllib.request
contents = urllib.request.urlopen("http://www.biblia.deon.pl/ksiega.php?id=3").read()
print(contents)
The following Go code works as expected:
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("http://www.biblia.deon.pl/ksiega.php?id=3")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s", body)
}
Both of them use only the standard library.
So dig into urllib. How does it interpret that byte sequence (where does it redirect to exactly?). Does it just ignore malformed bytes in the location header, or does it do something else?
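One way a client can follow such a redirect without guessing the charset is to percent-encode the non-ASCII bytes exactly as received. The sketch below illustrates that idea with `urllib.parse`; it is not a claim about what urllib or Go's net/http actually do internally. The raw header bytes are an assumption reconstructed from the traceback, and `resolve_redirect` is a hypothetical helper.

```python
import string
from urllib.parse import quote_from_bytes, urljoin

request_url = "http://www.biblia.deon.pl/ksiega.php?id=3"
raw_location = b"otworz.php?skrot=Kp\xb3 1"  # assumed header bytes


def resolve_redirect(base_url, raw):
    """Hypothetical helper: escape unsafe bytes, then resolve against the request URL."""
    # ASCII punctuation (including any existing %-escapes) is left untouched;
    # the remaining unsafe bytes (here the space and 0xb3) are percent-encoded
    # byte by byte, so no charset has to be guessed.
    escaped = quote_from_bytes(raw, safe=string.punctuation)
    return urljoin(base_url, escaped)


redirect_url = resolve_redirect(request_url, raw_location)
print(redirect_url)
```

Whether the server accepts the escaped form is a separate question; the point is only that the redirect target can be built without decoding the bytes at all.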
One resolution here could be to add an errors=... keyword argument to the to_native_string function, and use “ignore” in the get_redirect_target case.
I’d retitle the issue as “Deal with malformed Location header gracefully”.
Erroring is a perfectly legitimate behaviour here, but ignoring the invalid bits of byte sequences might (or might not) be preferable.
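A rough sketch of what that suggestion could look like, for Python 3 only. Both function names below are hypothetical stand-ins and do not reproduce the real Requests code; the input bytes are an assumption reconstructed from the traceback.

```python
def to_native_string_lenient(string, encoding="ascii", errors="strict"):
    """Hypothetical Python 3 variant of to_native_string with an errors argument."""
    if isinstance(string, str):
        return string  # already a native string, nothing to decode
    return string.decode(encoding, errors=errors)


def get_redirect_target_sketch(raw_location):
    """Hypothetical caller: tolerate malformed bytes instead of raising."""
    return to_native_string_lenient(raw_location, "utf8", errors="ignore")


target = get_redirect_target_sketch(b"otworz.php?skrot=Kp\xb3 1")
print(repr(target))
```

Keeping `errors="strict"` as the default would leave every other caller unchanged, so only the redirect path becomes lenient.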
Refs #4372
I am running into the same issue and in my case the charset (latin-1) is returned in the Content-Type header. Yet I still get the same error.
I tried the fix from #4933 and that worked for me.
Encountered the same issue and can confirm that #4933 worked for me also.
So, any chance of getting this merged? I'm dealing with websites with international characters and this "feature" is surfacing from time to time interrupting the flow. Browsers deal with malformed location just fine.
I have also recently run into this issue and would like to see #4933 merged.
Hi, I also faced the same issue with a website whose redirect location contains special characters. Any plan to merge #4933? This would solve my issue.
Any plan to merge #4999?
I think you meant #4933 :-)
Thanks @StarLightPL. I just corrected my typo...
Any news on #4933?
It is too dangerous to just re-encode a latin-1 string to utf-8. Why would a dangerous way of re-encoding a string be "robust"?
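The concern can be illustrated without reference to any particular patch. The sketch assumes the header text reaches the caller already decoded as latin-1, so the original bytes can be recovered by encoding it back; `reinterpret_latin1_as_utf8` is a hypothetical helper and the header values are made up for illustration.

```python
def reinterpret_latin1_as_utf8(header_text):
    """Hypothetical helper: recover the raw bytes of a latin-1 decoded header and read them as UTF-8."""
    return header_text.encode("latin-1").decode("utf-8")


# A header whose bytes really were UTF-8 (the letter U+0142 is 0xc5 0x82) round-trips correctly.
utf8_header = b"skrot=Kp\xc5\x82".decode("latin-1")
recovered = reinterpret_latin1_as_utf8(utf8_header)

# A header whose bytes were ISO-8859-2 does not: a lone 0xb3 is not valid UTF-8.
latin2_header = b"skrot=Kp\xb3".decode("latin-1")
try:
    reinterpret_latin1_as_utf8(latin2_header)
    failed = False
except UnicodeDecodeError:
    failed = True

print(repr(recovered), failed)
```

The latin-1 round trip itself is lossless; the risk lies in assuming that the recovered bytes are UTF-8.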
I can confirm that https://github.com/psf/requests/pull/4933 solves the issue in my case, which also involved a bad value in the Location header of a redirect.