requests can't properly handle redirects if the response body is encoded in something other than 'utf8'

Open lukasz-kapica opened this issue 6 years ago • 17 comments

As the title says: the response body is encoded in iso-8859-2 and the Location header happens to contain a non-ASCII character, which results in a UnicodeDecodeError being thrown.

Expected Result

Flawless execution of the code.

Actual Result

UnicodeDecodeError

Reproduction Steps

import requests
requests.get("http://www.biblia.deon.pl/ksiega.php?id=3")

System Information

$ python -m requests.help
{
  "chardet": {
    "version": "3.0.4"
  }, 
  "cryptography": {
    "version": "2.3"
  }, 
  "idna": {
    "version": "2.7"
  }, 
  "implementation": {
    "name": "CPython", 
    "version": "2.7.15+"
  }, 
  "platform": {
    "release": "4.18.0-13-generic", 
    "system": "Linux"
  }, 
  "pyOpenSSL": {
    "openssl_version": "1010100f", 
    "version": "18.0.0"
  }, 
  "requests": {
    "version": "2.19.0"
  }, 
  "system_ssl": {
    "version": "1010100f"
  }, 
  "urllib3": {
    "version": "1.23"
  }, 
  "using_pyopenssl": true
}


lukasz-kapica avatar Jan 02 '19 23:01 lukasz-kapica

Hi @loocash, would you mind providing the stacktrace so we can see where exactly this is failing?

nateprewitt avatar Jan 02 '19 23:01 nateprewitt

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/requests/api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 520, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 652, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 652, in <listcomp>
    history = [resp for resp in gen] if allow_redirects else []
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 141, in resolve_redirects
    url = self.get_redirect_target(resp)
  File "/usr/lib/python3/dist-packages/requests/sessions.py", line 116, in get_redirect_target
    return to_native_string(location, 'utf8')
  File "/usr/lib/python3/dist-packages/requests/_internal_utils.py", line 25, in to_native_string
    out = string.decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb3 in position 19: invalid start byte

lukasz-kapica avatar Jan 02 '19 23:01 lukasz-kapica

The encoding of the response body is irrelevant here. The Location header should be strictly ASCII-encoded. (See e.g. https://stackoverflow.com/questions/7654207/what-charset-should-be-used-for-a-location-header-in-a-301-response.)

Requests will (reasonably enough) decode it as utf8, since it is ascii compatible, and ends up being more robust in practice.

In short: The http://www.biblia.deon.pl/ksiega.php?id=3 address is serving an invalid HTTP response.

$ curl -v http://www.biblia.deon.pl/ksiega.php?id=3
*   Trying 104.25.144.117...
* TCP_NODELAY set
* Connected to www.biblia.deon.pl (104.25.144.117) port 80 (#0)
> GET /ksiega.php?id=3 HTTP/1.1
> Host: www.biblia.deon.pl
> User-Agent: curl/7.54.0
> Accept: */*
> 
< HTTP/1.1 301 Moved Permanently
< Date: Tue, 08 Jan 2019 14:25:32 GMT
< Content-Type: text/html
< Transfer-Encoding: chunked
< Connection: keep-alive
< Set-Cookie: __cfduid=d73c8f399ac453a2e4fe967faaa1251c81546957532; expires=Wed, 08-Jan-20 14:25:32 GMT; path=/; domain=.deon.pl; HttpOnly
< Location: otworz.php?skrot=Kp? 1
< J-Cache: HIT
< Server: cloudflare
< CF-RAY: 495f558234763572-LHR

(As an aside, it also doesn't include 'iso-8859-2' in the Content-Type, so there's really no way to determine what the intended encoding of the byte sequence might be.)

Requests could decode the header with errors="ignore" or something like that, in order to be more robust against malformed headers, but it'd just be masking the issue that the response header is malformed.
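
To make the trade-off concrete, here is a minimal offline sketch of how the different error handlers treat such a header. The byte string is an assumption: it is reconstructed from the curl output above plus the 0xb3 byte named in the traceback, and `decode_location` is a hypothetical helper, not part of requests.

```python
# Assumed header bytes: reconstructed from the curl output and the 0xb3 byte
# reported in the traceback; the real response may differ.
raw_location = b"otworz.php?skrot=Kp\xb3 1"


def decode_location(raw, errors="strict"):
    """Hypothetical helper: decode header bytes as UTF-8 with an error handler."""
    return raw.decode("utf8", errors=errors)


# Strict decoding fails on the first byte that is not valid UTF-8.
try:
    decode_location(raw_location)
except UnicodeDecodeError as exc:
    print("strict: ", exc)

# "ignore" silently drops the offending byte, so the target URL changes.
print("ignore: ", decode_location(raw_location, "ignore"))

# "replace" substitutes U+FFFD, which keeps the damage visible.
print("replace:", decode_location(raw_location, "replace"))

# Decoding with the encoding the server apparently used recovers the text,
# but nothing in the response headers announces that encoding.
print("latin2: ", raw_location.decode("iso-8859-2"))
```

Neither "ignore" nor "replace" recovers the URL the server intended; they only avoid the exception.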

lovelydinosaur avatar Jan 08 '19 14:01 lovelydinosaur

@tomchristie Thank you for the answer. Technically speaking it might not be a bug, but I still maintain that following this redirect is the expected behaviour from a library which advertises itself as "HTTP for Humans".

The following Python 3 code works as expected:

import urllib.request
contents = urllib.request.urlopen("http://www.biblia.deon.pl/ksiega.php?id=3").read()
print(contents)

The following Go code works as expected:

package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("http://www.biblia.deon.pl/ksiega.php?id=3")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s", body)
}

Both of them use only the standard library.

lukasz-kapica avatar Jan 08 '19 18:01 lukasz-kapica

So dig into urllib. How does it interpret that byte sequence (where does it redirect to exactly?). Does it just ignore malformed bytes in the location header, or does it do something else?

One resolution here could be to add an errors=... keyword argument to the to_native_string function, and use “ignore” in the get_redirect_target case.
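
A rough sketch of what that could look like, written as a standalone illustration rather than the code in requests. Both function names below are hypothetical, and the sample bytes are an assumption based on the traceback.

```python
def to_native_string_with_errors(string, encoding="ascii", errors="strict"):
    """Hypothetical variant of a to-native-string helper with an `errors` knob.

    Text passes through unchanged; bytes are decoded with the given
    encoding and error handler.
    """
    if isinstance(string, str):
        return string
    return string.decode(encoding, errors)


def lenient_redirect_target(location):
    """Hypothetical redirect lookup: decode a raw Location value leniently."""
    return to_native_string_with_errors(location, "utf8", errors="ignore")


# Assumed sample value; the invalid byte is dropped instead of raising.
print(lenient_redirect_target(b"otworz.php?skrot=Kp\xb3 1"))
```

The default stays "strict", so only call sites that opt in would change behaviour.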

lovelydinosaur avatar Jan 08 '19 22:01 lovelydinosaur

I’d retitle the issue as “Deal with malformed Location header gracefully”.

Erroring is a perfectly legitimate behaviour here, but ignoring the invalid bits of byte sequences might (or might not) be preferable.

lovelydinosaur avatar Jan 08 '19 23:01 lovelydinosaur

Refs #4372

lovelydinosaur avatar Jan 09 '19 10:01 lovelydinosaur

I am running into the same issue and in my case the charset (latin-1) is returned in the Content-Type header. Yet I still get the same error.

I tried the fix from #4933 and that worked for me.

RobReus avatar Feb 12 '19 15:02 RobReus

Encountered the same issue and can confirm that #4933 worked for me also.

putsi avatar Feb 18 '19 17:02 putsi

So, any chance of getting this merged? I'm dealing with websites with international characters and this "feature" is surfacing from time to time interrupting the flow. Browsers deal with malformed location just fine.

StarLightPL avatar Aug 29 '19 11:08 StarLightPL

I have also recently run into this issue and would like to see #4933 merged.

cryzed avatar Jan 19 '20 02:01 cryzed

Hi, I also faced the same issue with a website whose redirect location contains special characters. Any plan to merge #4933? This would solve my issue.

k0urge avatar Feb 04 '20 12:02 k0urge

Any plan to merge #4999?

I think you meant #4933 :-)

StarLightPL avatar Feb 04 '20 13:02 StarLightPL

Any plan to merge #4999?

I think you meant #4933 :-)

Thanks @StarLightPL. I just corrected my typo...

k0urge avatar Feb 04 '20 13:02 k0urge

Any news on #4933?

Davidriquelme avatar May 02 '20 03:05 Davidriquelme

It is too dangerous to just re-encode a latin-1 string to utf-8. Why would a dangerous way of re-encoding a string be "robust"?
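
For reference, a small sketch of the round trip in question. It assumes the header value reaches the caller as a str that was decoded from latin-1 (which is how Python 3's http.client decodes header bytes); `roundtrip` is a hypothetical name and the sample strings are made up for illustration.

```python
def roundtrip(header_text):
    """Hypothetical helper: latin-1 text back to wire bytes, then UTF-8."""
    raw = header_text.encode("latin-1")  # lossless: recovers the wire bytes
    return raw.decode("utf8")


# A value that was valid UTF-8 on the wire, as seen through latin-1.
utf8_on_wire = "Kp\u0142".encode("utf8").decode("latin-1")
print(roundtrip(utf8_on_wire))

# A value that was ISO-8859-2 on the wire, as seen through latin-1.
latin2_on_wire = "Kp\u0142".encode("iso-8859-2").decode("latin-1")
try:
    roundtrip(latin2_on_wire)
except UnicodeDecodeError as exc:
    print("fails:", exc.reason)
```

The latin-1 step only undoes the earlier decoding; the UTF-8 step is where a non-UTF-8 header breaks.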

blinkspark avatar Jan 06 '21 04:01 blinkspark

I can confirm that https://github.com/psf/requests/pull/4933 solves the issue in my case, which also involved a bad value in the Location header of a redirect.

vkruoso avatar Mar 03 '23 15:03 vkruoso