requests
Error when requesting URL which contains emojis or certain characters
When performing a GET request to a URL that contains emojis, the server responds with a redirect whose Location header also contains emojis. Judging from the stack trace, I believe requests fails while handling redirects when the URL contains emojis or certain other characters, but further investigation could yield better results.
This is the URL in question: https://www.nulled.to/topic/512174-income-ocean-�-hf-leak-�☀️/
It can be found on a forum page, where the source HTML contains these emojis and characters:
https://www.nulled.to/forum/9-tutorials-guides-ebooks-etc/page-779?prune_day=100&sort_by=Z-A&sort_key=start_date&topicfilter=all
Note that because the forum is protected by Cloudflare, the request can return a 403 error, in which case the error described below does not occur. This leads me to believe the error happens only when a redirect occurs: the Location header that requests tries to follow also contains emojis, and that is where the decoding error is raised.
Expected Result
Making the request to the site successfully and returning HTML source code.
Actual Result
An error was raised: UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 48-50: invalid continuation byte
This is the stacktrace:
  File "workdir/env/lib/python3.7/site-packages/requests/api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "workdir/env/lib/python3.7/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "workdir/env/lib/python3.7/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "workdir/env/lib/python3.7/site-packages/requests/sessions.py", line 677, in send
    history = [resp for resp in gen]
  File "workdir/env/lib/python3.7/site-packages/requests/sessions.py", line 677, in <listcomp>
    history = [resp for resp in gen]
  File "workdir/env/lib/python3.7/site-packages/requests/sessions.py", line 150, in resolve_redirects
    url = self.get_redirect_target(resp)
  File "workdir/env/lib/python3.7/site-packages/requests/sessions.py", line 116, in get_redirect_target
    return to_native_string(location, 'utf8')
  File "workdir/env/lib/python3.7/site-packages/requests/_internal_utils.py", line 25, in to_native_string
    out = string.decode(encoding)
Reproduction Steps
import requests
url = "https://www.nulled.to/topic/512174-income-ocean-�-hf-leak-�☀️/"
r = requests.get(url)
print(r.content)
System Information
$ python -m requests.help
{
"chardet": {
"version": "4.0.0"
},
"cryptography": {
"version": ""
},
"idna": {
"version": "2.10"
},
"implementation": {
"name": "CPython",
"version": "3.7.3"
},
"platform": {
"release": "4.19.0-22-amd64",
"system": "Linux"
},
"pyOpenSSL": {
"openssl_version": "",
"version": null
},
"requests": {
"version": "2.25.1"
},
"system_ssl": {
"version": "101010ef"
},
"urllib3": {
"version": "1.26.3"
},
"using_pyopenssl": false
}
This is related to #3969. We decode the redirect URL as UTF-8, and it is that conversion from bytes to a string that is failing.
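To illustrate the failure mode described in this comment, here is a minimal sketch in plain Python. The byte strings are made up for illustration and are not the actual header sent by the site, and `try_decode_location` is a hypothetical helper, not part of requests. It only shows that decoding bytes that are not valid UTF-8 raises the same exception type as in the traceback.

```python
def try_decode_location(raw_location: bytes, encoding: str = "utf8"):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return raw_location.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc


# Hypothetical header bytes. 0xE9 is "é" in Latin-1, but in UTF-8 it
# announces a three-byte sequence, so the "-" that follows is invalid.
bad_location = b"/topic/income-ocean-\xe9-hf-leak/"

# The same kind of path, this time properly encoded as UTF-8.
good_location = "/topic/income-ocean-\u2600/".encode("utf8")

bad_text, bad_err = try_decode_location(bad_location)
good_text, good_err = try_decode_location(good_location)

print("bad:", bad_text, type(bad_err).__name__)
print("good:", good_text, good_err)
```

If the server really does send a Location header in some other encoding, this is the point at which the redirect handling would break.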
I suspect there's something other than emoji in that URL.
It looks like the replacement character: U+FFFD REPLACEMENT CHARACTER.
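One way to check which characters a URL actually contains is to list every non-ASCII code point with its Unicode name. This is only a sketch: `describe_non_ascii` is a hypothetical helper, and the sample string assumes the two unknown characters in the reported URL are literal U+FFFD characters, as they appear in this issue.

```python
import unicodedata


def describe_non_ascii(text: str):
    """List (code point, Unicode name) for every non-ASCII character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
        if ord(ch) > 127
    ]


# Assumed copy of the path segment from the report, with the unknown
# characters written explicitly as U+FFFD.
sample = "income-ocean-\ufffd-hf-leak-\ufffd\u2600\ufe0f"

for code_point, name in describe_non_ascii(sample):
    print(code_point, name)
```

If U+FFFD shows up in the output, the original characters were already lost before the URL was copied, so the string no longer identifies the real topic URL.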
I've seen a number of issues related to this. The best solution would be to find out which encoding browsers use and replicate it: Firefox, for instance, encodes the URL with something other than UTF-8 and no redirect happens. Unfortunately, I have not been able to determine which encoding it uses.
We could fix this by maintaining a list of common encodings. Wrap the code responsible for decoding in a try/except block, loop through the encodings in the list, and try to decode the given URL with each one. The first one that works breaks the loop, and the code will be pretty much bug free.
Well, the most common encodings you might encounter depend on what part of the world you're in. So we would be looping for a while, which would drastically hurt performance.
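The fallback loop proposed above could be sketched roughly as follows. This is not how requests behaves today; `decode_with_fallback` and the `CANDIDATE_ENCODINGS` list are assumptions made for illustration. The list is kept short on purpose, which bounds the cost of the loop but also means regional encodings are not covered.

```python
from typing import Iterable, Tuple

# Hypothetical candidate list, tried in order. Latin-1 accepts any byte
# sequence, so it acts as a last resort and must come last.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(
    raw: bytes, encodings: Iterable[str] = CANDIDATE_ENCODINGS
) -> Tuple[str, str]:
    """Return (text, encoding) for the first encoding that decodes raw."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the value")


print(decode_with_fallback("/topic/\u2600/".encode("utf-8")))
print(decode_with_fallback(b"/topic/caf\xe9/"))
```

Note that a successful decode does not prove the guess was right: several single-byte encodings accept the same bytes and simply produce different text, so the resulting URL may still not be the one the server intended.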
I tried to reproduce the error with the same URL and with a different URL that contains emojis or other special characters. There seems to be an issue with the given URL: I am able to get the content from a different URL containing emojis by decoding it with a specific encoding.
import requests
url = "https://www.example.com/🌟emoji-example🌟"
r = requests.get(url)
content = r.content.decode('ISO-8859-1')
print(content)