
Error when requesting URL which contains emojis or certain characters

Open emilio-cea opened this issue 1 year ago • 6 comments

When performing a GET request to a URL that contains emojis, a redirect occurs whose Location header also contains emojis. From the stack trace, I believe redirect handling fails when the URL contains emojis or certain other characters, but further investigation could narrow this down.

This is the URL in question: https://www.nulled.to/topic/512174-income-ocean-�-hf-leak-�☀️/

It can be found on a forum page, where the source HTML contains these emojis and characters: https://www.nulled.to/forum/9-tutorials-guides-ebooks-etc/page-779?prune_day=100&sort_by=Z-A&sort_key=start_date&topicfilter=all

Note that because this is a Cloudflare-protected forum, the request can return a 403 error, in which case the error described below does not occur. That leads me to believe the error happens only when a redirect occurs: the Location header that requests tries to follow also contains emojis, and that is when the encoding error is raised.

Expected Result

Making the request to the site successfully and returning HTML source code.

Actual Result

An error was raised: UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 48-50: invalid continuation byte

This is the stacktrace:

File "workdir/env/lib/python3.7/site-packages/requests/api.py", line 76, in get
  return request('get', url, params=params, **kwargs)
File "workdir/env/lib/python3.7/site-packages/requests/api.py", line 61, in request
  return session.request(method=method, url=url, **kwargs)
File "workdir/env/lib/python3.7/site-packages/requests/sessions.py", line 542, in request
  resp = self.send(prep, **send_kwargs)
File "workdir/env/lib/python3.7/site-packages/requests/sessions.py", line 677, in send
  history = [resp for resp in gen]
File "workdir/env/lib/python3.7/site-packages/requests/sessions.py", line 677, in <listcomp>
  history = [resp for resp in gen]
File "workdir/env/lib/python3.7/site-packages/requests/sessions.py", line 150, in resolve_redirects
  url = self.get_redirect_target(resp)
File "workdir/env/lib/python3.7/site-packages/requests/sessions.py", line 116, in get_redirect_target
  return to_native_string(location, 'utf8')
File "workdir/env/lib/python3.7/site-packages/requests/_internal_utils.py", line 25, in to_native_string
  out = string.decode(encoding)

Reproduction Steps

import requests
url = "https://www.nulled.to/topic/512174-income-ocean-�-hf-leak-�☀️/"
r = requests.get(url)
print(r.content)

System Information

$ python -m requests.help
{
  "chardet": {
    "version": "4.0.0"
  },
  "cryptography": {
    "version": ""
  },
  "idna": {
    "version": "2.10"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.7.3"
  },
  "platform": {
    "release": "4.19.0-22-amd64",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "",
    "version": null
  },
  "requests": {
    "version": "2.25.1"
  },
  "system_ssl": {
    "version": "101010ef"
  },
  "urllib3": {
    "version": "1.26.3"
  },
  "using_pyopenssl": false
}

emilio-cea, May 10 '23 09:05

This is related to #3969. We try to decode the redirect URL as UTF-8, and that conversion from bytes to a string is what's failing.

I suspect there's something other than emoji in that URL.
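A minimal sketch of the failing step, assuming the Location header value is re-encoded to latin-1 bytes before the 'utf8' decode shown in the traceback. The helper name decode_location and both sample header values are made up for illustration; neither is the real header returned by the site.

```python
def decode_location(header_value: str) -> str:
    """Re-encode the header text as latin-1 bytes, then decode those bytes as UTF-8."""
    raw = header_value.encode("latin1")
    return raw.decode("utf8")


# Hypothetical header whose underlying bytes are valid UTF-8 (an emoji),
# seen the way a client that reads headers as latin-1 would expose it.
good = "/topic/\U0001F31F/".encode("utf8").decode("latin1")
print(decode_location(good))

# Hypothetical header containing a lone 0xE9 byte, which is not valid UTF-8.
bad = "/topic/caf\xe9/"
try:
    decode_location(bad)
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)
```

If this assumption holds, an emoji that was sent as proper UTF-8 bytes survives the round trip, and only a byte sequence that is not valid UTF-8 triggers the UnicodeDecodeError.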

sigmavirus24, May 10 '23 11:05

It seems to be the replacement character, U+FFFD REPLACEMENT CHARACTER.

emilio-cea, May 10 '23 11:05

I've also seen that there are a number of issues related to this. The best solution would be to find out which encoding the browser uses and replicate it: in Firefox, for instance, the URL is encoded with something other than UTF-8 and no redirect happens. Unfortunately, I have not been able to find out which encoding is being used.
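As a reference point for that comparison, here is a small sketch of what UTF-8 percent-encoding of a URL path produces, using only the standard library. The helper name percent_encode_path is hypothetical, and whether the result matches what a browser actually sends for the URL in this issue is not confirmed here.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def percent_encode_path(url: str) -> str:
    """Percent-encode non-ASCII characters in the path of a URL as UTF-8."""
    parts = urlsplit(url)
    # quote() encodes as UTF-8 by default; "/" and "%" are left untouched so
    # path separators and already-encoded sequences are not encoded again.
    encoded_path = quote(parts.path, safe="/%")
    return urlunsplit(parts._replace(path=encoded_path))


print(percent_encode_path("https://www.example.com/🌟emoji-example🌟"))
```

Comparing this output with the request line captured from the browser's network tools would show whether the browser is sending something other than UTF-8.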

emilio-cea, May 10 '23 11:05

We could fix this by maintaining a list of common encodings. Wrap the code responsible for the encoding in a try/except block, loop through every encoding in the list, and try each one on the given URL. The first one that works breaks the loop, and the code will be pretty much bug free.
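A rough sketch of that idea, not a patch for requests. The failing call in the traceback is a decode, so the loop below decodes bytes rather than encoding a string. The candidate list and the name decode_with_fallback are assumptions chosen for illustration.

```python
# Hypothetical candidate list; a real list would depend on the sites being targeted.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(raw: bytes, encodings=CANDIDATE_ENCODINGS) -> str:
    """Return raw decoded with the first encoding in the list that does not fail."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the value")


print(decode_with_fallback("/topic/🌟/".encode("utf-8")))
print(decode_with_fallback(b"/topic/caf\xe9/"))
```

Two caveats with this approach: latin-1 accepts every byte, so any encoding listed after it is never tried, and a decode that succeeds is not necessarily the encoding the server intended.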

harris-ahmad, Jun 04 '23 00:06

Well, which encodings are most common depends on what part of the world you're in. So we'd be looping for a while, which would drastically hurt performance.

sigmavirus24, Jun 04 '23 01:06

Try to reproduce the error with both the same URL and a different URL that contains emojis or certain characters; there seems to be an issue with the given URL specifically. I am able to get the content from a different URL containing emojis, using a specific encoding.

import requests
url = "https://www.example.com/🌟emoji-example🌟"
r = requests.get(url)
content = r.content.decode('ISO-8859-1')
print(content)

MozarM, Jun 22 '23 09:06