`get_encoding_from_headers` fails if charset name not specified

Open batterseapower opened this issue 1 year ago • 2 comments

requests.utils.get_encoding_from_headers assumes that the charset parameter always specifies a name. In very rare cases a server can send a malformed content-type header which does not specify a name. In these cases, requests should probably just treat it as if no charset had been specified.

Expected Result

requests.utils.get_encoding_from_headers({'content-type': 'text/html; charset'}) == 'ISO-8859-1'

Actual Result

File ~/opt/anaconda3/2023.03/envs/mamba/envs/py3/lib/python3.9/site-packages/requests/utils.py:553, in get_encoding_from_headers(headers)
    550 content_type, params = _parse_content_type_header(content_type)
    552 if "charset" in params:
--> 553     return params["charset"].strip("'\"")
    555 if "text" in content_type:
    556     return "ISO-8859-1"

AttributeError: 'bool' object has no attribute 'strip'
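
For illustration, a minimal sketch of the expected behaviour, assuming the fix is made on the caller side. The function name `encoding_from_params` is hypothetical and only stands in for the tail of `get_encoding_from_headers` shown in the traceback; it is not the actual requests code.

```python
def encoding_from_params(content_type, params):
    """Hypothetical stand-in for the end of get_encoding_from_headers."""
    charset = params.get("charset")
    # A valueless "charset" arrives here as True (see the traceback),
    # so only a non-empty string is accepted as a charset name.
    if isinstance(charset, str):
        name = charset.strip("'\"")
        if name:
            return name
    # Otherwise behave as if no charset had been specified.
    if "text" in content_type:
        return "ISO-8859-1"
    return None


print(encoding_from_params("text/html", {"charset": True}))
print(encoding_from_params("text/html", {"charset": "utf-8"}))
```

With this guard, a bare `charset` falls through to the same default as a header with no charset parameter at all.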

System Information

{
  "chardet": {
    "version": "4.0.0"
  },
  "charset_normalizer": {
    "version": "2.0.4"
  },
  "cryptography": {
    "version": "41.0.3"
  },
  "idna": {
    "version": "3.4"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.9.15"
  },
  "platform": {
    "release": "5.14.0-284.11.1.el9_2.x86_64",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "1010116f",
    "version": "23.2.0"
  },
  "requests": {
    "version": "2.31.0"
  },
  "system_ssl": {
    "version": "1010117f"
  },
  "urllib3": {
    "version": "1.26.18"
  },
  "using_charset_normalizer": false,
  "using_pyopenssl": true
}

batterseapower, Feb 22 '24 23:02

Hello @batterseapower, I just pushed a PR to fix this issue. It is my first PR in this project. Let's wait for a project maintainer to validate my fix.

Best Regards

alain-khalil, Mar 08 '24 17:03

I wonder if _parse_content_type_header should be changed so that it ignores parameters with no equals sign after them, or sets them to the empty string or None. Setting a parameter value to a bool is clearly wrong. src/requests/utils.py#L533
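
To make the three options concrete, here is a rough sketch of a split-on-";" parser. The name `parse_content_type_header`, the `valueless` argument and the `_SKIP` sentinel are all made up for illustration; this is not the code in requests, only the same general approach.

```python
_SKIP = object()  # hypothetical sentinel: drop parameters that have no "="


def parse_content_type_header(header, valueless=_SKIP):
    """Split a Content-Type header into (content_type, params)."""
    content_type, *raw_params = header.split(";")
    strip_chars = "\"' "
    params = {}
    for raw in raw_params:
        raw = raw.strip()
        if not raw:
            continue
        key, sep, value = raw.partition("=")
        key = key.strip(strip_chars).lower()
        if sep:
            # Normal case: name "=" value, with surrounding quotes stripped.
            params[key] = value.strip(strip_chars)
        elif valueless is not _SKIP:
            # Malformed parameter: store the caller-chosen placeholder.
            params[key] = valueless
        # else: ignore the malformed parameter entirely.
    return content_type.strip(), params


print(parse_content_type_header("text/html; charset"))
print(parse_content_type_header("text/html; charset", valueless=""))
print(parse_content_type_header("text/html; charset", valueless=None))
```

Of the three, only ignoring the parameter lets the existing `if "charset" in params:` check fall through untouched. With the empty string, the caller would return `''` as the encoding, and with None it would still raise AttributeError, so both of those would need a matching change in get_encoding_from_headers.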

I have checked RFC 2045, RFC 2616, RFC 7231, RFC 9110 and they all define a parameter as essentially parameter = parameter-name "=" parameter-value, so a parameter with no equals character is technically invalid (I think?).

Comparing what some other implementations do: mimeparse (its implementation is taken directly from the deprecated/removed built-in cgi module, so it should match what the cgi module used to do):

>>> from mimeparse import parse_mime_type
>>> parse_mime_type("application/json; charset")
('application', 'json', {})

stdlib email.policy.EmailPolicy (tested using code from this SO answer):

>>> def parse_content_type(content_type):
...     from email.policy import EmailPolicy
...     header = EmailPolicy.header_factory('content-type', content_type)
...     return (header.content_type, dict(header.params))
...
>>> parse_content_type("application/json; charset")
('application/json', {'charset': ''})

stdlib email.message.Message (tested using code from this SO answer):

>>> from email.message import Message
>>>
>>> _CONTENT_TYPE = "content-type"
>>>
>>> def parse_content_type(content_type: str) -> tuple[str, dict[str,str]]:
...     email = Message()
...     email[_CONTENT_TYPE] = content_type
...     params = email.get_params()
...     # The first param is the mime-type the later ones are the attributes like "charset"
...     return params[0][0], dict(params[1:])
...
>>> parse_content_type("application/json; charset")
('application/json', {'charset': ''})

(Also, checking those implementations, you can see that they are more correct about quoted strings -- matching quotes/unquoting -- but requests' simpler version of just splitting on ";" and stripping any quote characters has been around for a long time and apparently not caused problems, so...)
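
A small sketch of the quoted-string difference, using only the stdlib. `naive_params` is a made-up helper that imitates the split-and-strip approach (it is not the requests function), and `stdlib_params` wraps email.message.Message the same way as the snippet above.

```python
from email.message import Message


def naive_params(header):
    """Made-up helper: split on ';' and strip quotes, ignoring bare names."""
    params = {}
    for raw in header.split(";")[1:]:
        key, sep, value = raw.strip().partition("=")
        if sep:
            params[key.strip().lower()] = value.strip("\"' ")
    return params


def stdlib_params(header):
    """Parse parameters with email.message.Message, which matches quotes."""
    msg = Message()
    msg["content-type"] = header
    # The first entry is the mime type itself; the rest are the parameters.
    return dict(msg.get_params()[1:])


header = 'text/plain; name="a;b.txt"'
print(naive_params(header))   # the quoted ';' splits the value in two
print(stdlib_params(header))  # the quoted value stays whole
```

For an ordinary header such as `text/html; charset="utf-8"` both helpers agree, which is probably why the simpler approach has held up in practice.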

x11x, Sep 01 '24 04:09