requests
`get_encoding_from_headers` fails if charset name not specified
`requests.utils.get_encoding_from_headers` assumes that the charset parameter always specifies a name. In very rare cases a server can send a malformed content-type header which does not specify a name. In these cases, requests should probably just treat it as if no charset had been specified.
Expected Result
requests.utils.get_encoding_from_headers({'content-type': 'text/html; charset'}) == 'ISO-8859-1'
Actual Result
File ~/opt/anaconda3/2023.03/envs/mamba/envs/py3/lib/python3.9/site-packages/requests/utils.py:553, in get_encoding_from_headers(headers)
550 content_type, params = _parse_content_type_header(content_type)
552 if "charset" in params:
--> 553 return params["charset"].strip("'\"")
555 if "text" in content_type:
556 return "ISO-8859-1"
AttributeError: 'bool' object has no attribute 'strip'
System Information
{
  "chardet": {
    "version": "4.0.0"
  },
  "charset_normalizer": {
    "version": "2.0.4"
  },
  "cryptography": {
    "version": "41.0.3"
  },
  "idna": {
    "version": "3.4"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.9.15"
  },
  "platform": {
    "release": "5.14.0-284.11.1.el9_2.x86_64",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "1010116f",
    "version": "23.2.0"
  },
  "requests": {
    "version": "2.31.0"
  },
  "system_ssl": {
    "version": "1010117f"
  },
  "urllib3": {
    "version": "1.26.18"
  },
  "using_charset_normalizer": false,
  "using_pyopenssl": true
}
Hello @batterseapower, I just pushed a PR to fix this issue. It is my first PR in this project. Let's wait for a project maintainer to validate my fix.
Best Regards
I wonder if `_parse_content_type_header` should be changed so that it ignores parameters with no equals sign after them, or sets them to the empty string or `None`. Setting parameter values to a `bool` is clearly wrong.
src/requests/utils.py#L533
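As a rough illustration of the "ignore parameters with no equals sign" option, here is a minimal sketch. The function name `parse_content_type_sketch` is hypothetical, and this is not the code in requests; it only assumes the simple approach of splitting on ";" and stripping quote characters.

```python
def parse_content_type_sketch(header: str) -> tuple[str, dict[str, str]]:
    """Split a Content-Type value into (media type, parameters).

    Hypothetical helper for illustration only, not requests' implementation.
    Parameters that lack "=" are skipped instead of being stored as True.
    """
    media_type, *raw_params = header.split(";")
    params: dict[str, str] = {}
    strip_chars = "\"' "
    for raw in raw_params:
        name, sep, value = raw.strip().partition("=")
        if not sep:
            # No "=" present: treat the parameter as malformed and ignore it.
            continue
        params[name.strip(strip_chars).lower()] = value.strip(strip_chars)
    return media_type.strip(), params
```

With this behaviour, `"charset" in params` is false for `text/html; charset`, so a caller would fall through to its default encoding rather than calling `.strip()` on a `bool`.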
I have checked RFC 2045, RFC 2616, RFC 7231, RFC 9110 and they all define a parameter as essentially `parameter = parameter-name "=" parameter-value`, so a parameter with no equals character is technically invalid (I think?).
Comparing what some other implementations do:
mimeparse (their implementation is taken directly from the deprecated/removed built-in `cgi` module, so it should match what the built-in `cgi` module used to do):
>>> from mimeparse import parse_mime_type
>>> parse_mime_type("application/json; charset")
('application', 'json', {})
stdlib `email.policy.EmailPolicy` (tested using code from this SO answer):
>>> def parse_content_type(content_type):
... from email.policy import EmailPolicy
... header = EmailPolicy.header_factory('content-type', content_type)
... return (header.content_type, dict(header.params))
...
>>> parse_content_type("application/json; charset")
('application/json', {'charset': ''})
stdlib `email.message.Message` (tested using code from this SO answer):
>>> from email.message import Message
>>>
>>> _CONTENT_TYPE = "content-type"
>>>
>>> def parse_content_type(content_type: str) -> tuple[str, dict[str,str]]:
... email = Message()
... email[_CONTENT_TYPE] = content_type
... params = email.get_params()
... # The first param is the mime-type the later ones are the attributes like "charset"
... return params[0][0], dict(params[1:])
...
>>> parse_content_type("application/json; charset")
('application/json', {'charset': ''})
(Also, checking those implementations, you can see that they are more correct about quoted strings -- matching quotes/unquoting -- but requests' simpler version of just splitting on ";" and stripping any quote characters has been around for a long time and apparently not caused problems, so...)
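For comparison, a sketch of how the "treat it as if no charset had been specified" behaviour could look when built on the stdlib `email.message.Message` parser shown above. The name `encoding_from_content_type` is hypothetical, and the fallback rule (ISO-8859-1 for `text/*`, otherwise `None`) is a simplified assumption based on the expected result in the report, not a copy of what requests does.

```python
from email.message import Message
from typing import Optional


def encoding_from_content_type(content_type: str) -> Optional[str]:
    """Return the declared charset, or a default for text types.

    Hypothetical helper for illustration; an empty charset value is
    treated the same as a missing one.
    """
    msg = Message()
    msg["content-type"] = content_type
    charset = msg.get_param("charset")
    if isinstance(charset, str) and charset:
        return charset
    # Assumed fallback: text/* defaults to ISO-8859-1, anything else to None.
    if msg.get_content_maintype() == "text":
        return "ISO-8859-1"
    return None
```

Because the stdlib parser reports a bare `charset` as an empty string, a plain truthiness check is enough to fall back to the default.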