requests `requests.utils._parse_content_type_header` gives the wrong value when multiple headers with the same name are present.

`requests.utils._parse_content_type_header` gives the wrong value when multiple headers with the same name are present.

Open iulian-birlica opened this issue 3 years ago • 2 comments

The http rfc (specifically section 4.2) allows multiple headers with the same name to be present, granted that they can be collapsed into a single header with comma separated values.

When encountered with a response containing these headers:

Content-Type: text/html; charset=UTF-8
Content-Type: text/javascript

requests parses them into a single dictionary value:

{'Content-Type': 'text/html; charset=UTF-8, text/javascript', ...}

but requests.utils._parse_content_type_header parses it wrong, by assuming that charset is UTF-8, text/javascript, instead of UTF-8

I expected requests to give the correct encoding, or at least one of the encodings from the two headers, preferably the first, UTF-8.

I get the encoding for the response set to UTF-8, text/javascript

Reproduction Steps

import requests

res = requests.get("url-that-responds-with-two-content-headers")
print(res.encoding)

System Information

$ python -m requests.help

{
  "chardet": {
    "version": "4.0.0"
  },
  "charset_normalizer": {
    "version": "2.0.12"
  },
  "cryptography": {
    "version": "36.0.2"
  },
  "idna": {
    "version": "3.3"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.9.2"
  },
  "platform": {
    "release": "4.9.0-16-amd64",
    "system": "Linux"
  },
  "pyOpenSSL": {
    "openssl_version": "101010ef",
    "version": "20.0.1"
  },
  "requests": {
    "version": "2.28.1"
  },
  "system_ssl": {
    "version": "101000cf"
  },
  "urllib3": {
    "version": "1.26.12"
  },
  "using_charset_normalizer": false,
  "using_pyopenssl": true
}

Seeing that converting the dict to a list would be much more trouble, my proposed fix is to ignore characters after the first comma. requests.utils._parse_content_type_header would start with tokens = header.split(',')[0].split(";").

I can make a PR, if this bug is deemed true and if that is the solution.

Sep 28 '22 22:09 iulian-birlica

A few things:

I don't believe that requests is responsible for collapsing the headers here. I think one of our dependencies needs to be more careful with folding.
Two Content-Type headers are not foldable if my memory serves correctly nor are the semantics of two Content-Type headers defined. Your assumption that utf8 is correct here is undefined behavior to the best of my knowledge. There's no clarity on this case

Content-Type: text/html; charset=cp-1252
Content-Type: text/plain; charset=utf-8

Which is the correct content type header? Those charsets are not compatible entirely. The server is sending mixed signals and shouldn't be sending two.

If we attempt to solve this, how do we handle N charsets from M Content-Type headers? Which one should be picked?

I agree currently it's not great but I don't think we should try to define behavior arbitrarily because that's going to break someone else. I would prefer we raise an exception if we can detect this case but that's slightly backwards incompatible as this might "Just Work" if the headers were reversed in the order it sends them.

Sep 29 '22 15:09 sigmavirus24

Perhaps a warning could be raised instead? This is both backwards compatible and the warning message can describe the sunsetting of this behaviour with the exception-raising behaviour being introduced in a future version?

Nov 26 '23 20:11 goelbenj

requests requests copied to clipboard

`requests.utils._parse_content_type_header` gives the wrong value when multiple headers with the same name are present.

Reproduction Steps

System Information

requests
requests copied to clipboard