The Content-Length header for string `data` counts Unicode characters in the string when it should count encoded bytes

A call like this:

response = requests.post("https://example.com", data="👍👎")

automatically sets the Content-Length header in the request to 2 when it should be 8.

I hit this issue while making a request with a JSON body to a service I own (running behind AWS API Gateway); the service complained that there was no closing brace } in the JSON body. I was passing the JSON body to Requests as a string in the data argument. It turns out that API Gateway ignores any body bytes beyond the Content-Length in the request. After turning up detailed logging on API Gateway, I could see the request headers and realized that the value in the Content-Length header didn't match the number of bytes in the body.
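The truncation described above can be reproduced without a network. This is a minimal sketch, not API Gateway's actual implementation: it assumes a receiver that keeps exactly Content-Length bytes and discards the rest, and the JSON body is a made-up example.

```python
import json

# Hypothetical JSON body containing one character that needs 4 bytes in UTF-8.
body = '{"reaction": "👍", "ok": true}'
encoded = body.encode("utf-8")

# The buggy header value counts characters; the wire carries bytes.
declared_length = len(body)
actual_length = len(encoded)

# Assumption: the receiver keeps only the first `declared_length` bytes.
received = encoded[:declared_length]

try:
    json.loads(received)
    parsed_ok = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    parsed_ok = False

print(declared_length, actual_length, received, parsed_ok)
```

Because the emoji accounts for three extra bytes, the last three bytes of the body (including the closing brace) fall outside the declared length, and the truncated body no longer parses.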

A quick workaround is to encode the string into bytes before passing it into Requests.

This produces a Content-Length header with the correct value of 8:

response = requests.post("https://example.com", data="👍👎".encode("utf-8"))

Expected Result

On a server receiving a POST from Requests, I expect the Content-Length header value to match the number of bytes in the body of the request. See RFC 9110.

Actual Result

In the specific case where the data argument passed to Requests is a string containing characters that encode to multiple bytes in UTF-8, the value in the Content-Length header is incorrect. Requests appears to be counting the number of Unicode characters in the string instead of the number of bytes that will be sent to the server.

Reproduction Steps

>>> import requests
>>> thumbs_up_down = "👍👎"
>>> len(thumbs_up_down)
2
>>> len(thumbs_up_down.encode())
8
>>> pending_request = requests.Request("POST", "https://example.com", data=thumbs_up_down)
>>> prepared_request = pending_request.prepare()
>>> prepared_request.headers
{'Content-Length': '2'}

I opened a pull request, #6587, that adds a failing unit test that demonstrates this problem.

System Information

$ python -m requests.help

{
  "chardet": {
    "version": null
  },
  "charset_normalizer": {
    "version": "3.3.2"
  },
  "cryptography": {
    "version": ""
  },
  "idna": {
    "version": "3.6"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.9.16"
  },
  "platform": {
    "release": "23.1.0",
    "system": "Darwin"
  },
  "pyOpenSSL": {
    "openssl_version": "",
    "version": null
  },
  "requests": {
    "version": "2.31.0"
  },
  "system_ssl": {
    "version": "1010111f"
  },
  "urllib3": {
    "version": "2.1.0"
  },
  "using_charset_normalizer": true,
  "using_pyopenssl": false
}

Nov 27 '23 22:11 bruceadams

I assume that it is incorrect to require the data to be encoded as UTF-8, so I will work on a fix that removes the need for this hack. @bruceadams

Nov 28 '23 18:11 goelbenj

I assume that it is incorrect to require the data to be encoded as UTF-8, so I will work on a fix that removes the need for this hack. @bruceadams

I do not understand what you are saying. What hack?

A Python string can contain Unicode characters. To send a Python string as the body of an HTTP request, the string needs to be encoded into bytes. UTF-8 is a common encoding (and I see signs of UTF-8 being assumed elsewhere in the Requests code). In the behavior I saw in the wild, Requests did, in fact, encode the request body as UTF-8.
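To illustrate why the length has to be taken after encoding, here is a small sketch using only the standard library. The sample string and the list of encodings are arbitrary choices for illustration, not anything Requests itself uses.

```python
# Arbitrary sample text: one 3-byte, two 1-byte, and one 4-byte character in UTF-8.
text = "€5 👍"

# Number of Unicode code points, which is what len() reports for a str.
char_count = len(text)

# Number of bytes the same text occupies under different encodings.
byte_counts = {
    encoding: len(text.encode(encoding))
    for encoding in ("utf-8", "utf-16-le", "utf-32-le")
}

print(char_count, byte_counts)
```

The character count stays the same while the byte count changes with the encoding, so only the encoded body can give a correct Content-Length.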

Nov 28 '23 18:11 bruceadams

Ah! Your pull request lines up with how I thought this might be properly addressed! Nice! (I just created a similar pull request #6589.)

Nov 28 '23 18:11 bruceadams

I do not understand what you are saying. What hack?

Ha, looks like we reached the same conclusion here. What I meant by the "hack" was requiring the user to encode their string data as UTF-8 for the Content-Length header to be correctly initialized.

Nov 28 '23 18:11 goelbenj

Can I fix this by downgrading to a previous version? I don't want to (and some users probably cannot) change the code to convert to bytes before passing it to the request.

Also, I don't really get your fixes: the body is converted to bytes at some point (there is a body_to_chunks in request.py), and that also seems to set the Content-Length header? But that is just a side note. I'm not familiar with the code, so just ignore this if I'm talking nonsense.

Nov 29 '23 12:11 numblr

Bytes are the language of the Internet regardless of whether you think that. Many things try to paper over that. The right thing is typically to send bytes whose encoding you already know, but barring that, we should always be dealing with bytes internally. Now that we dropped 2.7 support, I'd support always encoding data parameters that are strs to bytes before doing anything else with them (e.g., calculating content length) internally.
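The idea of encoding first and measuring afterwards could look roughly like the sketch below. The helper names `to_body_bytes` and `content_length` are hypothetical and are not part of Requests, and the UTF-8 default is an assumption based on the behavior reported earlier in this thread.

```python
from typing import Union


def to_body_bytes(data: Union[str, bytes], encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: return the request body as bytes."""
    if isinstance(data, str):
        # Assumption: str bodies are sent as UTF-8 unless told otherwise.
        return data.encode(encoding)
    return data


def content_length(data: Union[str, bytes]) -> str:
    """Header value computed from the encoded body, never from the str."""
    return str(len(to_body_bytes(data)))


from_str = content_length("👍👎")
from_bytes = content_length("👍👎".encode("utf-8"))
print(from_str, from_bytes)
```

With the body normalized up front, a str and its pre-encoded bytes produce the same header value, so callers would not need the manual `.encode("utf-8")` workaround.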

Nov 29 '23 13:11 sigmavirus24

Fixed by #6589

Feb 23 '24 01:02 bruceadams
