requests
The Content-Length header for string `data` counts Unicode characters in the string when it should count encoded bytes
A call like this:
response = requests.post("https://example.com", data="👍👎")
automatically sets the Content-Length header in the request to 2 when it should be 8.
I hit this issue while making a request with a JSON body to a service I own (running behind AWS API Gateway): the service complained that there was no closing brace } in the JSON body. I was passing the JSON body into Requests as a string via the data argument. It turns out that API Gateway ignores any body bytes beyond the Content-Length in the request. After turning up detailed logging on API Gateway, I could see the request headers and realized that the value in the Content-Length header didn't match the number of bytes in the body.
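The truncation described above can be reproduced without any network traffic. The following is a minimal sketch that assumes a receiver which keeps only the first Content-Length bytes of the body, as reported for API Gateway. The payload and the helper name `parses_as_json` are made up for illustration.

```python
import json

# Hypothetical JSON payload containing multi-byte characters.
body_text = '{"msg": "👍👎"}'

# What actually goes over the wire: the UTF-8 encoding of the string.
body_bytes = body_text.encode("utf-8")

# The buggy header value counts characters; the correct one counts bytes.
wrong_length = len(body_text)
right_length = len(body_bytes)

# A receiver that trusts Content-Length keeps only that many bytes.
received = body_bytes[:wrong_length]


def parses_as_json(raw: bytes) -> bool:
    """Return True if raw decodes as UTF-8 and parses as JSON."""
    try:
        json.loads(raw.decode("utf-8"))
        return True
    except (UnicodeDecodeError, json.JSONDecodeError):
        return False


print(wrong_length, right_length)
print(parses_as_json(body_bytes), parses_as_json(received))
```

With this payload the truncated body ends partway through the string value, so the closing quote and brace never arrive and parsing fails.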
A quick workaround is to encode the string into bytes before passing it into Requests.
This produces a Content-Length header with the correct value of 8:
response = requests.post("https://example.com", data="👍👎".encode("utf-8"))
Expected Result
On a server receiving a POST from Requests, I expect the Content-Length header value to match the number of bytes in the body of the request. See RFC 9110.
Actual Result
In the specific case where Requests' data argument is set to a string containing characters that encode to multiple bytes in UTF-8, the value in the Content-Length header is incorrect. Requests appears to be counting the number of Unicode characters in the string instead of the number of bytes that will be sent to the server.
Reproduction Steps
>>> import requests
>>> thumbs_up_down = "👍👎"
>>> len(thumbs_up_down)
2
>>> len(thumbs_up_down.encode())
8
>>> pending_request = requests.Request("POST", "https://example.com", data=thumbs_up_down)
>>> prepared_request = pending_request.prepare()
>>> prepared_request.headers
{'Content-Length': '2'}
I opened a pull request, #6587, that adds a failing unit test that demonstrates this problem.
System Information
$ python -m requests.help
{
  "chardet": {
    "version": null
  },
  "charset_normalizer": {
    "version": "3.3.2"
  },
  "cryptography": {
    "version": ""
  },
  "idna": {
    "version": "3.6"
  },
  "implementation": {
    "name": "CPython",
    "version": "3.9.16"
  },
  "platform": {
    "release": "23.1.0",
    "system": "Darwin"
  },
  "pyOpenSSL": {
    "openssl_version": "",
    "version": null
  },
  "requests": {
    "version": "2.31.0"
  },
  "system_ssl": {
    "version": "1010111f"
  },
  "urllib3": {
    "version": "2.1.0"
  },
  "using_charset_normalizer": true,
  "using_pyopenssl": false
}
I assume that it is incorrect to require the data to be encoded as UTF-8, so I will work on a fix that removes the need for this hack. @bruceadams
I do not understand what you are saying. What hack?
A Python string can contain Unicode characters. To send a Python string as the body of an HTTP request, the string needs to be encoded into bytes. UTF-8 is a common encoding (and I see signs of UTF-8 being assumed elsewhere in the Requests code). In the behavior I saw in the wild, Requests did, in fact, encode the request body as UTF-8.
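As a small illustration of why the length can only be known after encoding, the sketch below counts the bytes the same string produces under a few encodings. The sample string and the function name `byte_lengths` are made up for this example.

```python
def byte_lengths(text: str) -> dict:
    """Map each encoding name to the number of bytes text encodes to."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in encodings}


# Three characters: "a", "é" (U+00E9), and the thumbs-up emoji (U+1F44D).
sample = "a\u00e9\U0001F44D"

print(len(sample))
print(byte_lengths(sample))
```

The character count stays the same while the byte count changes with the encoding, so a Content-Length taken from `len()` of the string cannot be right in general.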
Ah! Your pull request lines up with how I thought this might be properly addressed! Nice! (I just created a similar pull request #6589.)
Ha, looks like we reached the same conclusion here. By "hack" I meant requiring the user to encode their string data as UTF-8 for the Content-Length header to be set correctly.
Can I fix this by downgrading to a previous version? I don't want to (and some users probably cannot) change the code to convert to bytes before passing it to the request.
Also, I don't really get your fixes: the body is at some point converted to bytes (there is a body_to_chunks in request.py), and that also seems to set the Content-Length header? But that is just a side note. I'm not familiar with the code, so just ignore it if I'm talking nonsense.
Bytes are the language of the Internet, whether or not you think of it that way. Many things try to paper over that. The right thing is typically to send bytes whose encoding you know; barring that, we should always be dealing with bytes internally. Now that we have dropped 2.7 support, I'd support always encoding data parameters that are strs to bytes internally before doing anything else with them (e.g., calculating the content length).
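The encode-first idea can be sketched as follows. This is an illustration only, not the code from #6589 or anything inside Requests: the helper names `encode_body` and `content_length` are hypothetical, and UTF-8 is assumed as the default encoding.

```python
from typing import Union


def encode_body(data: Union[str, bytes], encoding: str = "utf-8") -> bytes:
    """Return the body as bytes, encoding str input before anything else."""
    if isinstance(data, str):
        return data.encode(encoding)
    return data


def content_length(data: Union[str, bytes]) -> str:
    """Compute the header value from the encoded body, never from the str."""
    body = encode_body(data)
    return str(len(body))


print(content_length("👍👎"))
print(content_length("👍👎".encode("utf-8")))
```

Because the length is always taken from the bytes, passing a str and passing its pre-encoded bytes give the same header value.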
Fixed by #6589