WebSocket headers encoding problem
Checklist
- [x] The bug is reproducible against the latest release and/or master.
- [x] There are no similar issues or pull requests to fix it yet.
Describe the bug
If a websocket connection includes a non-ASCII character in a header, the server will explode. The problem seems to come from here: https://github.com/encode/uvicorn/blob/9dc5a43209fe081ba3e74135189252bfddf75587/uvicorn/protocols/websockets/websockets_impl.py#L102 I also checked the ASGI doc, and section 2.2.5 states:
- These are byte strings of the exact byte sequences sent by the client/to be sent by the server. While modern HTTP standards say that headers should be ASCII, older ones did not and allowed a wider range of characters. Frameworks/applications should decode headers as they deem appropriate.
So I think uvicorn should allow non-ASCII characters in headers. This is currently the only reason we can't use uvicorn in production.
To reproduce
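A minimal sketch of the failing step, no running server required (this assumes the header value reaches uvicorn surrogate-escaped, as the websockets library produces it):

```python
# Hypothetical minimal reproduction: websockets decodes raw header
# bytes with surrogateescape, so a User-Agent containing stray
# non-ASCII bytes reaches uvicorn as a str with lone surrogates in it.
user_agent = b"NetType/MOB\xc4\xb0LE".decode("ascii", errors="surrogateescape")

# When the value is re-encoded for the ASGI scope, a strict encode
# raises -- this is the "explosion" described below:
try:
    user_agent.encode("ascii")
except UnicodeEncodeError as exc:
    print(type(exc).__name__)  # UnicodeEncodeError
```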
Expected behavior
Doesn't explode.
Actual behavior
Explodes with an exception.
Debugging material
Environment
- OS / Python / Uvicorn version: just run `uvicorn --version`
- The exact command you're running uvicorn with, all flags you passed included. If you run it with gunicorn, please do the same. If there is a reverse proxy involved and you cannot reproduce without it, please give the minimal config needed to reproduce.
Additional context
The RFC specifically states non-ASCII is forbidden: https://tools.ietf.org/html/rfc7230#section-1.2
Not sure what we should do here. I briefly looked at hypercorn and they seem to also use ASCII. Curious what others think about this.
A lot of tools and frameworks put arbitrary data in headers. For example, in our setup our 2FA service provider puts the username in a header, and you can guess what kind of characters can appear in usernames. uvicorn not being able to handle that means we cannot use it in real-world scenarios.

I'm not saying uvicorn should stop using ASCII as the default charset there. I'm saying it should at least not explode. Maybe catch the exception and do something else? Or provide an option, a hook, or some other way to customize the behaviour. Imagine you want to deploy your application behind a proxy or a load balancer that doesn't respect ASCII encoding. Would you say it's OK for uvicorn to not be able to handle that? If you are open to it, we could try to come up with a PR, but I need a suggestion from the team on which kind of solution you would prefer.
Just played 5 min with this; it seems this is only an issue with the websockets implementation. I could send a header with non-ASCII just fine with wsproto, which is surprising as they tend to be more strict, so you might use that in the meantime.
When you say explode, you mean getting a 500, right?
Encoding this to UTF-8 should be OK. I'd love input from others, as I'm not sure about the unintended consequences of this. Any ideas, @florimondmanca?
@euri10 This is the same problem as non-conformant HTTP headers. I'd suggest taking a look at what tactics we adopt in HTTPX. Essentially anything non-ASCII is off-spec, and you're not guaranteed any particular behaviour. Possible tactics include:
- Strictly decode as ascii, loud fail on anything else.
- Strictly decode as some single-byte encoding that's a superset of ascii.
- Decode as UTF-8 first, then fallback to a single-byte encoding that's a superset of ascii.
I think we use the last one of those in HTTPX?
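For illustration, that last tactic can be sketched like this (hypothetical helper, not HTTPX's actual code; Latin-1 stands in here as the single-byte superset of ASCII):

```python
def decode_header(value: bytes) -> str:
    """Try UTF-8 first; fall back to Latin-1, a single-byte
    superset of ASCII that can decode any byte sequence."""
    try:
        return value.decode("utf-8")
    except UnicodeDecodeError:
        return value.decode("latin-1")

print(decode_header(b"text/html"))    # plain ASCII: 'text/html'
print(decode_header(b"caf\xc3\xa9"))  # valid UTF-8: 'café'
print(decode_header(b"caf\xe9"))      # invalid UTF-8, Latin-1 fallback: 'café'
```

Because the Latin-1 fallback can decode any byte, this strategy never raises, at the cost of sometimes producing mojibake for bytes that were neither UTF-8 nor Latin-1.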
As the websockets package represents non-ASCII characters in headers using surrogate escapes, what do you think about this PR: https://github.com/encode/uvicorn/pull/1005?
Before anything else I'd like a review/reminder onto what we do wrt. headers in HTTPX. I recall putting a whole bunch of careful thinking into things there, and it's the same case here, so we probably want to follow the same path.
@tomchristie Is this the approach you mentioned as "Decode as UTF-8 first, then fall back to a single-byte encoding that's a superset of ascii."?
https://github.com/encode/httpx/blob/776dbb578b1fd69f84c5bf75a80cd74dc823a41c/httpx/_models.py#L855-L878
@Kludex @tomchristie
Greetings!
Have you decided anything on this?
The issue is still present in 0.18.3.
@lavandosovich - I'd be happy to spend a little time thinking about this if we start from really basic first principles.
Can you give an example of an HTTP response that you'd like to be able to send, that you currently can't because of this limitation?
@tomchristie
The case is that the opening handshake fails due to a non-ASCII User-Agent, which raises UnicodeEncodeError: 'ascii' codec can't encode characters in position 205-206: ordinal not in range(128). That's what has been described here as the server exploding.
Frankly speaking, I'm just eager to be able to complete the handshake. That said, I can't give you a specific HTTP response, but I can give you a specimen of a User-Agent that makes things dirty:
'Mozilla/5.0 (Linux; Android 12; SM-A515F Build/SP1A.210812.016; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/107.0.5304.141 Mobile Safari/537.36 JsSdk/1.0 NetType/MOB\udcc4\udcb0LE Channel/googleplay'
How are you expecting the unicode string "Mozilla/5.0 (Linux; Android 12; SM-A515F Build/SP1A.210812.016; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/107.0.5304.141 Mobile Safari/537.36 JsSdk/1.0 NetType/MOB\udcc4\udcb0LE Channel/googleplay" to be represented as bytes-on-the-wire?
This is useful...
https://stackoverflow.com/questions/4400678/what-character-encoding-should-i-use-for-a-http-header
Particularly the pointer to the HTTPBIS spec (which is where some of the original HTTP designers try to clear up awkward edges)...
https://www.rfc-editor.org/rfc/rfc7230#section-3.2.4
A recipient MUST parse an HTTP message as a sequence of octets in an encoding that is a superset of US-ASCII [USASCII]. Parsing an HTTP message as a stream of Unicode characters, without regard for the specific encoding, creates security vulnerabilities due to the varying ways that string processing libraries handle invalid multibyte character sequences that contain the octet LF (%x0A). String-based parsers can only be safely used within protocol elements after the element has been extracted from the message, such as within a header field-value after message parsing has delineated the individual fields.
Sooooooooo...
Encoding as utf-8 seems like it could potentially be problematic. I can certainly see that it might unintentionally introduce a newline byte, and why there could be a security risk there.
RFC 2616's "Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 [22] only when encoded according to the rules of RFC 2047" is spec-compliant, but I'd expect it to be useless in practise, because the MIME encoding is just weird and awkward, and probably not widely supported at all.
From a look at gunicorn I think most WSGI frameworks and servers enforce latin-1 for headers and will hard fail otherwise...
Source code: https://github.com/javabrett/gunicorn/blob/60d0474a6f5604597180f435a6a03b016783885b/gunicorn/http/wsgi.py#L322
So... other python web servers would break on this behaviour too.
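As a sketch of the asymmetry those servers rely on: Latin-1 decoding never fails, but Latin-1 (or ASCII) encoding of an arbitrary str can:

```python
# Latin-1 maps code points U+0000-U+00FF one-to-one onto bytes,
# so decoding raw header bytes never fails:
raw = b"Erg\xe4nzendes Update"
print(raw.decode("latin-1"))  # 'Ergänzendes Update'

# But encoding hard-fails for any character above U+00FF, which is
# the hard failure described above:
try:
    "MOB\u0130LE".encode("latin-1")  # U+0130, LATIN CAPITAL LETTER I WITH DOT ABOVE
except UnicodeEncodeError as exc:
    print("UnicodeEncodeError:", exc.reason)
```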
Which would leave us with two choices...
- Continue to break hard. This is on you. Stop trying to send broken header values.
- Use `.encode("ascii", errors=...)` with either `"ignore"`, `"replace"`, `"backslashreplace"`, or `"surrogateescape"`.
https://docs.python.org/3/library/codecs.html#codec-base-classes
ignore (not reversible)
b'Mozilla/5.0 (Linux; Android 12; SM-A515F Build/SP1A.210812.016; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/107.0.5304.141 Mobile Safari/537.36 JsSdk/1.0 NetType/MOBLE Channel/googleplay'
replace (not reversible)
b'Mozilla/5.0 (Linux; Android 12; SM-A515F Build/SP1A.210812.016; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/107.0.5304.141 Mobile Safari/537.36 JsSdk/1.0 NetType/MOB??LE Channel/googleplay'
backslashreplace (reversible)
b'Mozilla/5.0 (Linux; Android 12; SM-A515F Build/SP1A.210812.016; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/107.0.5304.141 Mobile Safari/537.36 JsSdk/1.0 NetType/MOB\\udcc4\\udcb0LE Channel/googleplay'
surrogateescape (reversible)
b'Mozilla/5.0 (Linux; Android 12; SM-A515F Build/SP1A.210812.016; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/107.0.5304.141 Mobile Safari/537.36 JsSdk/1.0 NetType/MOB\xc4\xb0LE Channel/googleplay'
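Those four outputs can be reproduced directly; here with a shortened form of the same surrogate-escaped value:

```python
value = "NetType/MOB\udcc4\udcb0LE"  # lone surrogates, as websockets decodes them

print(value.encode("ascii", errors="ignore"))           # b'NetType/MOBLE'
print(value.encode("ascii", errors="replace"))          # b'NetType/MOB??LE'
print(value.encode("ascii", errors="backslashreplace")) # b'NetType/MOB\\udcc4\\udcb0LE'
print(value.encode("ascii", errors="surrogateescape"))  # b'NetType/MOB\xc4\xb0LE'
```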
@tomchristie since websockets is decoding using surrogateescape: https://github.com/aaugustin/websockets/blob/3401751d4f1200d398c5e69f9c4a0ece862ff36b/src/websockets/legacy/http.py#L169
We can encode with surrogateescape, and reopen https://github.com/encode/uvicorn/pull/1005. Wdyt?
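A sketch of why that pairing is safe: decoding and encoding with surrogateescape are exact inverses, so the original wire bytes come back unchanged:

```python
raw = b"NetType/MOB\xc4\xb0LE"  # raw header bytes containing non-ASCII octets

# websockets' decode step maps each byte >= 0x80 to a lone
# surrogate in the range U+DC80-U+DCFF:
text = raw.decode("ascii", errors="surrogateescape")

# Encoding with surrogateescape maps those surrogates back,
# recovering exactly the bytes the client sent:
assert text.encode("ascii", errors="surrogateescape") == raw
```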
I also faced this problem, using uvicorn behind an NGINX proxy. NGINX passes unicode headers through without errors.
In my case I sometimes get an Origin header with a domain in the local language, or a User-Agent header with unicode characters, which end up translated to '\uxxxx' escapes. Header example:
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5 (Erg\udce4nzendes Update)) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.1.1 Safari/605.1.15
The real-world problem is that I can do nothing about these errors. These headers are formed by user browsers and handled by the websocket implementation before I can deal with them in FastAPI code, so I cannot raise a proper validation error or anything else...
I agree with the thread's author that there should be some kind of callback/hook to help developers process these errors and return a proper API response.
In my case, I replaced the User-Agent header (the header containing the UTF-8 characters) with an empty string ("") in the Nginx configuration. The issue was related to an Android application name consisting of three Persian words.
Config to do this:

```
proxy_set_header User-Agent "";
```
Yes, this can be done, though User-Agent is not the only header with this problem, and we cannot replace them all. Some of these headers are useful for the backend and cannot simply be removed.