http-extensions
http-extensions copied to clipboard
[rfc6265bis] Cookie parser - UTF-8 chars
Following the RFC6265bis section 4.1.1 a cookie-value can contain only ASCII chars:
cookie-name = token
cookie-value = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
cookie-octet = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
; US-ASCII characters excluding CTLs,
; whitespace DQUOTE, comma, semicolon,
; and backslash
Modern browsers do allow UTF-8 chars too and we have web-platform-tests to enforce this behavior. I would suggest extending the cookie-octet with the range %x80-FF.
Thanks for raising this, I think this is basically https://github.com/whatwg/html/issues/804 again, for which we had #159, but that didn't cover everything that was identified there.
cc @inikulin
I think this is basically whatwg/html#804 again, for which we had #159, but that didn't cover everything that was identified there.
Correct. Though, at the time of testing only FF and Chrome supported UTF-8 properly, so it wasn't considered as a de-facto standard.
Non-ASCII cookies are certainly produced by servers out there in the wild, and accepted by several user agents (though not, apparently, Safari: https://wpt.fyi/results/cookies/http-state/charset-tests.html?label=experimental&label=master&aligned). I'm comfortable extending based on those test results.
(That said, I don't think it's as simple as extending cookie-octet
. We'll need to define the encoding, serialization, etc. in a little more detail if we're going to move outside of ASCII.)
Yeah, at the lowest level you're looking at two parsers, one that takes bytes and one that takes code points (potentially scalar values if we can make document.cookie
a USVString
). Which one feeds into which and how requires investigation.
Just a random comment - I would buy both of you some drinks if you could figure out a way to express the [Set-]Cookie header data model in terms of structured ~~headers~~ fields, and the define parsing and serialisation algorithms to get there...
It seems efaa468e6af574e88152cc365deb32e262444844 (marked editorial) changed this to include 0x80-0xFF and that matches Chrome in that it does bytes-in-bytes-out. Firefox tries to decode to UTF-8 (so 0xFF becomes U+FFFD and three bytes in the Cookie header rather than staying 0xFF). Safari seems to truncate at the first non-ASCII byte which strikes me as pretty bad behavior?
The tests mentioned above should account for invalid UTF-8 as it seems they currently do not and therefore do not capture this difference between Chrome and Firefox. I suspect we should align Firefox with Chrome here unless there is interest to do the opposite?
(Firefox has seen at least one compatibility problem from not having the Chrome behavior here: https://bugzilla.mozilla.org/show_bug.cgi?id=1664702.)
On it, more soon
Re-confirming that I'm working on this
@martinthomson wonders how IDN names should be encoded in domain attribute (maybe via another issue).
Why was this deferred? This is an issue that needs addressing.
This was discussed during the Sept 2021 Interim meeting. The decision that, while this is important, it's a lot of work and 6265bis shouldn't block on it.
https://github.com/httpwg/wg-materials/blob/gh-pages/interim-21-09/minutes.md#issue-1073---utf-8-characters
That discussion made the assumption that the current byte range is 0x00 to 0x7F, inclusive, right? But that's not what the draft specifies as noted in https://github.com/httpwg/http-extensions/issues/1073#issuecomment-710012476 (nor does it match the user agent requirements). I.e., a non-decision here endorses Chrome's model which is that cookies values are a byte sequence with some bytes prohibited. And in turn that means that some cookies end up not being representable in APIs (some of which are Chrome-only...).
To be clear, I would be okay with that, but let's at least make it clear that's what's happening here and not pretend this can be postponed.
Good find. As an observer I would say that changing the actual cookie ABNF to allow more values is not editorial, and it seems the change slipped into an entirely unrelated commit. (In particular the comment in the ABNF is now inconsistent with the actual ABNF).
@mikewest was adding the expanded octet range intentional in efaa468?
Ping @mikewest ...
No, that's clearly me screwing something up between multiple changes that I probably had in-flight at the time. Let's revert that bit.
We still have to have the conversation, but we should have it from the right baseline.
https://github.com/httpwg/http-extensions/pull/2090
While that PR removes a contradiction in the server requirements section, the user agent requirements still reflect the byte sequence model (and I think the tests do as well). I think that means my prior comment still stands and this should not be deferred.
Here to add that, even with that patch, there are characters that user agents allow in cookies that the draft RFC still says are incorrect, such as commas and backslashes. Section 5 does seem to be correct, the issue remains only with section 4.
Also note that Golang's behavior is misaligned both with what Chromium (and other user agents) allow as well as with what the draft RFC allows.