http-extensions icon indicating copy to clipboard operation
http-extensions copied to clipboard

[rfc6265bis] Cookie parser - UTF-8 chars

Open bakulf opened this issue 5 years ago • 20 comments

Following the RFC6265bis section 4.1.1 a cookie-value can contain only ASCII chars:

   cookie-name       = token
   cookie-value      = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
   cookie-octet      = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
                         ; US-ASCII characters excluding CTLs,
                         ; whitespace DQUOTE, comma, semicolon,
                         ; and backslash

Modern browsers do allow UTF-8 chars too and we have web-platform-tests to enforce this behavior. I would suggest extending the cookie-octet with the range %x80-FF.

bakulf avatar Feb 24 '20 15:02 bakulf

Thanks for raising this, I think this is basically https://github.com/whatwg/html/issues/804 again, for which we had #159, but that didn't cover everything that was identified there.

cc @inikulin

annevk avatar Feb 25 '20 16:02 annevk

I think this is basically whatwg/html#804 again, for which we had #159, but that didn't cover everything that was identified there.

Correct. Though, at the time of testing only FF and Chrome supported UTF-8 properly, so it wasn't considered as a de-facto standard.

inikulin avatar Mar 06 '20 15:03 inikulin

Non-ASCII cookies are certainly produced by servers out there in the wild, and accepted by several user agents (though not, apparently, Safari: https://wpt.fyi/results/cookies/http-state/charset-tests.html?label=experimental&label=master&aligned). I'm comfortable extending based on those test results.

mikewest avatar Mar 31 '20 15:03 mikewest

(That said, I don't think it's as simple as extending cookie-octet. We'll need to define the encoding, serialization, etc. in a little more detail if we're going to move outside of ASCII.)

mikewest avatar Mar 31 '20 16:03 mikewest

Yeah, at the lowest level you're looking at two parsers, one that takes bytes and one that takes code points (potentially scalar values if we can make document.cookie a USVString). Which one feeds into which and how requires investigation.

annevk avatar Mar 31 '20 16:03 annevk

Just a random comment - I would buy both of you some drinks if you could figure out a way to express the [Set-]Cookie header data model in terms of structured ~~headers~~ fields, and the define parsing and serialisation algorithms to get there...

mnot avatar Apr 01 '20 00:04 mnot

It seems efaa468e6af574e88152cc365deb32e262444844 (marked editorial) changed this to include 0x80-0xFF and that matches Chrome in that it does bytes-in-bytes-out. Firefox tries to decode to UTF-8 (so 0xFF becomes U+FFFD and three bytes in the Cookie header rather than staying 0xFF). Safari seems to truncate at the first non-ASCII byte which strikes me as pretty bad behavior?

The tests mentioned above should account for invalid UTF-8 as it seems they currently do not and therefore do not capture this difference between Chrome and Firefox. I suspect we should align Firefox with Chrome here unless there is interest to do the opposite?

(Firefox has seen at least one compatibility problem from not having the Chrome behavior here: https://bugzilla.mozilla.org/show_bug.cgi?id=1664702.)

annevk avatar Oct 16 '20 12:10 annevk

On it, more soon

arichiv avatar Sep 14 '21 19:09 arichiv

Re-confirming that I'm working on this

arichiv avatar Sep 30 '21 20:09 arichiv

@martinthomson wonders how IDN names should be encoded in domain attribute (maybe via another issue).

miketaylr avatar Sep 30 '21 22:09 miketaylr

Why was this deferred? This is an issue that needs addressing.

annevk avatar Apr 29 '22 08:04 annevk

This was discussed during the Sept 2021 Interim meeting. The decision that, while this is important, it's a lot of work and 6265bis shouldn't block on it.

https://github.com/httpwg/wg-materials/blob/gh-pages/interim-21-09/minutes.md#issue-1073---utf-8-characters

sbingler avatar Apr 29 '22 18:04 sbingler

That discussion made the assumption that the current byte range is 0x00 to 0x7F, inclusive, right? But that's not what the draft specifies as noted in https://github.com/httpwg/http-extensions/issues/1073#issuecomment-710012476 (nor does it match the user agent requirements). I.e., a non-decision here endorses Chrome's model which is that cookies values are a byte sequence with some bytes prohibited. And in turn that means that some cookies end up not being representable in APIs (some of which are Chrome-only...).

To be clear, I would be okay with that, but let's at least make it clear that's what's happening here and not pretend this can be postponed.

annevk avatar May 02 '22 09:05 annevk

Good find. As an observer I would say that changing the actual cookie ABNF to allow more values is not editorial, and it seems the change slipped into an entirely unrelated commit. (In particular the comment in the ABNF is now inconsistent with the actual ABNF).

reschke avatar May 02 '22 10:05 reschke

@mikewest was adding the expanded octet range intentional in efaa468?

sbingler avatar May 02 '22 22:05 sbingler

Ping @mikewest ...

reschke avatar May 04 '22 15:05 reschke

No, that's clearly me screwing something up between multiple changes that I probably had in-flight at the time. Let's revert that bit.

We still have to have the conversation, but we should have it from the right baseline.

mikewest avatar May 04 '22 15:05 mikewest

https://github.com/httpwg/http-extensions/pull/2090

mikewest avatar May 04 '22 15:05 mikewest

While that PR removes a contradiction in the server requirements section, the user agent requirements still reflect the byte sequence model (and I think the tests do as well). I think that means my prior comment still stands and this should not be deferred.

annevk avatar May 05 '22 07:05 annevk

Here to add that, even with that patch, there are characters that user agents allow in cookies that the draft RFC still says are incorrect, such as commas and backslashes. Section 5 does seem to be correct, the issue remains only with section 4.

Also note that Golang's behavior is misaligned both with what Chromium (and other user agents) allow as well as with what the draft RFC allows.

april avatar Oct 12 '22 22:10 april