http-extensions RFC 6265bis: Set-Cookie parsing algorithm should enforce more of the syntax requirements

There is an ABNF syntax given for the Set-Cookie header in section 4.4.1, however many aspects of this are not enforced in the actual parsing algorithm in section 5.3.

The syntax given in 4.4.1 is only a "SHOULD", not a hard requirement for the server:

Servers SHOULD NOT send Set-Cookie headers that fail to conform to
   the following grammar:

There is an explicit caveat about this in section 5.3:

   NOTE: The algorithm below is more permissive than the grammar in
   Section 4.1.  For example, the algorithm strips leading and trailing
   whitespace from the cookie name and value (but maintains internal
   whitespace), whereas the grammar in Section 4.1 forbids whitespace in
   these positions.  User agents use this algorithm so as to
   interoperate with servers that do not follow the recommendations in
   Section 4.

The algorithm in 5.3 should enforce more of the requirements given in the recommended syntax, at the very least the requirements on validity of characters/octets (whitespace, control characters, etc.), and specify the intended behavior in these cases. (For context and examples of such discrepancies, see here and here.)

Feb 04 '21 22:02 chlily1

To call out some of the current differences between the ABNF section and the parsing algorithm (not an exhaustive list - just the ones I've happened to notice):

cookie-pair       = cookie-name BWS "=" BWS cookie-value
cookie-name       = 1*cookie-octet
cookie-value      = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )

cookie-name can be empty
cookie-pair doesn't have to include the '=' (Ex: cookie with empty name, non-empty cookie value)
Double quotes surrounding cookie-octet in cookie-value are not removed and will be a part of the cookie-value if present

cookie-octet      = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
                    / %x80-FF
                      ; octets excluding CTLs,
                      ; whitespace DQUOTE, comma, semicolon,
                      ; and backslash

The character set used by cookie-name is %x09 / %x20-3A / %x3C / %x3E-7E / %x80-FF (all octets except for non-whitespace CTLs, ;, and =)
The character set used by cookie-value is %x09 / %x20-3A / %x3C-7E / %x80-FF (all octets except for non-whitespace CTLs and ;)
Whitespace gets trimmed from the beginning and end

max-age-av        = "Max-Age" BWS "=" BWS non-zero-digit *DIGIT
                      ; In practice, both expires-av and max-age-av
                      ; are limited to dates representable by the
                      ; user agent.
non-zero-digit    = %x31-39
                      ; digits 1 through 9

The first digit can be 0 or a negative sign

domain-value      = <subdomain>
                      ; defined in [RFC1034], Section 3.5, as
                      ; enhanced by [RFC1123], Section 2.1

The RFCs for domains indicate that the domain name should not exceed 253 characters, and also indicates that labels (the pieces separated by dots in a domain) should not be more than 63 characters each. The parsing algorithm doesn't indicate that these restrictions must be enforced, but instead applies the 1024 octet limit for attribute values as an upper bound on the length of domain-value.

extension-av      = *av-octet
av-octet          = %x20-3A / %x3C-7E
                      ; any CHAR except CTLs or ";"

the av-octet character set is %x09 / %x20-3A / %x3C-7E / %x80-FF (all octets except for non-whitespace CTLs and ;)

Aug 16 '21 18:08 recvfrom

I've observed multiple instances where the grammar from 4.4.1 is copied into discussions/code related to cookie parsing and used as if it was the syntax that all cookies MUST adhere to. To mitigate this confusion, I think we should provide a grammar for section 5.3 either in addition to or in place of the one in 4.4.1.

Does anyone have any thoughts about whether there's still value in having the separate, stricter grammar for Set-Cookie headers sent by servers?

Sep 30 '21 20:09 recvfrom

Let's make sure anything changing here is well tested and matches reality.

Sep 30 '21 22:09 miketaylr

Just noticed another discrepancy in section 4.4.1:

 cookie-octet      = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
                       / %x80-FF
                         ; octets excluding CTLs,
                         ; whitespace DQUOTE, comma, semicolon,
                         ; and backslash

%x7F is also excluded but not mentioned in the comment: It's the question mark character ?

Mar 03 '22 14:03 kailuowang

I hadn't seen this issue before, but it's quite common to have strict requirements for producers and loose-but-still-strict requirements for consumers. E.g., CSS and HTML do something similar.

https://crbug.com/1174647 seems like a solid reason to make changes to the parser, but I think that's a separable concern from what this issue seems to raise as a larger issue, which I don't think is an issue at all.

Apr 29 '22 08:04 annevk

Stepping back, I see two aspects to this issue:

a. Whether there should be differences in the producer and consumer grammars, and b. If so, how to document them.

It seems like we've come to agreement here that there can be such differences. Not only do we see it in existing specs (e.g., Cookies so far, HTTP with productions like OWS, Structured Fields), it's inevitable - ABNF cannot capture all of the constraints that prose (or an algorithm) can express.

(b) is largely an editorial issue about how to capture the delta between the producer grammer and the parsing algorithm. As such, the editors should decide, taking into account input from the WG as they see fit.

Personally: I don't see how having two separate ABNF grammars would help -- it's just as likely to cause confusion, and there's a possilbity of introducing yet more errors. I think it'd be fine if the editors added a note to the top of 4.1.1 noting the relationship with 5.4, much as they already do in the other direction.

Aug 17 '22 00:08 mnot

As per above, marking as editorial.

Sep 01 '22 00:09 mnot