http-extensions
http-extensions copied to clipboard
RFC 6265bis: Set-Cookie parsing algorithm should enforce more of the syntax requirements
There is an ABNF syntax given for the Set-Cookie header in section 4.4.1, however many aspects of this are not enforced in the actual parsing algorithm in section 5.3.
The syntax given in 4.4.1 is only a "SHOULD", not a hard requirement for the server:
Servers SHOULD NOT send Set-Cookie headers that fail to conform to
the following grammar:
There is an explicit caveat about this in section 5.3:
NOTE: The algorithm below is more permissive than the grammar in
Section 4.1. For example, the algorithm strips leading and trailing
whitespace from the cookie name and value (but maintains internal
whitespace), whereas the grammar in Section 4.1 forbids whitespace in
these positions. User agents use this algorithm so as to
interoperate with servers that do not follow the recommendations in
Section 4.
The algorithm in 5.3 should enforce more of the requirements given in the recommended syntax, at the very least the requirements on validity of characters/octets (whitespace, control characters, etc.), and specify the intended behavior in these cases. (For context and examples of such discrepancies, see here and here.)
To call out some of the current differences between the ABNF section and the parsing algorithm (not an exhaustive list - just the ones I've happened to notice):
cookie-pair = cookie-name BWS "=" BWS cookie-value
cookie-name = 1*cookie-octet
cookie-value = *cookie-octet / ( DQUOTE *cookie-octet DQUOTE )
cookie-namecan be emptycookie-pairdoesn't have to include the '=' (Ex: cookie with empty name, non-empty cookie value)- Double quotes surrounding
cookie-octetincookie-valueare not removed and will be a part of thecookie-valueif present
cookie-octet = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
/ %x80-FF
; octets excluding CTLs,
; whitespace DQUOTE, comma, semicolon,
; and backslash
- The character set used by
cookie-nameis%x09 / %x20-3A / %x3C / %x3E-7E / %x80-FF(all octets except for non-whitespace CTLs,;, and=) - The character set used by
cookie-valueis%x09 / %x20-3A / %x3C-7E / %x80-FF(all octets except for non-whitespace CTLs and;) - Whitespace gets trimmed from the beginning and end
max-age-av = "Max-Age" BWS "=" BWS non-zero-digit *DIGIT
; In practice, both expires-av and max-age-av
; are limited to dates representable by the
; user agent.
non-zero-digit = %x31-39
; digits 1 through 9
- The first digit can be 0 or a negative sign
domain-value = <subdomain>
; defined in [RFC1034], Section 3.5, as
; enhanced by [RFC1123], Section 2.1
- The RFCs for domains indicate that the domain name should not exceed 253 characters, and also indicates that labels (the pieces separated by dots in a domain) should not be more than 63 characters each. The parsing algorithm doesn't indicate that these restrictions must be enforced, but instead applies the 1024 octet limit for attribute values as an upper bound on the length of
domain-value.
extension-av = *av-octet
av-octet = %x20-3A / %x3C-7E
; any CHAR except CTLs or ";"
- the
av-octetcharacter set is%x09 / %x20-3A / %x3C-7E / %x80-FF(all octets except for non-whitespace CTLs and;)
I've observed multiple instances where the grammar from 4.4.1 is copied into discussions/code related to cookie parsing and used as if it was the syntax that all cookies MUST adhere to. To mitigate this confusion, I think we should provide a grammar for section 5.3 either in addition to or in place of the one in 4.4.1.
Does anyone have any thoughts about whether there's still value in having the separate, stricter grammar for Set-Cookie headers sent by servers?
Let's make sure anything changing here is well tested and matches reality.
Just noticed another discrepancy in section 4.4.1:
cookie-octet = %x21 / %x23-2B / %x2D-3A / %x3C-5B / %x5D-7E
/ %x80-FF
; octets excluding CTLs,
; whitespace DQUOTE, comma, semicolon,
; and backslash
%x7F is also excluded but not mentioned in the comment: It's the question mark character ?
I hadn't seen this issue before, but it's quite common to have strict requirements for producers and loose-but-still-strict requirements for consumers. E.g., CSS and HTML do something similar.
https://crbug.com/1174647 seems like a solid reason to make changes to the parser, but I think that's a separable concern from what this issue seems to raise as a larger issue, which I don't think is an issue at all.
Stepping back, I see two aspects to this issue:
a. Whether there should be differences in the producer and consumer grammars, and b. If so, how to document them.
It seems like we've come to agreement here that there can be such differences. Not only do we see it in existing specs (e.g., Cookies so far, HTTP with productions like OWS, Structured Fields), it's inevitable - ABNF cannot capture all of the constraints that prose (or an algorithm) can express.
(b) is largely an editorial issue about how to capture the delta between the producer grammer and the parsing algorithm. As such, the editors should decide, taking into account input from the WG as they see fit.
Personally: I don't see how having two separate ABNF grammars would help -- it's just as likely to cause confusion, and there's a possilbity of introducing yet more errors. I think it'd be fine if the editors added a note to the top of 4.1.1 noting the relationship with 5.4, much as they already do in the other direction.
As per above, marking as editorial.