hyper
hyper copied to clipboard
HTTPHeaderMap naively splits on commas
hyper.common.headers.canonical_form
:
the header is split on commas unless for any reason it's a super-special snowflake (I'm looking at you Set-Cookie)
This seems to be inspired by text from RFC 7230 § 3.2.2, but that permits joining field-values with commas, not splitting them.
This is most readily obvious with the Date
header:
import hyper
conn = hyper.HTTPConnection('nghttp2.org:443')
conn.request('GET', '/')
print(conn.get_response().headers['Date'])
# prints: [b'Sat', b'04 Mar 2017 18:23:26 GMT']
But there’s a wealth of other examples:
WWW-Authenticate: Digest realm="test", qop="auth", nonce="12345"
Cache-Control: private, no-cache="Some-Header,Other-Header"
Warning: 299 - "something may be wrong, be careful"
One could do b','.join(headers[name])
to get back to a value that is mostly correct with regard to commas (modulo Set-Cookie
and except for lost whitespace). But maybe the API would be better if HTTPHeaderMap.__getitem__
returned a correct value by default, while splitting on commas were an explicit operation.
See also Werkzeug’s header parsing utilities, though they are far from 100% correct as well.
Yeah, so we should definitely aim to do better here. Ideally, we'd adjust the API to only canonicalise headers we know that allow a list of values (which is what that section of the specification allows for). I'd be entirely happy to merge a change that restricts the canonicalisation to only the values we know can safely be canonicalised (which will cover all of the internal use-cases we need).
How does that sound?
That sounds OK, but there’s a further tradeoff to be made.
Ideally, we'd adjust the API to only canonicalise headers we know that allow a list of values (which is what that section of the specification allows for).
See, Cache-Control
and Warning
are lists of values alright (as in RFC 7230 § 7) — problem is that the values themselves can contain commas, mostly by way of the RFC 7230 quoted-string
.
canonical_form
could be made to take quoted-string
into account instead of naively splitting on comma, but that may involve more code and/or a performance hit: Werkzeug does this by invoking an undocumented state machine in urllib
.
Or the whitelist could be restricted to those headers that really don’t allow commas beyond the top-level list. But note that even something as harmless looking as Transfer-Encoding
can theoretically have a value of my-coding; param="foo, bar"
.
Or decide that it’s OK to break marginal things (like the above my-coding
) and simply document how to get the full value if necessary.
This is also a question of how much public API to break. You could omit Cache-Control
from the whitelist (there are servers in the wild that send things like the one I showed), but then it breaks for people who depend on headers['cache-control']
giving them nicely split directives.
Yeah, we'd probably have to handle the fact that quoted strings exist. As a halfway house we could probably initially only canonicalise when we don't have a quoted string at all, and when we do just leave it alone. How does that sound?
only canonicalise when we don't have a quoted string at all
Do you mean “when the given value does not contain quoted strings” or “when the header’s grammar does not allow quoted strings”?
When the given value does not contain quoted strings.
I guess that would work (in conjunction with a whitelist that excludes things like Date
or WWW-Authenticate
), but the API would be kinda wonky. I mean, Cache-Control: public, max-age=86400
gets split, but Cache-Control: public, max-age="86400"
doesn’t?..
(Please note that I don’t have a stake here. I only raised this issue because it caught my eye.)
@vfaronov Yeah, so it's definitely an annoyance, but that's the simplest approach to it that doesn't involve more aggressive checking. For example, a thing we could do is use a regular expression to try to detect "commas inside double quotes", but at that point we should just write a parser and I think by that time we're over-solving this problem. :grin:
any patch available for this one??? :grin:
Nope.
@persiaAziz you can always make some monkey-patch; for this case I used dict from requests lib:
from requests.structures import CaseInsensitiveDict
response = self.connection.get_response()
response.headers = CaseInsensitiveDict(response.headers.iter_raw())
CaseInsensitiveDict have the same features, but not splits by comas