hyper icon indicating copy to clipboard operation
hyper copied to clipboard

HTTPHeaderMap naively splits on commas

Open vfaronov opened this issue 7 years ago • 10 comments

hyper.common.headers.canonical_form:

the header is split on commas unless for any reason it's a super-special snowflake (I'm looking at you Set-Cookie)

This seems to be inspired by text from RFC 7230 § 3.2.2, but that permits joining field-values with commas, not splitting them.

This is most readily obvious with the Date header:

import hyper
conn = hyper.HTTPConnection('nghttp2.org:443')
conn.request('GET', '/')
print(conn.get_response().headers['Date'])
# prints: [b'Sat', b'04 Mar 2017 18:23:26 GMT']

But there’s a wealth of other examples:

WWW-Authenticate: Digest realm="test", qop="auth", nonce="12345"
Cache-Control: private, no-cache="Some-Header,Other-Header"
Warning: 299 - "something may be wrong, be careful"

One could do b','.join(headers[name]) to get back to a value that is mostly correct with regard to commas (modulo Set-Cookie and except for lost whitespace). But maybe the API would be better if HTTPHeaderMap.__getitem__ returned a correct value by default, while splitting on commas were an explicit operation.

See also Werkzeug’s header parsing utilities, though they are far from 100% correct as well.

vfaronov avatar Mar 04 '17 18:03 vfaronov

Yeah, so we should definitely aim to do better here. Ideally, we'd adjust the API to only canonicalise headers we know that allow a list of values (which is what that section of the specification allows for). I'd be entirely happy to merge a change that restricts the canonicalisation to only the values we know can safely be canonicalised (which will cover all of the internal use-cases we need).

How does that sound?

Lukasa avatar Mar 04 '17 20:03 Lukasa

That sounds OK, but there’s a further tradeoff to be made.

Ideally, we'd adjust the API to only canonicalise headers we know that allow a list of values (which is what that section of the specification allows for).

See, Cache-Control and Warning are lists of values alright (as in RFC 7230 § 7) — problem is that the values themselves can contain commas, mostly by way of the RFC 7230 quoted-string.

canonical_form could be made to take quoted-string into account instead of naively splitting on comma, but that may involve more code and/or a performance hit: Werkzeug does this by invoking an undocumented state machine in urllib.

Or the whitelist could be restricted to those headers that really don’t allow commas beyond the top-level list. But note that even something as harmless looking as Transfer-Encoding can theoretically have a value of my-coding; param="foo, bar".

Or decide that it’s OK to break marginal things (like the above my-coding) and simply document how to get the full value if necessary.

This is also a question of how much public API to break. You could omit Cache-Control from the whitelist (there are servers in the wild that send things like the one I showed), but then it breaks for people who depend on headers['cache-control'] giving them nicely split directives.

vfaronov avatar Mar 04 '17 21:03 vfaronov

Yeah, we'd probably have to handle the fact that quoted strings exist. As a halfway house we could probably initially only canonicalise when we don't have a quoted string at all, and when we do just leave it alone. How does that sound?

Lukasa avatar Mar 04 '17 21:03 Lukasa

only canonicalise when we don't have a quoted string at all

Do you mean “when the given value does not contain quoted strings” or “when the header’s grammar does not allow quoted strings”?

vfaronov avatar Mar 04 '17 21:03 vfaronov

When the given value does not contain quoted strings.

Lukasa avatar Mar 04 '17 22:03 Lukasa

I guess that would work (in conjunction with a whitelist that excludes things like Date or WWW-Authenticate), but the API would be kinda wonky. I mean, Cache-Control: public, max-age=86400 gets split, but Cache-Control: public, max-age="86400" doesn’t?..

(Please note that I don’t have a stake here. I only raised this issue because it caught my eye.)

vfaronov avatar Mar 04 '17 22:03 vfaronov

@vfaronov Yeah, so it's definitely an annoyance, but that's the simplest approach to it that doesn't involve more aggressive checking. For example, a thing we could do is use a regular expression to try to detect "commas inside double quotes", but at that point we should just write a parser and I think by that time we're over-solving this problem. :grin:

Lukasa avatar Mar 06 '17 09:03 Lukasa

any patch available for this one??? :grin:

persiaAziz-zz avatar Apr 10 '17 22:04 persiaAziz-zz

Nope.

Lukasa avatar Apr 11 '17 06:04 Lukasa

@persiaAziz you can always make some monkey-patch; for this case I used dict from requests lib:

from requests.structures import CaseInsensitiveDict

response = self.connection.get_response()
response.headers = CaseInsensitiveDict(response.headers.iter_raw())

CaseInsensitiveDict have the same features, but not splits by comas

ech0-py avatar May 30 '17 21:05 ech0-py