Change headers to a dict that parses comma-separated values

Open Dreamsorcerer opened this issue 2 years ago • 5 comments

This is a proposal to change the headers from a CIMultiDict to a more regular dict (in v4). The problem with the multidict approach is that list headers (i.e. headers that can have multiple values) can have values combined in single headers and/or split over multiple headers.


Basically, these 2 payloads should be considered equivalent:

Foo: 1
Foo: 2

Foo: 1, 2

But currently, aiohttp produces the following in the first case:

headers["Foo"]  # "1"
headers.getall("Foo")  # ["1", "2"]

And, in the second case:

headers["Foo"]  # "1, 2"
headers.getall("Foo")  # ["1, 2"]

The spec recommends concatenating duplicate headers together with ", ". This is also what the vast majority of existing software does (including requests).

The only problem with this is that a user who wants the values as a list is left to parse the combined value themselves, which becomes quite complex and easy to get wrong on edge cases once quoted values are taken into account. So, in this proposal I've concatenated the values as recommended, but added a .getall() method which parses the final value to get the list.
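To illustrate why quoted values complicate the parsing, here is a minimal sketch of a quote-aware splitter. The function name `split_list_value` is hypothetical and this is not the code from the PR; it only assumes the general rule that commas inside a quoted string do not separate list elements.

```python
def split_list_value(value: str) -> list[str]:
    """Split a combined field value on commas that sit outside quoted strings."""
    items: list[str] = []
    current: list[str] = []
    in_quotes = False
    escaped = False
    for ch in value:
        if escaped:
            # The character after a backslash inside a quoted string is literal.
            current.append(ch)
            escaped = False
        elif in_quotes and ch == "\\":
            current.append(ch)
            escaped = True
        elif ch == '"':
            current.append(ch)
            in_quotes = not in_quotes
        elif ch == "," and not in_quotes:
            # An unquoted comma ends the current element.
            items.append("".join(current).strip(" \t"))
            current = []
        else:
            current.append(ch)
    items.append("".join(current).strip(" \t"))
    # Empty elements (for example from "a,,b") are dropped.
    return [item for item in items if item]


print(split_list_value("1, 2"))           # ['1', '2']
print(split_list_value('a="x, y", b'))    # ['a="x, y"', 'b']
```

A plain `value.split(",")` would break the second example into three pieces, which is the kind of edge case mentioned above.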

With the code in this PR, both of the previous payloads produce the same output:

headers["Foo"]  # "1, 2"
headers.getall("Foo")  # ["1", "2"]
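A rough sketch of that behaviour, using only the standard library. The names `combine_field_lines` and `getall` here are hypothetical stand-ins and not the PR's implementation, and the splitting is deliberately naive (it ignores quoted strings) to keep the example short.

```python
def combine_field_lines(lines: list[tuple[str, str]]) -> dict[str, str]:
    """Join repeated field lines with ", " under a lowercased field name."""
    combined: dict[str, str] = {}
    for name, value in lines:
        key = name.lower()  # field names are case-insensitive
        if key in combined:
            combined[key] += ", " + value
        else:
            combined[key] = value
    return combined


def getall(headers: dict[str, str], name: str) -> list[str]:
    """Naive list parsing: split on every comma, ignoring quoted strings."""
    parts = headers[name.lower()].split(",")
    return [part.strip() for part in parts if part.strip()]


split_payload = combine_field_lines([("Foo", "1"), ("Foo", "2")])
joined_payload = combine_field_lines([("Foo", "1, 2")])

print(split_payload == joined_payload)   # True
print(split_payload["foo"])              # 1, 2
print(getall(split_payload, "Foo"))      # ['1', '2']
```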

From RFC 9110, Appendix B.2: "Field value" now refers to the value after multiple field lines are combined with commas -- by far the most common use.

From kenballus's testing:

Servers that join the duplicate headers by default: Apache httpd, Caddy, Gunicorn, H2O, IIS, Lighttpd, Nginx, Node.js, Puma.

Servers that accept the duplicate headers without joining them: aiohttp (currently), Boost::Beast, Mongoose, Tornado.

I've also seen that requests combines into a regular dict. The only other library I've seen that uses a multidict for this is Starlette.

Also: https://www.rfc-editor.org/rfc/rfc9110.html#name-recipient-requirements

Dreamsorcerer avatar Oct 08 '23 18:10 Dreamsorcerer

Some questions to consider here:

  1. Should the .getall() list be converted to lowercase when the values are case-insensitive (which I think is true for most, if not all)?
  2. If yes, should the list then also be de-duplicated?
  3. If yes, should content negotiation headers that can have quality values per RFC 9110 be parsed and assigned, e.g. by returning a dictionary instead of a list or tuple as {"<value>": <quality>, ...}?

I think all of these being done directly in the parsers is best for performance, and would make for easier and less error-prone usage.
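For concreteness, question 3 could look something like the sketch below. The helper name `parse_quality_list` is hypothetical, values are lowercased as in question 1, the comma split is naive (no quoted strings), and a missing `q` parameter is treated as a quality of 1.0. When a value is repeated, the last occurrence wins here, which is an arbitrary choice for the example.

```python
def parse_quality_list(value: str) -> dict[str, float]:
    """Map each list element to its quality value, defaulting to 1.0."""
    result: dict[str, float] = {}
    for element in value.split(","):
        element = element.strip()
        if not element:
            continue
        name, *params = [part.strip() for part in element.split(";")]
        quality = 1.0
        for param in params:
            key, _, raw = param.partition("=")
            if key.strip().lower() == "q":
                # float() raises ValueError on a malformed quality value.
                quality = float(raw.strip())
        result[name.lower()] = quality
    return result


print(parse_quality_list("gzip;q=0.8, br, identity;q=0"))
# {'gzip': 0.8, 'br': 1.0, 'identity': 0.0}
```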

steverep avatar Jan 31 '24 21:01 steverep

> Some questions to consider here:

If you've got any information in the specs to answer those questions, that'd be great to have.

Dreamsorcerer avatar Feb 02 '24 13:02 Dreamsorcerer

> If you've got any information in the specs to answer those questions, that'd be great to have.

Just by their nature, I think it's certainly safe to deduplicate the content negotiation fields defined in Section 12.5 of RFC 9110. However, AFAICT, the RFC has no guidance on what quality value to assign if the duplicates happen to disagree. It seems that whatever the server chooses in that edge case would be conformant.

Other list headers should not be deduplicated because duplicates can actually mean something. For example, Content-Encoding: "gzip, gzip" means the content was double compressed with gzip.
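A small sketch of the Content-Encoding case, using the standard `gzip` module. The helper `decode_body` is hypothetical; it relies on the codings being listed in the order they were applied, so they are undone in reverse.

```python
import gzip


def decode_body(body: bytes, content_encoding: str) -> bytes:
    """Undo each listed content coding, last-applied first."""
    codings = [c.strip().lower() for c in content_encoding.split(",") if c.strip()]
    for coding in reversed(codings):
        if coding == "gzip":
            body = gzip.decompress(body)
        elif coding != "identity":
            raise ValueError(f"unsupported content coding: {coding}")
    return body


# A body that really was compressed twice.
payload = gzip.compress(gzip.compress(b"hello"))

print(decode_body(payload, "gzip, gzip"))          # b'hello'
print(decode_body(payload, "gzip") == b"hello")    # False
```

With the list deduplicated to a single `gzip`, only one layer is removed and the result is still compressed data.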

steverep avatar Feb 03 '24 23:02 steverep

OK, now having time to think over this, I think it's all a level above what we should be doing here. I think we're just providing a list based on the definition of a general HTTP field. So, I feel the answer is no to all 3 questions. Maintaining logic for all the different kinds of headers seems out of scope to me (and if we did, it'd be through dedicated attributes, like cookies).

Dreamsorcerer avatar Feb 04 '24 00:02 Dreamsorcerer

> OK, now having time to think over this, I think it's all a level above what we should be doing here. I think we're just providing a list based on the definition of a general HTTP field. So, I feel the answer is no to all 3 questions. Maintaining logic for all the different kinds of headers seems out of scope to me (and if we did, it'd be through dedicated attributes, like cookies).

After reading the spec a bit more, I guess I agree with you on the first two, but parameters are actually generically defined in section 5.6.6. The syntax and case-insensitivity of parameter names is defined there (the content negotiation headers just happen to use "q" as the name). I think the parsers should provide a way to access them as a dictionary (maybe by returning a list subclass?).
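Something along these lines is what I have in mind, as a rough sketch only. The function name `parse_parameters` and the returned `(value, dict)` shape are assumptions for illustration, not an existing aiohttp API. It assumes one list element as input, with parameters of the form `name=token` or `name="quoted string"` after the first `;`.

```python
import re

_TOKEN = r"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"
_QUOTED = r'"(?:[^"\\]|\\.)*"'
# A parameter: ";", optional whitespace, then name=value.
_PARAM = re.compile(rf";[ \t]*({_TOKEN})=({_TOKEN}|{_QUOTED})")


def parse_parameters(element: str) -> tuple[str, dict[str, str]]:
    """Split one list element into its value and a dict of its parameters."""
    value, _, rest = element.partition(";")
    params: dict[str, str] = {}
    for match in _PARAM.finditer(";" + rest):
        name, raw = match.groups()
        if raw.startswith('"'):
            # Strip the surrounding quotes and undo backslash escapes.
            raw = re.sub(r"\\(.)", r"\1", raw[1:-1])
        # Parameter names are case-insensitive; values are left as-is.
        params[name.lower()] = raw
    return value.strip(), params


print(parse_parameters('text/html; Charset="utf-8"; q=0.8'))
# ('text/html', {'charset': 'utf-8', 'q': '0.8'})
```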

steverep avatar Feb 04 '24 05:02 steverep