Content-length validation does not handle spaces

Open chrisstaite-menlo opened this issue 2 years ago • 7 comments

Additional validation of Content-Length parsing was introduced in bf90f3a9ad150c45d9d17f189c3368e9a3b8e80f; however, the value is not stripped, so a value of '0 ' causes a ValueError.

This is a particular issue because, if Content-Length is the last header in a request and parse_line is being called directly, the \r\n that ends the headers is interpreted as a multi-line continuation, and HTTPHeaders.parse_line appends a space to the end of the value: new_part = " " + line.lstrip().
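
A minimal sketch of the failure, using a simplified stand-in for parse_line rather than Tornado's actual code. The continuation branch mirrors the new_part = " " + line.lstrip() expression quoted above; the names model_parse_line and strict_content_length, and the digits-only check, are assumptions made for illustration.

```python
import re


def model_parse_line(headers: dict, state: dict, line: str) -> None:
    """Simplified model of line-by-line header parsing (not Tornado's code)."""
    if line[0].isspace():
        # Continuation line: append to the previous header's value.
        # A bare "\r\n" starts with whitespace, so it lands here too.
        headers[state["last_key"]] += " " + line.lstrip()
    else:
        name, value = line.split(":", 1)
        headers[name] = value.strip()
        state["last_key"] = name


def strict_content_length(value: str) -> int:
    """Assumed strict validation: digits only, no surrounding whitespace."""
    if re.fullmatch(r"[0-9]+", value) is None:
        raise ValueError(f"invalid Content-Length: {value!r}")
    return int(value)


headers, state = {}, {"last_key": None}
# Raw lines as a header callback might deliver them, terminators included.
for raw in ["Content-Length: 0\r\n", "\r\n"]:
    model_parse_line(headers, state, raw)

print(repr(headers["Content-Length"]))  # '0 '
```

In this model, passing the stored value to strict_content_length raises ValueError, while the stripped value "0" parses normally.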

@bdarnell

chrisstaite-menlo avatar Sep 13 '23 10:09 chrisstaite-menlo

Can you say more about how exactly this happens? It's true that we don't strip the value when parsing content-length, but it's supposed to already be stripped in the last line of HTTPHeaders.parse_line.

The \r\n is not supposed to make it to parse_line; those characters are handled in parse(). I don't see an issue when Content-Length is the last header: we have a test for this case at https://github.com/tornadoweb/tornado/blob/a48d63446afdb384fa0f53ac707dea9d250dc183/tornado/httputil.py#L188.

I do see a couple of potential issues in edge cases, though.

  • Content-Length: 42\r\n \r\n (with a space between the CRLF pairs) will add a space to the value "42 "
  • Content-Length:\r\n 42\r\n (with the whole value in a continuation line) adds a leading space, " 42"

Both of these cases are errors now although they were accepted prior to bf90f3a. I think they're both technically legal although I'd have to go back to the RFCs to be sure.
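
The two edge cases listed above can be reproduced with a simplified model of whole-block parsing. This is a sketch, not Tornado's implementation: model_parse is a hypothetical name, and the line splitting is an assumption about how the CRLF pairs are removed before each line is examined.

```python
def model_parse(text: str) -> dict:
    """Simplified model of whole-block header parsing (not Tornado's code)."""
    headers = {}
    last_key = None
    for line in text.split("\n"):
        if line.endswith("\r"):
            line = line[:-1]  # drop the CR left over from a CRLF terminator
        if not line:
            continue  # empty line: nothing to parse
        if line[0].isspace():
            # Continuation line: joined to the previous value with one space.
            headers[last_key] += " " + line.lstrip()
        else:
            name, value = line.split(":", 1)
            headers[name] = value.strip()
            last_key = name
    return headers


trailing = model_parse("Content-Length: 42\r\n \r\n")
leading = model_parse("Content-Length:\r\n 42\r\n")
print(repr(trailing["Content-Length"]))  # '42 '
print(repr(leading["Content-Length"]))   # ' 42'
```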

bdarnell avatar Sep 24 '23 18:09 bdarnell

We had some code that manually proxied headers from an upstream request to a response. It passed every line received by an AsyncHTTPClient.fetch header_callback on to parse_line, which triggered this.

chrisstaite-menlo avatar Sep 27 '23 08:09 chrisstaite-menlo

I just tested sending a request with a Content-Length of '0 ', and it worked totally fine. Can you provide an example of a request that causes the problem?

kenballus avatar Jan 31 '24 09:01 kenballus

The Content-Length needs to be the last header which then gets interpreted as a multi-line continuation and then adds a space itself, as stated in the first message.

chrisstaite-menlo avatar Jan 31 '24 09:01 chrisstaite-menlo

Got it; now I can reproduce the bug. Agreed that this is a problem.

Also, it turns out that gunicorn and fasthttp also have this exact same bug.

kenballus avatar Jan 31 '24 16:01 kenballus

> Got it; now I can reproduce the bug. Agreed that this is a problem.

I'm still not clear on what exactly the problem is. Is there an issue with HTTPHeaders.parse() or only with parse_line()? Internally, Tornado only uses parse_line() inside parse() and in curl_httpclient's header callback.

I see that there's a design mismatch in the interfaces of header_callback and parse_line: the former gives you the newlines, while parse_line expects them to be removed (this isn't formally specified but it's implied by the doctest). So you can't actually pass the values from header_callback directly to parse_line, even though this is superficially a reasonable thing to do.
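
One way an application could bridge this mismatch is to remove the line terminator before forwarding each callback line. The sketch below uses a simplified stand-in for parse_line, not Tornado's code. The forward_header_line adapter is hypothetical, and it assumes the callback also delivers the status line and the final blank line.

```python
def model_parse_line(headers: dict, state: dict, line: str) -> None:
    """Simplified model of line-by-line header parsing (not Tornado's code)."""
    if line[0].isspace():
        # Continuation line: append to the previous header's value.
        headers[state["last_key"]] += " " + line.lstrip()
    else:
        name, value = line.split(":", 1)
        headers[name] = value.strip()
        state["last_key"] = name


def forward_header_line(headers: dict, state: dict, raw_line: str) -> None:
    """Hypothetical adapter between a raw-line callback and the line parser."""
    line = raw_line.rstrip("\r\n")  # the parser expects no line terminator
    if not line:
        return  # blank line that ends the header block
    if line.startswith("HTTP/"):
        return  # assumed status line, e.g. "HTTP/1.1 200 OK"
    model_parse_line(headers, state, line)


headers, state = {}, {"last_key": None}
for raw in ["HTTP/1.1 200 OK\r\n", "Content-Length: 0\r\n", "\r\n"]:
    forward_header_line(headers, state, raw)

print(repr(headers["Content-Length"]))  # '0'
```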

There are also a couple of weird edge cases I noted at the bottom of https://github.com/tornadoweb/tornado/issues/3321#issuecomment-1732638393

Does that cover everything or am I missing something?

Solutions to the design mismatch include:

  • Working as intended, just needs better docs
  • Deprecate header_callback in AsyncHTTPClient.fetch and replace it with a separate callback that gives you a pre-parsed HTTPHeaders object. We need a callback that gives you headers before the first streaming chunk, but doing it with raw header lines just pushes unnecessary work into the application.
  • Make parse_line able to handle newlines. This almost works (by accident) because simple headers get stripped, but continuation lines can cause extraneous whitespace.

bdarnell avatar Mar 03 '24 16:03 bdarnell

Aha, now I see the problem. Single-line headers have leading and trailing whitespace stripped, while continuation lines make it possible to construct a header with trailing whitespace, potentially confusing users of that header. RFC 9110 is clear that trailing whitespace should be stripped from header values. I'm going to:

  1. Make continuation lines containing only whitespace an error. The parse_line interface doesn't let us handle this properly (we must preserve internal space but strip trailing space, and we can't tell in the line-by-line interface whether we're looking at a middle line or the last one of a header).
  2. Handle newlines in parse_line, specifically so that lines containing only newlines are no-ops. This fixes the way that the last header gets a trailing space if you use parse_line directly instead of parse.
  3. Emit a deprecation warning on continuation lines. There should be no reason to support this feature any more and we should get rid of it in the future.
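
The three steps above could look roughly like the following. This is an illustrative sketch under assumptions, not the change that was actually made to Tornado: sketch_parse_line and HeaderParseError are hypothetical names, and the headers are kept in a plain dict.

```python
import warnings


class HeaderParseError(Exception):
    """Hypothetical error type used only in this sketch."""


def sketch_parse_line(headers: dict, state: dict, line: str) -> None:
    # Step 2: tolerate a trailing line terminator; a bare newline is a no-op.
    if line.endswith("\n"):
        line = line[:-1]
        if line.endswith("\r"):
            line = line[:-1]
    if not line:
        return
    if line[0] in " \t":
        # Step 1: a continuation line with nothing but whitespace is an error.
        if not line.strip(" \t"):
            raise HeaderParseError("continuation line contains only whitespace")
        if state["last_key"] is None:
            raise HeaderParseError("first header line cannot start with whitespace")
        # Step 3: continuation lines still work but are flagged as deprecated.
        warnings.warn(
            "continuation lines are deprecated", DeprecationWarning, stacklevel=2
        )
        headers[state["last_key"]] += " " + line.lstrip(" \t")
    else:
        name, sep, value = line.partition(":")
        if not sep:
            raise HeaderParseError("no colon in header line")
        headers[name] = value.strip(" \t")
        state["last_key"] = name


headers, state = {}, {"last_key": None}
for raw in ["Content-Length: 0\r\n", "\r\n"]:
    sketch_parse_line(headers, state, raw)

print(repr(headers["Content-Length"]))  # '0'
```

With this shape, the final "\r\n" no longer reaches the continuation branch, so the last header keeps its stripped value.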

bdarnell avatar Jun 05 '24 14:06 bdarnell