squid icon indicating copy to clipboard operation
squid copied to clipboard

ParserNG: Replace HttpHeader::parse with Tokenizer based parser

Open yadij opened this issue 3 years ago • 2 comments

Next stage of Parser updates.

Takes a block of headers in SBuf from Http1::Parser (and other callers with pre-isolated header blocks) and parse into field-name:field-value lines to generate the set of HttpHeaderEntry objects forming the blocks binary representation.

yadij avatar Aug 01 '20 12:08 yadij

I did not review details. Here are some high-level comments that might help you make progress with fewer iterations:

* Instead of passing string view, SBuf, Tokenizer, and similar low-level objects to various internal parsing methods, pass a ParsingContext object (the exact name is to be decided) that combines the SBuf buffer to be parsed with the HTTP parsing state (things like http_hdr_owner_type, ContentLengthInterpreter, etc.). One of the biggest design problems of the old code you are rewriting is that it has no place to keep parsing metadata; passing small snippets of that metadata via function parameters deprives lower-level methods from accessing critical high-level details they sometimes need to make the correct low-level parsing decision.

Understood. FYI, the context info related to parsing is intended to be held by the Parser object. In later PR a Parser instance may be passed to any of the low-level field processing logic. This PR is scoped as steps needed to integrate HttpHeader::parse() code with how Parser code operates ready for the two to be combined. The chunk documented as "stage 4" in Parser.cc should call HttpHeader::parse() or a Parser method doing the same.

* As you know from my recent Advisory [review](https://github.com/squid-cache/squid-ghsa-3365-q9qx-f98m/pull/1#pullrequestreview-457017523), I believe that conversion of header field values to SBuf (at any parsing stage) requires changing HttpHeaderEntry::value type to SBuf.

The thing is; most of the field-value handled by Squid are already in SBuf. Specifically the I/O buffer for whatever socket the message was received on, which gets a "view" of the buffer storage from Parser::mimeHeader(). The current master code pulls that out with c_str() to pass to HttpHeader::parse() [see the chunk in http/Message.cc changed by this PR].

This PR is scoped carefully such that at worst the field-value String/c-string is bug-compatible with existing code. But hopefully improved readability and documentation with the Tokenizer.

yadij avatar Nov 09 '20 04:11 yadij

the context info related to parsing is intended to be held by the Parser object. In later PR a Parser instance may be passed to any of the low-level field processing logic. This PR is scoped as steps needed to integrate HttpHeader::parse() code with how Parser code operates ready for the two to be combined.

That Parser code itself has many serious design-level problems today. I bet that, when combined, those bad/ugly/deficient code bases will not result in the right outcome! Instead, the combination will require another major rewrite. Thus, we will be fighting over a series of PRs, with all but the last fight being over non-technical unwinnable "trust me, it will all make sense at the end" arguments. Moreover, we will have to live with the bad results of the initial steps, waiting for that perfect unity moment to come. Judging by the existence of many serious design problems introduced under similar excuses years ago, that suffering may last a while.

Splitting a large rewrite into small steps is often a great idea, but doing so correctly is very difficult, far more difficult than doing a large rewrite in one PR!

This specific draft PR modifies a lot of code, but many of those modifications are not moving us in the right direction AFAICT. A lot of the modified (and difficult to review) code will most likely have to be rewritten (and reviewed) again during the next stages. This churn results in a lot of waste. I suspect that most of the changes in this PR go outside its declared scope. It is difficult for me to quickly isolate the essential changes, but I bet they will require a lot of work to get right as well.

I am not sure what the best way out of this mess is, but my recommendation is to pause code changes and start from the beginning: Let's try to agree on what the scope of this PR should be.

I think moving in small steps is the right overall approach for the next few PRs in the "improve parsing code" direction, but those steps must be much smaller and simpler than this PR, and they should not require significant future rewrites. I can suggest the following candidates (that will require further detailing/agreement):

  1. Deciding Http::One::Parser future scope/fate. Today, that class creates more problems than it solves.
  2. Move HttpHeaderEntry parsing outside of HttpHeaderEntry.
  3. Move HttpHeader parsing outside of HttpHeader.
  4. Changing HttpHeaderEntry::value type to SBuf
  5. Switching from string view parsing to Tokenizer-based parsing (in the parser classes introduced in previous steps).

The above list is just a sketch and it will be revised as we make progress, of course. At any moment in time, an agreement on the then-current first step is probably sufficient to make progress.

  • As you know from my recent Advisory review, I believe that conversion of header field values to SBuf (at any parsing stage) requires changing HttpHeaderEntry::value type to SBuf.

The thing is; most of the field-value handled by Squid are already in SBuf.

Please interpret my statement as if I agree with yours.

rousskov avatar Nov 09 '20 15:11 rousskov