HTTP.jl

WIP: Lazy Parsing

Open samoconnor opened this issue 5 years ago • 5 comments

This branch adds a number of modules (with passing tests) but does not integrate them into the HTTP request/response processing code.

Lazy parsing has potential performance and security benefits for HTTP clients and servers.

Consider a multi-layered server request routing framework. The top-level request handling function may only need to look at a single auth-token header to know that it should just close the connection. It can avoid wasting time parsing the rest of the headers, which reduces the impact of DoS attacks (and makes it immune to malformed-header attacks that might be designed to waste memory and/or time processing headers). If the request is authorised, the top-level router may only need to look at the target URI to decide what to do next, so it does not need to bother parsing all the headers. This pattern can continue down the tree of handlers, so that each layer only needs to decode the parts of the header that are actually needed.
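As a rough illustration of that pattern (a sketch only, not code from this branch), a handler could scan the raw header text for one field and stop as soon as it is found. The name `lazy_field` and the `X-Auth-Token` header are assumptions made up for the example, and `eachsplit` needs Julia 1.8 or later.

```julia
# Hypothetical helper, not part of HTTP.jl or this branch.
# Requires Julia 1.8+ for the lazy `eachsplit` iterator.
function lazy_field(raw::AbstractString, name::AbstractString)
    # Drop the start line, then visit header lines one at a time.
    for line in Iterators.drop(eachsplit(raw, "\r\n"), 1)
        isempty(line) && break            # blank line ends the header block
        colon = findfirst(':', line)
        colon === nothing && continue     # malformed line: skipped in this sketch
        field = SubString(line, 1, prevind(line, colon))
        if lowercase(field) == lowercase(name)
            # Return a view into `raw`; later lines are never examined.
            return strip(SubString(line, nextind(line, colon)))
        end
    end
    return nothing
end

raw = "GET /admin HTTP/1.1\r\nX-Auth-Token: bad\r\nAccept: */*\r\n\r\n"
token = lazy_field(raw, "x-auth-token")   # the Accept line is never looked at
```

A top-level handler could compare `token` with the expected value and close the connection on a mismatch, without building a dictionary of all header fields.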

There are similar benefits for clients too. There are many cases where the client of a web services API call just wants to check for 200 OK and then read some JSON from the response body. The headers may be full of all kinds of metadata from various intermediate proxies that is never used.

This PR is intended to be a first step toward adding some components that can later be used to implement lazy HTTP Message processing.

(The so/lazyintegrate branch has a working (but not optimal) integration of the LazyHTTP Parser/Generator with Messages.jl.)

LazyStrings.jl

This module defines AbstractString methods for accessing sub-strings whose length is not known in advance. Length is lazily determined during iteration.
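The idea of a length that is only discovered during iteration can be sketched roughly as follows. This is not the code in LazyStrings.jl: `LazyToken` is a hypothetical type that implements only the iteration protocol, not the full AbstractString interface, and it assumes the sub-string ends at a single delimiter character.

```julia
# Hypothetical type for illustration; LazyStrings.jl differs in detail.
struct LazyToken
    s::String    # shared input buffer, never copied
    i::Int       # index of the first character of the token
    stop::Char   # delimiter that ends the token
end

# The length is unknown until iteration reaches the delimiter.
Base.IteratorSize(::Type{LazyToken}) = Base.SizeUnknown()
Base.eltype(::Type{LazyToken}) = Char

function Base.iterate(t::LazyToken, i::Int = t.i)
    i > ncodeunits(t.s) && return nothing   # ran off the end of the buffer
    c, next = iterate(t.s, i)
    c == t.stop && return nothing           # delimiter found: token is complete
    return c, next
end

target = LazyToken("GET /index.html HTTP/1.1\r\n", 5, ' ')
```

Calling `join(target)` materialises the token as a `String`; until then nothing beyond the buffer reference and the start index is stored.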

LazyHTTP.jl

This module defines RequestHeader and ResponseHeader types for lazy parsing of HTTP headers. RequestHeader has properties: method, target and version. ResponseHeader has properties: version and status. Both types have an AbstractDict-like interface for accessing header fields.

The implementation simply stores a reference to the input string. Parsing is deferred until the properties or header fields are accessed. The value objects returned by the parser are also lazy. They store a reference to the input string and the start index of the value. Parsing of the value content is deferred until needed by the AbstractString interface.

          ┌▶"GET / HTTP/1.1\\r\\n" *
          │ "Content-Type: text/plain\\r\\r\\r\\n"
          │  ▲          ▲
          │  │          │
FieldName(s, i=17)      │        == "Content-Type"
          └──────────┐  │
          FieldValue(s, i=28)    == "text/plain"
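A value object of this kind can be compared with an ordinary string without being copied out of the buffer. The sketch below is an assumption-laden stand-in rather than the PR's implementation: `NameRef` and `name_equals` are hypothetical names, and only the field-name case (start index, terminated by `:`) from the diagram is modelled.

```julia
# Hypothetical stand-in for a lazy field-name object: buffer reference plus start index.
struct NameRef
    s::String
    i::Int
end

# Compare with `other` by walking the shared buffer; stop at the first mismatch.
function name_equals(n::NameRef, other::AbstractString)
    i = n.i
    for c in other
        i > ncodeunits(n.s) && return false
        d, i = iterate(n.s, i)
        (d == ':' || d != c) && return false
    end
    # All of `other` matched: the name must end exactly here.
    return i <= ncodeunits(n.s) && n.s[i] == ':'
end

s = "GET / HTTP/1.1\r\nContent-Type: text/plain\r\n"
name = NameRef(s, 17)   # index 17 is the 'C' of Content-Type, as in the diagram
```

Here `name_equals(name, "Content-Type")` is true, while a prefix such as `"Content"` or a longer string such as `"Content-Types"` is rejected.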

isvalid.jl

Base.isvalid(h::RequestHeader; obs=false)
Base.isvalid(h::ResponseHeader; obs=false)

Regexes for checking the validity of HTTP headers (for cases where a lazy parser does not notice invalidity, but validity is important).
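To give a feel for what such a check involves, here is a simplified sketch of a regex for a single header field line. It is not the regex used in isvalid.jl: `valid_field_line` is a hypothetical name, the pattern follows the RFC 7230 `token` rule for the field name, and it deliberately ignores obs-text and obs-fold (which the `obs` keyword above presumably relates to).

```julia
# Token characters from RFC 7230 section 3.2.6.
const TOKEN = raw"[!#$%&'*+\-.^_`|~0-9A-Za-z]+"

# Simplified field line: token ":" then visible ASCII, space or tab, then CRLF.
# Assumption: obs-text (0x80-0xff) and obs-fold are not accepted here.
const FIELD_LINE = Regex(raw"\A" * TOKEN * raw":[\t\x20-\x7e]*\r\n\z")

# Hypothetical helper: check one complete header line including its CRLF.
valid_field_line(line::AbstractString) = occursin(FIELD_LINE, line)

ok  = valid_field_line("Content-Type: text/plain\r\n")
bad = valid_field_line("Bad Name: x\r\n")   # space is not a token character
```

A full-header check would apply the same idea to the start line and to every field line, which is exactly the work a lazy parser skips unless validity is requested.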

Nibbles.jl

Iterate over byte-vectors 4-bits at a time.

Used for decoding HPack's Huffman code.
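A minimal sketch of 4-bit iteration, assuming the high nibble of each byte comes first (HPACK's Huffman code is read most-significant bit first). The function name `nibbles` is hypothetical and the interface of Nibbles.jl may well differ.

```julia
# Hypothetical sketch: yield the high nibble, then the low nibble, of every byte.
nibbles(bytes::AbstractVector{UInt8}) =
    ((isodd(k) ? b >> 4 : b & 0x0f) for b in bytes for k in 1:2)

halves = collect(nibbles(UInt8[0xab, 0xcd]))
```

Each element is a `UInt8` in the range 0 to 15, so a Huffman decoder can use it directly as an index into a 16-way lookup table.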

HPack.jl

Lazy parsing and string comparison for RFC 7541 "HPACK: Header Compression for HTTP/2".
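One building block any HPACK reader needs is the prefix-coded integer from RFC 7541 section 5.1, which gives the length of each string and so lets a parser skip over fields it does not need. The sketch below follows the pseudocode in the RFC; it is not taken from HPack.jl, the name `hpack_integer` is hypothetical, and it has no guard against oversized or truncated input.

```julia
# Decode an RFC 7541 section 5.1 integer whose first byte is at index `i`.
# Returns the value and the index of the byte that follows it.
function hpack_integer(bytes::AbstractVector{UInt8}, i::Int, prefix_bits::Int)
    mask = (1 << prefix_bits) - 1        # e.g. 31 for a 5-bit prefix
    value = Int(bytes[i]) & mask
    i += 1
    value < mask && return value, i      # the value fits in the prefix
    shift = 0
    while true
        b = bytes[i]
        i += 1
        value += Int(b & 0x7f) << shift  # low 7 bits carry data
        shift += 7
        (b & 0x80) == 0 && break         # high bit clear: last byte
    end
    return value, i
end

# Example C.1.2 from the RFC: 1337 encoded with a 5-bit prefix.
value, next = hpack_integer(UInt8[0x1f, 0x9a, 0x0a], 1, 5)
```

Because the function returns the index of the next byte, a lazy reader can hop from one field to the next without decoding the string contents.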

samoconnor, Oct 16 '18 04:10

The CI is passing sometimes and timing out sometimes: https://travis-ci.org/JuliaWeb/HTTP.jl/builds/442040263?utm_source=github_status&utm_medium=notification

The HPack test takes 55MB of simulated HTTP/2 header streams from https://github.com/http2jp/hpack-test-case and processes them in a variety of ways to ensure that lazy random access works, and full iteration works, and random access after full iteration works, and vice versa... All this takes a bit of time. I might have to add an env var like HTTP_JL_RUN_FULL_HPACK_TEST...
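One way such a switch could look, purely as a sketch and not code from this branch: the variable name is the one floated above, while `all_cases`, `full_run` and `selected_cases` are hypothetical names, and treating only the value "1" as "on" is an assumption.

```julia
# Hypothetical list standing in for the directories of hpack-test-case.
all_cases = ["case-a", "case-b", "case-c"]

# Assumption: the full test runs only when the variable is set to "1".
full_run() = get(ENV, "HTTP_JL_RUN_FULL_HPACK_TEST", "0") == "1"

# Run everything when requested, otherwise only the first case.
selected_cases(cases) = full_run() ? cases : cases[1:min(end, 1)]
```

CI could then leave the variable unset for the quick run and set it in a separate, longer job.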

samoconnor, Oct 17 '18 00:10

With MbedTLS hanging bug fixed, CI now passing.

samoconnor, Oct 22 '18 22:10

This is exciting! Looks like this needs a rebase and perhaps a little squashing along the way? Any recommendations on where to start reviewing or what to look for/worry about?

quinnj, Oct 24 '18 02:10

Yeah there are most definitely a bunch of commits fiddling with CI/toml etc that I can squash. I can rebase and do that today if you're ready to take a look.

> Any recommendations on where to start reviewing or what to look for/worry about?

I was thinking you should start by just doing a sanity check that this PR is really a no-op as far as current exported functionality goes. Aside from all-new files, the changes should be:

  • *.toml contains mysterious stuff that Pkg3 wants.
  • Little tweaks in IODebug.jl and AWS4AuthRequest.jl to be compatible with new structs.
  • Include new files in HTTP.jl (but the newly included modules don't export anything).

Aside from that I'm happy to answer questions about the new code if you have any. I've tried to put a reasonable amount of explanatory documentation in the code, but it would be good to know if there are places where stuff doesn't make sense.

samoconnor, Oct 24 '18 02:10

@samoconnor, this would be really great functionality to have, especially the HTTP/2 support. Will you be able to pick this back up? Otherwise, I could try to dive in and get it merged in.

quinnj, May 30 '19 05:05