wasm-tools icon indicating copy to clipboard operation
wasm-tools copied to clipboard

Refactor wasmparser to avoid "double parsing"

Open alexcrichton opened this issue 3 years ago • 3 comments

There's currently two primary reasons that a wasm module has bits and pieces of it which end up being "double parsed":

  • First is that the BinaryReader has some skip_* methods which are used. These methods are typically used to conform to the iterator protocol of Rust. For example when looking at ElementSectionReader when you call read it acts as an iterator, repositioning at the next element. The processing of the first element, however, happens by the consumer, which must happen afterwards. This means that the read method must skip to the next element for the next call to `read. Affected locations are:

    • ElementSectionReader::read
    • DataSectionReader::read
    • GlobalSectionReader::read
    • FunctionBody::get_operators_reader
    • FunctionLocalReader::read (and probably more in this file)
    • InstanceSectionReader::read
  • Secondly the API design of the Validator type is such that it will always parse "header" sections, and then consuming applications (like wasmtime) are likely to then re-parse content of the section again. For example Validator::import_section will parse the import section, but then wasmtime also will iterate over the import section, re-parsing everything.

In general this isn't a massive concern because the header sections are likely all dwarfed in size by the code section so parsing is quite fast. Nonetheless we should strive to parse everything in a wasm file precisely once. I think to fix this we'll need two features, one for each problem above:

  1. For the first issue I think we'll want to move towards a more advancing-style API rather than an iterator-based API. For example we'd have a dedicated type for reading the element section, and you'd say "read the header" followed by "read the elements". We might be able to use Drop and clever trickery to skip over data that wasn't explicitly read, or we could simply panic if methods aren't called in the right order. The downside of this is that consumers are likely going to get a little more complicated, but this may be fixable with clever strategies around APIs. I'm not sure how this would exactly look like.

  2. For the second issue we'll want to add more APIs to the validator. For example instead of taking the import section as a whole we'd probably want to add something like "the import section is starting with this many items" which gives you a "sub-validator" which is used to validate each import after parsing. What I'm roughly imagining is that the application does all the parsing and then just after parsing feeds in everything to the validator. Another possible alternative is a "validating parser" which automatically feeds parsed values into the validator before handing them to the application. I'm not sure if this alternative is possible with "parse everything precisely once", however, since for example the element section ideally shouldn't be parsed twice, just once.

alexcrichton avatar Jan 07 '21 20:01 alexcrichton

I just hit a bit of an extreme case of this today, and is showing how this is a fundamental flaw I think in the current API design of wasmparser. mutated.wasm.gz is a wasm module that currently takes ~30ms to validate in wasmparser, which is quite a long time relative to the size of the module, 557k. This is a fuzz-generated module where the vast majority of the size is the element section:

types                                    |        0xc -     0x5d9d |     23953 bytes | 557 count
imports                                  |     0x5da0 -     0x64cb |      1835 bytes | 256 count
functions                                |     0x64cd -     0x64d2 |         5 bytes | 2 count
memories                                 |     0x64d4 -     0x64dc |         8 bytes | 1 count
globals                                  |     0x64df -     0x8aad |      9678 bytes | 759 count
exports                                  |     0x8ab0 -     0x90e7 |      1591 bytes | 243 count
elements                                 |     0x90eb -    0x8b0b0 |    532421 bytes | 314 count
code                                     |    0x8b0b3 -    0x8b1c2 |       271 bytes | 2 count

This was specifically exposed with the current mutate fuzzer which, as of the time of this writing, may validate the input module up to 100 times as 100 different mutations are tested. With a 30x ish slowdown while fuzzing that means that the 60s time limit gives 20ms-per-iteration to validate this module, hence the timeout which originally led to this comment (as validation takes 30ms today)

I was originally thinking of simply reducing the 100 iteration limit to 10 iterations, but I tested Node.js and it can validate this module 10x faster than wasmparser, doing so in 3ms instead of 30ms. It turns out that validating large element sections defined with expressions is egregiously bad in wasmparser in terms of duplicated reads:

I believe this means that all initialization expressions are parsed 3 times (once to skip per segment, once to skip to read the init expr, once to actually read the init expr). I think that the individual elements are parsed 3 times as well (once to skip per segment, once when reading the expression, once when then actually reading for validation). If wasmparser were embedded somewhere that would actually double these costs since the element section is parsed internally inside of the validator, so all of the parsing would need to happen again externally from the validator.

Personally I see no way to solve this in wasmparser without a fundamental redesign of how parsing/validation is expected to work. The current design of "read something, feed it to validate, then read it yourself" fundamentally requires this re-parsing. This sort of works for length-prefixed sections like the code section and each individual function, but otherwise wasm is not structured to be easily skippable within a section.

alexcrichton avatar Mar 07 '22 16:03 alexcrichton

Just a wild shot, couldn’t the skipping be implemented in the drop implementation of the value returned by the iterator? While it wouldn’t resolve the problem in general, it could help with inherent double-parsing of some entities. The downside is, of course – you can no longer keep a reader alive before advancing the iterator again, but usefulness of that seems dubious?

nagisa avatar Aug 04 '22 20:08 nagisa

I do think that could solve some of the double-parsing issues, yeah. I would worry though about the complexity of implementing that in addition to ensuring that it doesn't generate unnecessary checks/etc in the case where the values are actually consumed.

alexcrichton avatar Aug 05 '22 18:08 alexcrichton