
Excessive compile time for "simple" module

Open · alexcrichton opened this issue 7 months ago · 1 comment

OSS-Fuzz has had an open bug for this for quite some time, but I'm only just now getting around to filing an issue about it. The gist of the fuzz bug is that Wasmtime times out in the compile fuzz target, whose only job is to compile a module. A timeout here means that the 60s time limit is exceeded, and that limit applies with sanitizers and with parallelism disabled on OSS-Fuzz infrastructure. This can be roughly approximated by running locally in release mode with `-C parallel-compilation=n` and multiplying the result by ~30.

The module in question is:

(module 
    (func)
    (func)
    ;; ... 119248 times ...
    (func)
)
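For anyone who wants to reproduce this locally without the attachment, a module of this shape is easy to generate. This is a hypothetical helper (not part of the fuzz target) that emits the `.wat` text, which can then be fed to `wasmtime compile` (it accepts the text format) or converted with `wat2wasm`:

```rust
/// Build the text of a module containing `n` empty functions,
/// matching the shape of the module in this issue.
fn make_module(n: usize) -> String {
    let mut wat = String::from("(module\n");
    for _ in 0..n {
        wat.push_str("    (func)\n");
    }
    wat.push_str(")\n");
    wat
}

fn main() {
    // 119248 is the function count from the original fuzz input.
    let wat = make_module(119248);
    std::fs::write("foo.wat", wat).expect("failed to write foo.wat");
}
```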

aka this is just a giant module containing a lot of empty functions. foo.wasm.gz is the compressed version of this module.

Locally I see:

$ time wasmtime compile -C parallel-compilation=n foo.wasm
wasmtime compile -C parallel-compilation=n foo.wasm  1.05s user 0.15s system 99% cpu 1.205 total

which, for a bunch of empty functions, is quite a lot! There are a lot of functions in this module, but ~10 microseconds per empty function feels a bit excessive regardless. While optimizing this probably won't help too much in the long term, it's perhaps still worthwhile to improve, if only for OSS-Fuzz timeouts and fuzzing throughput.

A profile of the compilation looks like this, which notably spends the most "self" time in memmove. There's also a fair amount of allocation traffic. I'm not sure how to improve the regalloc2 parts myself, but the memmove cost I do know how to improve.

Moving MachBuffer less

The basic problem I've seen is that Cranelift is pretty liberal about moving data structures by ownership between phases, notably the MachBuffer<T>. This type is very large (lots of SmallVecs) and is created and moved quite a lot throughout a compilation, which I believe adds up to a significant memmove cost.

The movements I've seen are:

In general I don't think rustc/LLVM are capable of eliding most of these copies, which means that for each function we're copying this very large structure ~6 times (ish). Multiply that by ~100k functions and the size of the structure, and that's a lot of memory moving around; it can probably explain at least a good portion of the second of compile time for this module.
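To illustrate the cost being described (with made-up types, not Cranelift's actual ones): a struct with large inline storage, analogous to MachBuffer's inline SmallVecs, can turn every by-value pass or return into a memcpy/memmove of the whole inline capacity:

```rust
// Illustrative stand-in for a buffer with large inline (SmallVec-style)
// storage. Every move of this value can copy the full 4 KiB.
struct BigBuffer {
    data: [u8; 4096], // stand-in for SmallVec inline capacity
    len: usize,
}

fn emit() -> BigBuffer {
    // Moved out of this function to the caller...
    BigBuffer { data: [0; 4096], len: 0 }
}

fn finalize(buf: BigBuffer) -> BigBuffer {
    // ...moved in here, then moved back out again.
    buf
}

fn main() {
    // Two ownership transfers, each potentially a multi-KiB memmove
    // that LLVM may or may not be able to elide.
    let buf = finalize(emit());
    assert_eq!(buf.len, 0);
}
```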

Ideally we would refactor Cranelift to require much less movement of the MachBuffer type. In an ideal world we could even reuse MachBuffer structures between function compilations. In any case we can probably get a long way by restructuring ownership of the MachBuffer.
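The reuse idea could look something like the following sketch (a hypothetical API, not Cranelift's): the caller owns one buffer for the whole compilation and lends it out mutably per function, so capacity is retained and nothing large is ever moved by value:

```rust
// Hypothetical buffer reused across per-function compilations.
struct Buffer {
    bytes: Vec<u8>,
}

impl Buffer {
    fn clear(&mut self) {
        // Drops contents but keeps the allocated capacity for reuse.
        self.bytes.clear();
    }
}

/// Compile "into" a caller-owned buffer instead of returning one by value.
fn compile_into(buf: &mut Buffer, func_size: usize) {
    buf.clear();
    buf.bytes.resize(func_size, 0); // stand-in for actual code emission
}

fn main() {
    let mut buf = Buffer { bytes: Vec::new() };
    for size in [16, 8, 32] {
        // No per-function allocation (after warm-up) and no large moves.
        compile_into(&mut buf, size);
    }
    assert_eq!(buf.bytes.len(), 32);
}
```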

Other compiler structures

I've seen other large compiler structures in the profile, such as CompilerContext, which are moved around a lot. Ideally we could perhaps Box up some contexts and/or make the moves cheaper.
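Boxing helps because moving a `Box<T>` only copies a pointer, regardless of how big `T` is; the payload stays put on the heap. A minimal sketch (illustrative type, not the real CompilerContext):

```rust
// A large context struct; by value, moving it copies ~4 KiB.
struct Ctx {
    scratch: [u64; 512],
}

/// Moving a Box<Ctx> copies only a pointer, not the 4 KiB payload.
fn take(ctx: Box<Ctx>) -> Box<Ctx> {
    ctx
}

fn main() {
    let ctx = Box::new(Ctx { scratch: [0; 512] });
    let ctx = take(ctx); // pointer-sized move
    assert_eq!(ctx.scratch.len(), 512);
    // Box<Ctx> is a thin pointer: one machine word.
    assert_eq!(
        std::mem::size_of::<Box<Ctx>>(),
        std::mem::size_of::<usize>()
    );
}
```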


I'm sure there are other parts of the profile to dig into as well, but I wanted to at least file an issue in case anyone's interested in chipping away at some pieces here.

alexcrichton avatar Apr 14 '25 16:04 alexcrichton

The intent of MachBuffer's inline SmallVecs was to avoid any allocations at all when the buffer lives on the stack, is moved at most once, and is used for small-to-medium-sized functions. That was a very speculative benefit when we sketched out the new backend framework in 2020, and it looks likely that refactors and added complexity over the years have adapted to moving it around liberally (as is more idiomatic).

Perhaps a simple fix could be to Box it up? Alternatively, we could convert all the SmallVecs to Vecs. I don't have the cycles to try to benchmark either at the moment -- this is just to say that there's no reason we should be wedded to the current approach other than legacy...
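The trade-off between those two options boils down to struct size versus allocation: inline storage is large to move but allocation-free, while Vec-backed storage is three words to move but pays a heap allocation on first use. A quick sketch with illustrative types (not Cranelift's):

```rust
// Option A: inline storage, like SmallVec with large inline capacity.
// Cheap to fill (no alloc), expensive to move (whole array is copied).
struct Inline {
    buf: [u8; 1024],
    len: usize,
}

// Option B: heap storage. Allocates on first push, but a move only
// copies Vec's (ptr, capacity, len) header: three machine words.
struct Heap {
    buf: Vec<u8>,
}

fn main() {
    assert!(std::mem::size_of::<Inline>() > std::mem::size_of::<Heap>());
    assert_eq!(
        std::mem::size_of::<Heap>(),
        3 * std::mem::size_of::<usize>()
    );
}
```

Boxing the whole MachBuffer gets the cheap-move property in one step; converting SmallVecs to Vecs gets it field by field while keeping the outer type on the stack.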

cfallin avatar Apr 15 '25 01:04 cfallin