Explore Wasmtime as an alternative WebAssembly runtime
Now that Wasmtime has no-std support, it becomes a possible alternative for the platform WASM runtime. This task should track the feasibility of using Wasmtime, since many roadblocks are expected (page size, memory and binary footprint, supported target architectures, releasing control flow, etc).
In particular, we should try to use Pulley.
There have been recent developments on https://github.com/bytecodealliance/wasmtime/issues/7311. I tried to use Pulley on Nordic in the wasm-bench crate (see #753). It seems the generated Pulley bytecode is 34 times larger than the Wasm bytecode (it's an ELF file). Besides, it seems Wasmtime needs to copy it to RAM, which is another issue.
Thank you for starting a conversation about this in the BA's Zulip, @ia0! <3 As @alexcrichton said over there, we'll gladly help out with making Wasmtime a viable option however we can.
Alex already filed two issues[1, 2], which should address the issues of the Pulley bytecode size and of having to keep the bytecode in RAM.
Besides that, Alex also mentioned being able to reduce the size of the runtime itself, by removing the dependency on Serde. We know that there are other ways to shrink the binary size, but perhaps the biggest one might come from disabling a feature: SIMD support incurs a substantial size increase, because of how many opcodes need to be handled. Disabling that should shrink the interpreter meaningfully.
Thank you and Alex for the quick answers and follow-up!
Let me describe how WebAssembly is used in Wasefire and answer Alex's questions:
> I realize that this may be a bit of a stretch, but if you're able to describe what your embedding does (or even better have a fork/project that can be built)
- Wasefire provides 2 APIs (the board API and the applet API) and a "scheduler" sitting between both.
- The board API is a hardware abstraction in Rust (board implementations and the scheduler are written in Rust). It simplifies support for new embedded devices: only this API needs to be implemented.
- The applet API is a system abstraction in WebAssembly (applets are Wasm modules and the scheduler is a Wasm runtime). This API provides applet portability across different embedded devices, think Embedded WASI except it's custom for now (#63).
- The scheduler is meant to run multiple applets concurrently, although currently only one applet can be installed (and executed) at a time. However, applets can be installed or updated dynamically (through a custom USB protocol).
- After considering Wasmtime, Wasmer, Wasmi, and Wasm3, I decided to write my own in-place interpreter[^1]. It differentiates itself from the rest by its small binary size, small memory footprint, slow interpretation, returning control flow for host functions, and default function linking.
- There's also the option to link one native applet to the scheduler (bypassing WebAssembly). This is the other extreme in the design space (applet performance, no applet sandboxing). Ideally Wasmtime would provide yet another point in the design space (applet performance/sandboxing, scheduler flash and RAM footprint, and limited applet binary portability).
- I'm doing Wasm runtime experiments in `crates/wasm-bench`. This benchmark uses the minimal CoreMark from Wasm3, and it's really just to get orders of magnitude (or even answer feasibility questions).
> We know that there are other ways to shrink the binary size, but perhaps the biggest one might come from disabling a feature
I always use default-features = false and enable only what I use, so I'm already expecting to use the minimum set of Wasmtime features.
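For illustration, such a minimal dependency declaration might look like the sketch below; the exact feature names are assumptions and depend on the Wasmtime version, so the actual list should be checked against the wasmtime `Cargo.toml`:

```toml
[dependencies]
# Hypothetical minimal feature set for a no-std, Pulley-only embedding;
# "runtime" and "pulley" are assumed feature names, to be verified.
wasmtime = { version = "*", default-features = false, features = ["runtime", "pulley"] }
```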
> and/or describe what the wasm is doing (or even better share a sample wasm) that'd be awesome.
Applets use less than the WebAssembly MVP. The current interpreter doesn't even support SIMD. It also has optional floats (disabled by default). If you want to check some actual wasm modules, you can run `cargo xtask applet rust NAME` where NAME is the name of the applet and is just a crate at path `examples/rust/NAME/Cargo.toml`. This will produce `target/wasefire/applet.wasm` (and `applet.wasm.orig` in the same directory, before `wasm-strip` and `wasm-opt`). The biggest example so far is `opensk`. Note that you can also use `cargo xtask --release applet rust opensk` (to remove debug printing support) or `cargo xtask --release applet rust opensk --opt-level=z` to optimize for size.
I'm currently on vacation (with the kid, thus little time), but as soon as I'm back I'll try to see if I can add Wasmtime support behind a cargo feature. The main difficulty will be the fact that the scheduler currently assumes the runtime returns control flow on host function calls. I guess I'll be able to use the async API of Wasmtime for this purpose (without an async runtime, just calling poll myself). Another difficulty will be the fact that the current interpreter accepts a way to always link imported functions, but that's only to support linking new applets on old platforms, as long as the imported function is allowed to return an error (there's a common format for all functions) at runtime. That's probably not going to be a blocker.
I'll post updates on this issue.
[^1]: I later discovered Ben Titzer's paper *A fast in-place interpreter for WebAssembly*, whose ideas are currently being implemented in the `dev/fast-interp` branch.
A bit delayed, but thank you for writing that up! It'll take some time to fully digest this but I hope to poke at this in the future.
In the meantime https://github.com/bytecodealliance/wasmtime/pull/10285 triggered another thought/recommendation: you'll want to be sure to set `Config::generate_address_map` to `false` if you aren't already. That should ~halve the size of the `*.cwasm`, and means that you'll lose the ability to get wasm bytecode offsets in backtraces, which I suspect is probably suitable for your use case. (although if it's not, there's some assorted ideas on https://github.com/bytecodealliance/wasmtime/issues/3547 for making this section smaller)
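In code, this tip might look like the following configuration sketch (the method names are taken from the Wasmtime `Config` API; exact signatures may vary by version):

```rust
use wasmtime::{Config, Engine};

fn make_engine() -> wasmtime::Result<Engine> {
    let mut config = Config::new();
    // Drop the wasm-offset-to-machine-code address map: roughly halves
    // the *.cwasm, at the cost of wasm bytecode offsets in backtraces.
    config.generate_address_map(false);
    Engine::new(&config)
}
```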
Also, to confirm, but I suspect you're already doing this: if you strip the binary before compiling it (e.g. remove the name section) it'll make the `*.cwasm` a bit smaller by removing that from the original binary. (or we could also plumb a `Config` option to retain that in the `*.cwasm` if you'd prefer not to strip)
Sorry for the very long delay. I finally got time to use the new version in #819. You can see the diff for the tuning I had to do (rather simple). In terms of performance it's essentially 20x faster than what I currently use, uses 2x the memory, and 2.5x the flash. So it seems usable, I'll try to integrate it in the final product.
You can also see that compared to Wasmi, it has comparable performance, uses 50% more memory (but that's probably just my tuning, I could probably just ask for a 32k wasm stack instead of 64k), but uses half the flash.
Also important, the changes to the cwasm are significant: the module is now 15k.
Oh that's awesome, thanks for the update!
FWIW the `max_wasm_stack` option doesn't actually proactively allocate stack, it's instead just a limiter which prevents going above that threshold, so changing the setting probably won't lead to less memory consumption. That being said, 2x memory may mean there's still room to improve within Wasmtime, so if you're able to identify some of the larger allocations we can try to work on shrinking them or making more of them optional.
> FWIW the `max_wasm_stack` option doesn't actually proactively allocate stack
For Pulley this is the case (and I guess it makes sense). I reduced it to 16k and saw the memory reduction.
Regarding memory usage, there are only 3 allocations of more than 1kB:
- 16k for the stack (or whatever `max_wasm_stack` is set to)
- 1056 for the VM itself
- 64k for the wasm memory
This seems very reasonable to me. The only remaining limitation is the binary size, but I'll just optimize wasmtime for size (since that's the biggest part, 70kB to 80kB) and pulley-interpreter for perf (since it gives 40% perf improvement for double the footprint from 25k to 50k).
> 64k for the wasm memory
That's something the custom page sizes proposal can help with, which is already supported by Wasmtime. It sounds like that should work well for your use case, potentially with page sizes of 4kb, or even less.
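For reference, a module opting into that proposal declares the page size in its memory type. A sketch in the text format, assuming the proposal's `(pagesize N)` syntax and that the embedder enables the feature (e.g. via Wasmtime's `Config::wasm_custom_page_sizes`):

```wat
;; Sketch: a memory with 1-byte pages capped at 16384 bytes,
;; instead of the default 64KiB page granularity.
(module
  (memory 0 16384 (pagesize 1)))
```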
Oh right yes I forgot about that stack, sorry! That should definitely be ok to decrease as you see fit.
I'm not sure I reproduced exactly right but locally I was seeing a 680k binary for the wasm-bench folder compiled to a thumb target, and with https://github.com/bytecodealliance/wasmtime/pull/10727 I was able to get that number down to 580k, so if you don't need simd that should help? That should also shrink the size of the VM allocation too by removing the (probably unused) vector registers from the VM state, leaving just float/integer ops.
> That's something the custom page sizes proposal can help with
Good point. That's definitely going to be useful when we support multiple applets. For now this is not a blocker.
> I'm not sure I reproduced exactly right but locally I was seeing a 680k binary
Weird, the repo should be somewhat hermetic. Running the following command at commit d85f65661519e5159d3d79d240e12ae1dc70b60:
```shell
cargo-size --profile=release-size -Zbuild-std=core,alloc -Zbuild-std-features=panic_immediate_abort,optimize_for_size --target=thumbv7em-none-eabi --features=target-nordic,runtime-wasmtime
```
should give:
```
  text    data     bss     dec     hex  filename
276708      24    1220  277952   43dc0  wasm-bench
```
> so if you don't need simd that should help
Indeed I don't need SIMD. The current interpreter I'm using doesn't even support it (and floats are behind a feature flag). I'll follow the PR.
Aha, I was having various issues which I have now resolved. I couldn't find `cargo-size` over the weekend so I was just looking at the output ELF size. Now I've found it though! Additionally I was using `--release` vs `--profile=release-size`. In release mode I'm seeing a 10% reduction in text size removing simd in the interpreter (410016 => 366260, 43756 bytes removed), and in release-size I'm seeing a 5% reduction removing simd (273944 => 259424, 14520 bytes removed). This was compiling with/without `CARGO_TARGET_THUMBV7EM_NONE_EABI_RUSTFLAGS=--cfg=pulley_disable_interp_simd` on bd85f65661519e5159d3d79d240e12ae1dc70b60; the SHA you linked above I couldn't find in the repo.
I finally found some time to use this feature (see #890). The benefit seems even better:
- Text size went from 263kB to 224kB, so a 40kB reduction (15% on this small example).
- Performance on `coremark-minimal.wasm` went from 3.549 to 4.184, so an 18% improvement. I'm not sure if the performance improvement is expected, but it's reproducible.
Some things to note:
- I realized `cargo size` is the wrong tool to measure text size because it ignores `build-std`. So instead I'm using `rust-size` on the ELF directly.
- The pulley bytecode is twice as big on this benchmark: 7.77kB for wasm versus 15.7kB for pulley. If this is just an 8k constant, it's not an issue. If it's a 2x factor, that might be a limitation. Practice will tell.
- Wasmtime doesn't compile on `riscv32imc-unknown-none-elf` because `gimli` (and other crates) use `alloc::sync::Arc`, which requires atomics. Using the `portable-atomic` and `portable-atomic-util` crates would be a solution. If/when I need to support pulley on this target, I'll create PRs to those projects.
Next step for me is to try to integrate pulley in the final product (focusing on nRF52840 and thus thumbv7em-none-eabi). This might take some time, in particular because I want to have it as an alternative to the existing wasm (using my basic in-place interpreter) and native (i.e. without sandboxing) runtimes.
Nice! I wouldn't have expected the speedups but hey seems nice :)
w.r.t. cwasm size, some local poking around gives:
- 66% is pulley bytecode for wasm-defined functions - this'll be proportional to the input module size (roughly)
- 9.6% is the symbol table which isn't needed at runtime, issue to track removing it here
- 8.8% is wasmtime metadata (the `.wasmtime.{info,engine}` sections) which are fixed cost and shouldn't increase much with module size
- 8.5% is pulley bytecode for trampolines that Wasmtime uses - shouldn't be proportional to input wasm
- 1.9% is the module's own data (e.g. `data` segments), which is proportional to what's in the module
Above (2) can be dropped entirely with some minor work in Wasmtime. The (3) sections can be shrunk in theory but likely won't provide much benefit as they're fixed-cost mostly. The (4) section can probably be shrunk with some more clever options/configuration implemented in Wasmtime as we probably don't need all of the trampolines all the time for all modules. The biggest portion, (1), will require optimizing pulley bytecode further for size. So far it's mostly been optimized for decode-time.
If you're curious you can explore the pulley opcodes with `wasmtime objdump --addresses --bytes`, and for example a random snippet I see is:

```
 55f: 07 8d fe ff ff          jump -0x173    // target = 0x3ec
 564: 07 00 00 00 00          jump 0x0    // target = 0x564
 569: 83 06 11 20 00 00 00    xload32le_o32 x6, x17, 32
 570: 83 07 11 1c 00 00 00    xload32le_o32 x7, x17, 28
 577: 41 08                   xzero x8
 579: 97 09 00 00 d8 1c       xload32le_g32 x9, x7, x6, x24, 0
 57f: 97 03 04 00 c9 1c       xload32le_g32 x3, x7, x6, x9, 4
 585: 97 19 00 00 c9 1c       xload32le_g32 x25, x7, x6, x9, 0
 58b: 97 0a 04 00 d9 1c       xload32le_g32 x10, x7, x6, x25, 4
 591: 9b 04 00 c9 1c 0a       xstore32le_g32 x7, x6, x9, 4, x10
 597: 97 0a 00 00 d9 1c       xload32le_g32 x10, x7, x6, x25, 0
 59d: 9b 00 00 c9 1c 0a       xstore32le_g32 x7, x6, x9, 0, x10
 5a3: 9b 04 00 d9 1c 03       xstore32le_g32 x7, x6, x25, 4, x3
 5a9: 9b 00 00 d9 1c 08       xstore32le_g32 x7, x6, x25, 0, x8
```
Here, Pulley's jump opcodes always have a 4-byte target, but it'd probably make sense to add a 1-byte target too (although using that in Cranelift will be difficult). Loads/stores, for example, are significantly larger than their equivalent wasm opcodes due to all the information they're encoding.
Overall we could probably shave 5-10% off of a function's encoding size with enough elbow grease, but at the end of the day all we'd be doing is reducing the constant factor that the wasm increases by when it's compiled to Pulley. Significantly shrinking further beyond that would require a redesign of the bytecode format (e.g. a stack machine instead of today's register-based machine) and is probably off the table.
Otherwise for riscv32imc and atomics, it may be a bit difficult working that into upstream projects. Do you know if it's possible to instruct LLVM to not actually emit atomics? For example WebAssembly has `-mthread-model=single-thread` (or something like that) which lowers atomics to scalar instructions, and that's probably what you'd want for riscv32imc rather than threading pseudo-atomics around.
Thanks! So ignoring fixed costs (which are negligible), the text section of the cwasm is 52% bigger than the complete wasm, and 25% bigger than the extended wasm we produce for the current in-place interpreter (which contains a pre-computed side-table that adds 22% over wasm). I guess this additional 25% is probably fine for now (while we are still in experiment mode).
I subscribed to the issue regarding removing the symbol tables during compilation (instead of stripping after).
> Overall we could probably shave 5-10% off of a function's encoding size with enough elbow grease
Yeah, I don't think this is worth it at this point.
> Do you know if it's possible to instruct LLVM to not actually emit atomics?
I'm not sure that's the only thing to do. It seems to me I'll need to create a new compilation target and set `target_has_atomic = "ptr"` (since that's what gates `alloc::sync`). Then indeed there might be additional changes on the LLVM side. I'll assess this if/when I need it.
Thanks a lot @alexcrichton and @tschneidereit for your help! I've finished integrating Wasmtime into Wasefire. It is now possible to build a platform and applets with Pulley. The footprint is considerable, but it makes 2 examples work that otherwise wouldn't:
- A security key firmware using OpenSK (this was limited by the wasm interpreter which can't represent the side-table, fixable in theory)
- A BLE advertisement packet sniffer (this was limited by performance)
So Wasmtime is running on a platform with 256K of RAM and 336K of flash. The applet (opensk) is separate and takes 262K of flash. I wonder if there is precedent for running Wasmtime on such resource-constrained devices.
That's awesome and thanks for pushing on this! That's definitely the most resource constrained embedding that I'm aware of myself :)
Would you be ok if we link this usage from our documentation and cite you on those numbers?
> Would you be ok if we link this usage from our documentation and cite you on those numbers?
Sure, with pleasure, you can go ahead :-)
Congratulations and great work @ia0, I'm really proud to see that Wasmtime worked for you in a constrained environment!