wasm-micro-runtime
Very slow AOT code generation for large WASM files
Hi,
I have been experiencing some very slow code generation times for large WASM files.
I include a little benchmark I have done with this WASM file: large_code.zip. I compare:
- wamrc: from the current main tip (built in Release mode)
- WAVM: from our fork
- wasmtime: using v7.0.0
For each, I include the instructions to build, the command to generate the machine code, and the time it took.
wamrc:
Build wamrc with CMAKE_BUILD_TYPE=Release from the latest commit.
time wamrc -o large_code.aot large_code.wasm
# this takes around 3' in my system
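For reference, the build steps look roughly like this (a sketch assuming the standard wamr-compiler directory and its build_llvm.sh helper; paths may differ on your machine):
cd wamr-compiler
./build_llvm.sh                        # build the LLVM dependency first
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release    # Release build, as above
make -j$(nproc)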
WAVM:
WAVM is already built in the docker image used in the command below; you may find the Dockerfile here.
# Mount `pwd` to have access to the .wasm file inside the container;
# all the build artifacts are in /build, so we can safely overwrite the
# /code directory
docker run --rm -it -w /code -v $(pwd):/code csegarragonz/wavm:faasm bash
time /build/bin/wavm compile large_code.wasm large_code.aot
# this takes around 1'20" in my system!
wasmtime:
Install using:
curl https://wasmtime.dev/install.sh -sSf | bash
time ~/.wasmtime/bin/wasmtime compile large_code.wasm
# this takes arround 6" in my system (?!?!)
Admittedly, I am not very familiar with wasmtime, nor do I have any idea why it is so much faster; I suspect I am doing something wrong. That being said, wasmtime uses a different code generator, but WAVM is also LLVM-based, so how come it is more than twice as fast?
NB: these results are specific to my machine, but, at least for the WAMR/WAVM comparison, I have seen consistent numbers across a variety of Intel x86 CPUs.
NB2: the attached WASM file contains a lot of custom native symbols that are only defined in our embedder, so you cannot run it with iwasm. I thought that did not really matter for getting the point across.
Hi, WAMR and WAVM are LLVM-based while wasmtime is Cranelift-based, and it genuinely takes more time for the LLVM-based compilers to compile wasm files. Moreover, WAMR uses the LLVM new pass manager and may apply more optimizations than WAVM, so it may take more time than WAVM to compile a wasm file. There are some methods that may reduce the compile time of wamrc (a combined example follows the list):
- Try using size level 2 or 1: wamrc --size-level=2 or wamrc --size-level=1
- Try using opt level 2: wamrc --opt-level=2
- Try removing some optimizations, e.g.: https://github.com/bytecodealliance/wasm-micro-runtime/blob/dev/segue_opt/core/iwasm/compilation/aot_llvm_extra.cpp#L333-L343
- Remove the module verification: comment out these lines in verify_module and return true directly: https://github.com/bytecodealliance/wasm-micro-runtime/blob/main/core/iwasm/compilation/aot_compiler.c#L2604-L2615
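For instance, combining the first two suggestions on the file from this issue (an illustrative invocation, not a tuned recommendation):
time wamrc --size-level=1 --opt-level=2 -o large_code.aot large_code.wasm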
@csegarragonz Recently we implemented the segue optimization for LLVM AOT/JIT, see #2230. Normally (for many cases) it can improve performance, reduce the compilation time of AOT/JIT, and reduce the size of the generated AOT/JIT code. Currently it supports the linux platform and the linux-sgx platform on x86-64; could you have a try? The usage is:
wamrc --enable-segue or wamrc --enable-segue=<flags>
iwasm --enable-segue or iwasm --enable-segue=<flags> (iwasm is built with LLVM JIT enabled)
flags can be:
i32.load, i64.load, f32.load, f64.load, v128.load,
i32.store, i64.store, f32.store, f64.store, v128.store
Separate multiple flags with commas, e.g. --enable-segue=i32.load,i64.store.
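For instance, with the large_code.wasm from this issue (illustrative commands; any subset of the flags above is passed the same way):
time wamrc --enable-segue -o large_code.aot large_code.wasm
# or restrict segue to specific opcodes:
time wamrc --enable-segue=i32.load,i64.store -o large_code.aot large_code.wasm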
Hey @wenyongh thanks for pointing this out!
Just to double check, will this optimisation benefit me if I am using x86-64 on linux with HW bound checks enabled?
As far as I can tell, bound checks weren't performed explicitly anyway, but were delegated to the OS by placing the linear memory at the beginning of a contiguous region of 8GB of virtual memory and protecting the memory pages? Please correct me if I am wrong!
(I am not asking about SGX here; I understand the segue optimisation could benefit my SGX use cases.)
Yes, it may benefit you whether or not --bounds-checks=1 is added for wamrc. The memory access boundary check in the AOT code depends only on i + memarg.offset (i is popped from the stack, memarg.offset is encoded in the bytecode); it is not related to the base address of the linear memory.
Normally both the compilation time and the binary size are reduced, since the optimization simplifies the LLVM IR that loads/stores the linear memory and decreases the size of the load/store instructions. The runtime performance may degrade in some cases, though: we found that some LLVM optimizations may not take effect when segue is enabled, and the result depends on which flags are enabled. For example, for the CoreMark workload, performance gets worse with wamrc --enable-segue but gets better with wamrc --enable-segue=i32.store.
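Since the effect is workload-dependent, it may be worth timing a few variants against your actual module (illustrative commands reusing large_code.wasm from above):
time wamrc --enable-segue=i32.store -o large_code.aot large_code.wasm
# segue also combines with explicit bounds checks:
time wamrc --bounds-checks=1 --enable-segue=i32.store -o large_code.aot large_code.wasm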