toywasm
toywasm copied to clipboard
Benchmarks
Hi @yamt ,
it is really cool that you put so many Wasm runtimes on your benchmarks for comparison!
I have a few questions though.
What hardware did you run the benchmarks on? Would be cool if you could write that down somewhere for reproducibility.
Also I saw that wasmi
is included in the script but not showing in the README.md
.
Did wasmi
not work? If it works I'd be interested in the numbers on your machine. :)
Unfortunately running the benchmark script requires quite a setup.
Hi @yamt ,
it is really cool that you put so many Wasm runtimes on your benchmarks for comparison!
thank you. but honestly speaking i feel i added too many runtimes. the purpose of the benchmark was to give a rough idea about toywasm performance. a few runtimes were enough.
I have a few questions though. What hardware did you run the benchmarks on? Would be cool if you could write that down somewhere for reproducibility. Also I saw that
wasmi
is included in the script but not showing in theREADME.md
. Didwasmi
not work? If it works I'd be interested in the numbers on your machine. :) Unfortunately running the benchmark script requires quite a setup.
i run it on my macbook. (MBP 15-inch 2018) the latest wasmi works. (at least to complete the benchmark) it will be included when i happen to run it again.
but honestly speaking i feel i added too many runtimes.
Yeah maybe you did. I think this benchmark with so many different runtimes could even be extended into its own project/repo.
Btw. in order to highlight the strength of the approach you took in your toywasm
you should definitely also provide benchmarks with other metrics. Namely memory consumption and startup times. I think your Wasm runtime which uses the raw Wasm bytecode with just little instrumentation should be fairly good in these two cathegories and may stand out. There certainly are use cases preferring those metrics over execution speed.
it will be included when i happen to run it again.
done. (actually i just pushed an unpushed result i found in local repo.)
but honestly speaking i feel i added too many runtimes.
Yeah maybe you did. I think this benchmark with so many different runtimes could even be extended into its own project/repo.
maybe. i myself am not interested in maintaining such a thing right now though.
Btw. in order to highlight the strength of the approach you took in your
toywasm
you should definitely also provide benchmarks with other metrics. Namely memory consumption and startup times. I think your Wasm runtime which uses the raw Wasm bytecode with just little instrumentation should be fairly good in these two cathegories and may stand out. There certainly are use cases preferring those metrics over execution speed.
thank you for your insights. i agree.
Btw. in order to highlight the strength of the approach you took in your
toywasm
you should definitely also provide benchmarks with other metrics. Namely memory consumption and startup times. I think your Wasm runtime which uses the raw Wasm bytecode with just little instrumentation should be fairly good in these two cathegories and may stand out. There certainly are use cases preferring those metrics over execution speed.thank you for your insights. i agree.
i added a benchmark about startup time and memory consumption. https://github.com/yamt/toywasm/blob/master/benchmark/startup.md
Thank you for those benchmarks! They are really insightful. Seems like there is some room for improvement for wasmi. Your toywasm looks very strong! :)
@yamt thanks to your memory consumption benchmarks I took a better look at wasmi
's internal bytecode and was able to make it more space efficient while keeping the original performance or even slightly boosting it for version 0.30.0
. This also improved translation time (startup time). Thanks again for that initial spark! :)
@yamt thanks to your memory consumption benchmarks I took a better look at
wasmi
's internal bytecode and was able to make it more space efficient while keeping the original performance or even slightly boosting it for version0.30.0
. This also improved translation time (startup time). Thanks again for that initial spark! :)
it's good to hear! thank you for letting me know.
noted for the next run of the benchmark. (probably not too distant future, as i want to see how toywasm regressed by the recent simd addition.)
noted for the next run of the benchmark.
looking forward :)
(probably not too distant future, as i want to see how toywasm regressed by the recent simd addition.)
Oh wow, that's super interesting news since toywasm
is also an interpreter just like wasmi
and I always decided against implementing SIMD in wasmi
since I felt it would just slow down the entire interpreter for not too many actual gains. However, proper benchmarks to verify or disproof this "feeling" is always the best! :)
Having a Wasm runtime that is up to date with the standardised proposals is obviously very nice.
noted for the next run of the benchmark.
looking forward :)
(probably not too distant future, as i want to see how toywasm regressed by the recent simd addition.)
Oh wow, that's super interesting news since
toywasm
is also an interpreter just likewasmi
and I always decided against implementing SIMD inwasmi
since I felt it would just slow down the entire interpreter for not too many actual gains. However, proper benchmarks to verify or disproof this "feeling" is always the best! :)Having a Wasm runtime that is up to date with the standardised proposals is obviously very nice.
i have a similar feeling. but i added it mainly for completeness.
having said that, these "do more work per instruction" style instructions can be rather friendly to interpreters like toywasm because it can hide instruction-fetch-parse overhead.
having said that, these "do more work per instruction" style instructions can be rather friendly to interpreters like toywasm because it can hide instruction-fetch-parse overhead.
At Parity we even found that generating Wasm from Rust source is slightly smaller when enabling Wasm SIMD which is obviously great since translation time can be significant for some practical use cases and it usually linearly scales with Wasm blob size.
I assume you used 64-bit cells for the value stack before the introduction of SIMD to toywasm
. If that's the case, have you simply increased cell size to 128-bit to fit 128-bit vectors from SIMD or do those 128-bit vectors not occupy 2 cells. The latter design is probably more complex but would probably result in fewer regressions of non-SIMD instruction execution and memory consumption overall.
having said that, these "do more work per instruction" style instructions can be rather friendly to interpreters like toywasm because it can hide instruction-fetch-parse overhead.
At Parity we even found that generating Wasm from Rust source is slightly smaller when enabling Wasm SIMD which is obviously great since translation time can be significant for some practical use cases and it usually linearly scales with Wasm blob size.
i guess it uses llvm as a backend? llvm seems to use simd instructions in some interesting ways.
I assume you used 64-bit cells for the value stack before the introduction of SIMD to
toywasm
. If that's the case, have you simply increased cell size to 128-bit to fit 128-bit vectors from SIMD or do those 128-bit vectors not occupy 2 cells. The latter design is probably more complex but would probably result in fewer regressions of non-SIMD instruction execution and memory consumption overall.
in toywasm, the value stack cell size depends on build-time configurations.
before simd, there were two configurations:
- 32-bit cell, i64 uses two cells. (default)
- 64-bit cell, any values use a cell. (faster in many cases. i suppose wasmi
UntypedValue
works similarly to this.)
after simd, there are three:
- 32-bit cell, i64 uses two cells, v128 uses four cells. (still default)
- 64-bit cell (when simd is disabled)
- 128-bit cell (when simd is enabled)
in toywasm, the value stack cell size depends on build-time configurations.
ah, that's very interesting and perfect for research about what cell size is the best for which use case. :) how much complexity did this add to the interpreter compared to having for example fixed 64-bit cell sizes?
i suppose
wasmi
UntypedValue
works similarly to this.
yes, that's correct.
very interesting approach and looking forward to all the results that you are going to pull off of this. :)
in the past I have been using wasm-coremark
to test basic performance of computations for wasmi
in comparison with Wasmtime and Wasm3 using https://github.com/Robbepop/wasm-coremark-rs/tree/rf-update-vms-v2.
however, this is a rather artificial benchmark and probably less ideal than your ffmpeq
and spidermonkey
testcases. what I found out is that the runtime performance of all 3 Wasm runtimes were extremely dependent on the underlying hardware. For example, wasmi
performs quite okay on Intel CPUs and super poorly on M1/2 whereas both wasmi
and Wasm3 furthermore perform pretty badly on AMD CPUs. And Wasmtime performs way better on AMD CPUs than on Intel or M1/2.
Given these severe differences I think it is kinda important to tag your own benchmark results for reproducibility with the hardware (mainly CPU) and OS used.
in toywasm, the value stack cell size depends on build-time configurations.
ah, that's very interesting and perfect for research about what cell size is the best for which use case. :) how much complexity did this add to the interpreter compared to having for example fixed 64-bit cell sizes?
originally toywasm was using fixed 64-bit cells.
later i added TOYWASM_USE_SMALL_CELLS
option to use small (32-bit) cells.
you can see the code blocks ifdef'ed on this macro to see how much complexity is involved.
besides that, i introduced TOYWASM_USE_RESULTTYPE_CELLIDX
and TOYWASM_USE_LOCALTYPE_CELLIDX
to speed up by-index value stack accesses like local.get
.
(when using small cells, local.get
imm somehow needs to be converted to the corresponding location of the cell(s).)
you can consider them as a part of TOYWASM_USE_SMALL_CELLS
as well.
i suppose that it can be simpler for "translating" interpreters like wasmi
because
you can embed many of these pre-calculated information into the translated internal opcodes themselves.
i suppose
wasmi
UntypedValue
works similarly to this.yes, that's correct.
very interesting approach and looking forward to all the results that you are going to pull off of this. :)
in the past I have been using
wasm-coremark
to test basic performance of computations forwasmi
in comparison with Wasmtime and Wasm3 using https://github.com/Robbepop/wasm-coremark-rs/tree/rf-update-vms-v2.however, this is a rather artificial benchmark and probably less ideal than your
ffmpeq
andspidermonkey
testcases. what I found out is that the runtime performance of all 3 Wasm runtimes were extremely dependent on the underlying hardware. For example,wasmi
performs quite okay on Intel CPUs and super poorly on M1/2 whereas bothwasmi
and Wasm3 furthermore perform pretty badly on AMD CPUs. And Wasmtime performs way better on AMD CPUs than on Intel or M1/2.Given these severe differences I think it is kinda important to tag your own benchmark results for reproducibility with the hardware (mainly CPU) and OS used.
interesting. i haven't thought about cpu differences much. all my benchmarks are with:
ProductName: macOS
ProductVersion: 12.6.5
BuildVersion: 21G531
MacBook Pro (15-inch, 2018)
2.2 GHz 6-Core Intel Core i7
What just crossed my mind about cell sizes and SIMD support is the following: Maybe it is practical and efficient to have 2 different stacks, e.g. one stack with 64-bit cells and another stack with 128-bit cells. Both stacks are used simultaneously (push, pop) but exclusively for non-SIMD and SIMD instructions respectively. Due to Wasm validation phase and type checks it should probably be possible to support SIMD without touching the already existing stack and not using this 128-bit cell stack at all (and thus not affecting non-SIMD code) when not using SIMD instructions.
Maybe I am overlooking something here. Although if this was efficient I assume it might introduce less complexity than different cell sizes or having SIMD instructions use 2 cells instead of 1. I am way into speculation here. Implementation/time needed to confirm haha.
What just crossed my mind about cell sizes and SIMD support is the following: Maybe it is practical and efficient to have 2 different stacks, e.g. one stack with 64-bit cells and another stack with 128-bit cells. Both stacks are used simultaneously (push, pop) but exclusively for non-SIMD and SIMD instructions respectively. Due to Wasm validation phase and type checks it should probably be possible to support SIMD without touching the already existing stack and not using this 128-bit cell stack at all (and thus not affecting non-SIMD code) when not using SIMD instructions.
Maybe I am overlooking something here. Although if this was efficient I assume it might introduce less complexity than different cell sizes or having SIMD instructions use 2 cells instead of 1. I am way into speculation here. Implementation/time needed to confirm haha.
it's an interesting idea. random thoughts:
- i guess you can even separate stack for 32-bit and 64-bit.
- v128 values with 128-bit alignment is a nice property for at least certain cpus.
- function parameters/results might be a bit tricky to implement with the approach.
- the height for each stacks need to be tracked for possible unwinding. (eg.
br
) - i'm not sure if it's less or more complex as a whole.
noted for the next run of the benchmark.
looking forward :)
i rerun the benchmarks: https://github.com/yamt/toywasm/blob/master/benchmark/ffmpeg.md https://github.com/yamt/toywasm/blob/master/benchmark/startup.md
wasmi has been improved a lot since the last time. (0.27.0) good work!
Awesome work @yamt and thanks a ton for those benchmarks! 🚀
I am especially fond of the fact that there is nearly no difference between toywasm (SIMD)
and toywasm (no SIMD)
so maybe fixed 128-bit cells are the way to go and not at all super terrible? 🤔 Obviously they consume a bit more memory but even that difference isn't all too significant imo.
Looks like a very successful research conclusion to me for your SIMD implementation in toywasm
! :)
Concerning wasmi
performance: The optimizations I have implemented lately cannot explain this extreme difference so I rather think that maybe the wasmi 0.27.0
version got released without proper optimizations enabled. Unfortunately there is a bug in Cargo (the build tool) that requires manual handling for this to happen and sometimes I forget about this when releasing. 🙈 But still the startup time improvement is quite nice. :)
Awesome work @yamt and thanks a ton for those benchmarks! 🚀
I am especially fond of the fact that there is nearly no difference between
toywasm (SIMD)
andtoywasm (no SIMD)
so maybe fixed 128-bit cells are the way to go and not at all super terrible? 🤔 Obviously they consume a bit more memory but even that difference isn't all too significant imo.Looks like a very successful research conclusion to me for your SIMD implementation in
toywasm
! :)
i guess ffmpeg.wasm (or, probably any C programs) is rather linear-memory intensive than value stack.
Concerning
wasmi
performance: The optimizations I have implemented lately cannot explain this extreme difference so I rather think that maybe thewasmi 0.27.0
version got released without proper optimizations enabled. Unfortunately there is a bug in Cargo (the build tool) that requires manual handling for this to happen and sometimes I forget about this when releasing. 🙈 But still the startup time improvement is quite nice. :)
hmm. wrt 0.27.0, it might be an error on my side. i manually built both versions of wasmi locally as: https://github.com/yamt/toywasm/blob/master/benchmark/notes.md#wasmi
Ah I thought your were simply installing wasmi
via cargo install wasmi_cli
.
The Cargo bug that not all optimizations are properly applied mostly affects certain binaries installed via cargo install
. For wasmi
version 0.30.0
I made sure the optimizations are applied when installing via cargo install
. :)
wasmi
is heavily depending on lto="fat"
as well as codegen-units=1
optimization configs. Without them wasmi
performance easily drops by 100% in some cases (or even more in others).
I just checked the wasmi
Cargo.toml
and it seems that if you are building wasmi
like this then these optimizations are almost certainly not applied. I should probably change the default --release
build here but I was not expecting people to build wasmi
from sources. My fault. The default is without those optimizations enabled since they significantly increase the build time of wasmi
so I usually only enable them for benchmarks or releases.
Ah I thought your were simply installing
wasmi
viacargo install wasmi_cli
. The Cargo bug that not all optimizations are properly applied mostly affects certain binaries installed viacargo install
. Forwasmi
version0.30.0
I made sure the optimizations are applied when installing viacargo install
. :)
things like cargo install
go install
etc scare me a bit. :-)
wasmi
is heavily depending onlto="fat"
as well ascodegen-units=1
optimization configs. Without themwasmi
performance easily drops by 100% in some cases (or even more in others). I just checked thewasmi
Cargo.toml
and it seems that if you are buildingwasmi
like this then these optimizations are almost certainly not applied. I should probably change the default--release
build here but I was not expecting people to buildwasmi
from sources. My fault. The default is without those optimizations enabled since they significantly increase the build time ofwasmi
so I usually only enable them for benchmarks or releases.
it reminded me that, while i wanted to use lto=full for toywasm, cmake insisted to use lto=thin. https://github.com/yamt/toywasm/blob/9c88f24924b8249ad259dbaf62239855dabd219f/lib/CMakeLists.txt#L83-L87
in the meantime, i added a warning about this: https://github.com/yamt/toywasm/commit/9ee47bf2a2cb5d19122b4944d476f08f38f1a531
If you want to benchmark wasmi
with full optimizations and build it from sources you can build it via:
cargo build --profile bench
So instead of --release
you do --profile bench
. :)
Made a pull request to your benchmark docs so it is documented: https://github.com/yamt/toywasm/pull/39
Ever thought of writing a blog post with all your benchmarks about Wasm runtimes? :D Seems like you could pull off quite a bit of information there.
If you want to benchmark
wasmi
with full optimizations and build it from sources you can build it via:cargo build --profile bench
So instead of
--release
you do--profile bench
. :) Made a pull request to your benchmark docs so it is documented: #39
thank you. i commented in the PR.
Ever thought of writing a blog post with all your benchmarks about Wasm runtimes? :D Seems like you could pull off quite a bit of information there.
i have no interest in blog right now.
If you want to benchmark
wasmi
with full optimizations and build it from sources you can build it via:cargo build --profile bench
i rerun with it and pushed the results.
i also updated the procedure for wasmer. (it was not clearing cache as i intended.)
Hi @yamt , thanks a lot for updating me about this!
The new wasmi
results look much more as I would expect, being roughly twice as slow as Wasm3.
It is interesting that your toywasm
has similar startup performance to WAMR classic interpreter but performs way better than it at runtime.
What are your plans forward with toywasm
?
Btw.: I am currently working on a new engine for wasmi
making it more similar to how Wasm3 and the WAMR fast-interpreter work internally. Looking forward to how it performs when it is done in a few weeks/months. :)
Hi @yamt , thanks a lot for updating me about this!
The new
wasmi
results look much more as I would expect, being roughly twice as slow as Wasm3.
good.
It is interesting that your
toywasm
has similar startup performance to WAMR classic interpreter but performs way better than it at runtime. What are your plans forward withtoywasm
?
actually, it seems that which toywasm or iwasm classic is faster depends on the specific apps to run. i haven't investigated further. probably i should, sooner or later.
Btw.: I am currently working on a new engine for
wasmi
making it more similar to how Wasm3 and the WAMR fast-interpreter work internally. Looking forward to how it performs when it is done in a few weeks/months. :)
interesting. is it this PR? https://github.com/paritytech/wasmi/pull/729
actually, it seems that which toywasm or iwasm classic is faster depends on the specific apps to run.
No runtime can be efficient for all the use cases - at least that's what I learned from working on wasmi
. The only hope for a general purpose runtime is to fix all the potential weak spots so that it at least isn't terrible with any potential use case.
Due to your awesome benchmarks I see a huge potential in lazy Wasm compilation for fixing one such weak spot for startup time since a few benchmarked runtimes profit quite a bit because of their lazy compilation and/or Wasm validation.
is it this PR? https://github.com/paritytech/wasmi/pull/729
Yes it is. Although still super WIP at this point. Everything is subject to change. Still trying to figure out best designs for tackling certain problems/challenges. Trade-offs here and there, I just hope all the work will be worth it in the end. I was trying to read code from Wasm3 and WAMR fast interpreter for inspiration for certain problems but in all honesty I find both rather hard to read. Not used to reading high-density C code.
actually, it seems that which toywasm or iwasm classic is faster depends on the specific apps to run.
No runtime can be efficient for all the use cases - at least that's what I learned from working on
wasmi
. The only hope for a general purpose runtime is to fix all the potential weak spots so that it at least isn't terrible with any potential use case.
sure.
being considerably slower than a similar engine (in my case iwasm classic) for a specific app is likely a sign of weak spots, or even a bug.
Due to your awesome benchmarks I see a huge potential in lazy Wasm compilation for fixing one such weak spot for startup time since a few benchmarked runtimes profit quite a bit because of their lazy compilation and/or Wasm validation.
i wonder how common sparsely-used wasm modules like ffmpeg.wasm are.
is it this PR? paritytech/wasmi#729
Yes it is. Although still super WIP at this point. Everything is subject to change. Still trying to figure out best designs for tackling certain problems/challenges. Trade-offs here and there, I just hope all the work will be worth it in the end. I was trying to read code from Wasm3 and WAMR fast interpreter for inspiration for certain problems but in all honesty I find both rather hard to read. Not used to reading high-density C code.
a lot of interesting ideas in the PR. i'm looking forward to see how it performs.
i wonder how common sparsely-used wasm modules like ffmpeg.wasm are.
I can only talk for the use cases of my employer. We use Wasm in two different ways:
- Executing the hot-patchable plugin-like runtime of entire system. This is to allow our users to customize their frameworks. This is usually a longer running process and it is likely that a good chunk of the available functionality is going to be executed eventually. We use Wasmtime as the executor.
- Executing smart contracts. Due to the associated cost model developers of smart contracts are rewarded by doing as little as possible upon a call to a smart contract. Therefore a single smart contract execution usually only uses fraction of the entire Wasm blob for its execution, mostly executing a single function. Persistent data is loaded from and stored to the blockchain. Here we use
wasmi
.
very interesting. thank you for sharing use cases.
Executing smart contracts. Due to the associated cost model developers of smart contracts are rewarded by doing as little as possible upon a call to a smart contract. Therefore a single smart contract execution usually only uses fraction of the entire Wasm blob for its execution, mostly executing a single function. Persistent data is loaded from and stored to the blockchain. Here we use
wasmi
.
the blob size itself doesn't cost there unless it's actually executed?