
Benchmarks

Open Robbepop opened this issue 2 years ago • 54 comments

Hi @yamt ,

it is really cool that you put so many Wasm runtimes in your benchmarks for comparison! I have a few questions, though. What hardware did you run the benchmarks on? It would be cool if you could write that down somewhere for reproducibility. Also, I saw that wasmi is included in the script but not shown in the README.md. Did wasmi not work? If it works, I'd be interested in the numbers on your machine. :) Unfortunately, running the benchmark script requires quite a setup.

Robbepop avatar Feb 20 '23 19:02 Robbepop

Hi @yamt ,

it is really cool that you put so many Wasm runtimes in your benchmarks for comparison!

thank you. but honestly speaking, i feel i added too many runtimes. the purpose of the benchmark was to give a rough idea of toywasm's performance. a few runtimes would have been enough.

I have a few questions, though. What hardware did you run the benchmarks on? It would be cool if you could write that down somewhere for reproducibility. Also, I saw that wasmi is included in the script but not shown in the README.md. Did wasmi not work? If it works, I'd be interested in the numbers on your machine. :) Unfortunately, running the benchmark script requires quite a setup.

i ran it on my macbook. (MBP 15-inch, 2018) the latest wasmi works. (at least it completes the benchmark) it will be included when i happen to run it again.

yamt avatar Feb 21 '23 06:02 yamt

but honestly speaking i feel i added too many runtimes.

Yeah, maybe you did. I think this benchmark with so many different runtimes could even be extended into its own project/repo.

Btw., in order to highlight the strengths of the approach you took in toywasm, you should definitely also provide benchmarks with other metrics, namely memory consumption and startup times. I think your Wasm runtime, which uses the raw Wasm bytecode with just a little instrumentation, should be fairly good in these two categories and may stand out. There certainly are use cases preferring those metrics over execution speed.

Robbepop avatar Feb 21 '23 07:02 Robbepop

it will be included when i happen to run it again.

done. (actually, i just pushed an unpushed result i found in my local repo.)

yamt avatar Feb 21 '23 13:02 yamt

but honestly speaking i feel i added too many runtimes.

Yeah, maybe you did. I think this benchmark with so many different runtimes could even be extended into its own project/repo.

maybe. i myself am not interested in maintaining such a thing right now though.

Btw., in order to highlight the strengths of the approach you took in toywasm, you should definitely also provide benchmarks with other metrics, namely memory consumption and startup times. I think your Wasm runtime, which uses the raw Wasm bytecode with just a little instrumentation, should be fairly good in these two categories and may stand out. There certainly are use cases preferring those metrics over execution speed.

thank you for your insights. i agree.

yamt avatar Feb 21 '23 13:02 yamt

Btw., in order to highlight the strengths of the approach you took in toywasm, you should definitely also provide benchmarks with other metrics, namely memory consumption and startup times. I think your Wasm runtime, which uses the raw Wasm bytecode with just a little instrumentation, should be fairly good in these two categories and may stand out. There certainly are use cases preferring those metrics over execution speed.

thank you for your insights. i agree.

i added a benchmark about startup time and memory consumption. https://github.com/yamt/toywasm/blob/master/benchmark/startup.md

yamt avatar May 07 '23 09:05 yamt

Thank you for those benchmarks! They are really insightful. Seems like there is some room for improvement for wasmi. Your toywasm looks very strong! :)

Robbepop avatar May 07 '23 13:05 Robbepop

@yamt thanks to your memory consumption benchmarks, I took a better look at wasmi's internal bytecode and was able to make it more space-efficient for version 0.30.0 while keeping the original performance, or even slightly boosting it. This also improved translation time (startup time). Thanks again for that initial spark! :)

Robbepop avatar May 28 '23 20:05 Robbepop

@yamt thanks to your memory consumption benchmarks, I took a better look at wasmi's internal bytecode and was able to make it more space-efficient for version 0.30.0 while keeping the original performance, or even slightly boosting it. This also improved translation time (startup time). Thanks again for that initial spark! :)

it's good to hear! thank you for letting me know.

noted for the next run of the benchmark. (probably in the not too distant future, as i want to see how toywasm regressed with the recent simd addition.)

yamt avatar May 29 '23 13:05 yamt

noted for the next run of the benchmark.

looking forward :)

(probably in the not too distant future, as i want to see how toywasm regressed with the recent simd addition.)

Oh wow, that's super interesting news, since toywasm is also an interpreter just like wasmi, and I always decided against implementing SIMD in wasmi because I felt it would just slow down the entire interpreter for too few actual gains. However, proper benchmarks to verify or disprove this "feeling" are always the best! :)

Having a Wasm runtime that is up to date with the standardised proposals is obviously very nice.

Robbepop avatar May 29 '23 14:05 Robbepop

noted for the next run of the benchmark.

looking forward :)

(probably in the not too distant future, as i want to see how toywasm regressed with the recent simd addition.)

Oh wow, that's super interesting news, since toywasm is also an interpreter just like wasmi, and I always decided against implementing SIMD in wasmi because I felt it would just slow down the entire interpreter for too few actual gains. However, proper benchmarks to verify or disprove this "feeling" are always the best! :)

Having a Wasm runtime that is up to date with the standardised proposals is obviously very nice.

i have a similar feeling. but i added it mainly for completeness.

having said that, these "do more work per instruction" style instructions can be rather friendly to interpreters like toywasm because they can hide instruction-fetch/parse overhead.

yamt avatar May 29 '23 22:05 yamt

having said that, these "do more work per instruction" style instructions can be rather friendly to interpreters like toywasm because they can hide instruction-fetch/parse overhead.

At Parity we even found that the Wasm generated from Rust source is slightly smaller when Wasm SIMD is enabled, which is obviously great, since translation time can be significant for some practical use cases and usually scales linearly with Wasm blob size.

I assume you used 64-bit cells for the value stack before the introduction of SIMD to toywasm. If that's the case, have you simply increased the cell size to 128 bits to fit the 128-bit vectors from SIMD, or do those 128-bit vectors occupy 2 cells? The latter design is probably more complex but would likely result in fewer regressions of non-SIMD instruction execution and memory consumption overall.

Robbepop avatar May 30 '23 07:05 Robbepop

having said that, these "do more work per instruction" style instructions can be rather friendly to interpreters like toywasm because they can hide instruction-fetch/parse overhead.

At Parity we even found that the Wasm generated from Rust source is slightly smaller when Wasm SIMD is enabled, which is obviously great, since translation time can be significant for some practical use cases and usually scales linearly with Wasm blob size.

i guess it uses llvm as a backend? llvm seems to use simd instructions in some interesting ways.

I assume you used 64-bit cells for the value stack before the introduction of SIMD to toywasm. If that's the case, have you simply increased the cell size to 128 bits to fit the 128-bit vectors from SIMD, or do those 128-bit vectors occupy 2 cells? The latter design is probably more complex but would likely result in fewer regressions of non-SIMD instruction execution and memory consumption overall.

in toywasm, the value stack cell size depends on build-time configurations.

before simd, there were two configurations:

  • 32-bit cell, i64 uses two cells. (default)
  • 64-bit cell, any value uses a single cell. (faster in many cases. i suppose wasmi's UntypedValue works similarly to this.)

after simd, there are three:

  • 32-bit cell, i64 uses two cells, v128 uses four cells. (still default)
  • 64-bit cell (when simd is disabled)
  • 128-bit cell (when simd is enabled)

yamt avatar May 30 '23 10:05 yamt

in toywasm, the value stack cell size depends on build-time configurations.

ah, that's very interesting and perfect for research on which cell size is best for which use case. :) how much complexity did this add to the interpreter compared to having, for example, a fixed 64-bit cell size?

i suppose wasmi UntypedValue works similarly to this.

yes, that's correct.

very interesting approach, and I am looking forward to all the results that you are going to pull out of this. :)

in the past I have been using wasm-coremark to test basic performance of computations for wasmi in comparison with Wasmtime and Wasm3 using https://github.com/Robbepop/wasm-coremark-rs/tree/rf-update-vms-v2.

however, this is a rather artificial benchmark and probably less ideal than your ffmpeg and spidermonkey test cases. what I found out is that the runtime performance of all 3 Wasm runtimes was extremely dependent on the underlying hardware. For example, wasmi performs quite okay on Intel CPUs and super poorly on M1/2, whereas both wasmi and Wasm3 furthermore perform pretty badly on AMD CPUs. And Wasmtime performs way better on AMD CPUs than on Intel or M1/2.

Given these severe differences I think it is kinda important to tag your own benchmark results for reproducibility with the hardware (mainly CPU) and OS used.

Robbepop avatar May 30 '23 11:05 Robbepop

in toywasm, the value stack cell size depends on build-time configurations.

ah, that's very interesting and perfect for research on which cell size is best for which use case. :) how much complexity did this add to the interpreter compared to having, for example, a fixed 64-bit cell size?

originally toywasm used fixed 64-bit cells. later i added the TOYWASM_USE_SMALL_CELLS option to use small (32-bit) cells. you can look at the code blocks ifdef'ed on this macro to see how much complexity is involved.

besides that, i introduced TOYWASM_USE_RESULTTYPE_CELLIDX and TOYWASM_USE_LOCALTYPE_CELLIDX to speed up by-index value stack accesses like local.get. (when using small cells, a local.get immediate somehow needs to be converted to the corresponding location of the cell(s).) you can consider them a part of TOYWASM_USE_SMALL_CELLS as well.

i suppose it can be simpler for "translating" interpreters like wasmi because you can embed much of this pre-calculated information into the translated internal opcodes themselves.

i suppose wasmi UntypedValue works similarly to this.

yes, that's correct.

very interesting approach, and I am looking forward to all the results that you are going to pull out of this. :)

in the past I have been using wasm-coremark to test basic performance of computations for wasmi in comparison with Wasmtime and Wasm3 using https://github.com/Robbepop/wasm-coremark-rs/tree/rf-update-vms-v2.

however, this is a rather artificial benchmark and probably less ideal than your ffmpeg and spidermonkey test cases. what I found out is that the runtime performance of all 3 Wasm runtimes was extremely dependent on the underlying hardware. For example, wasmi performs quite okay on Intel CPUs and super poorly on M1/2, whereas both wasmi and Wasm3 furthermore perform pretty badly on AMD CPUs. And Wasmtime performs way better on AMD CPUs than on Intel or M1/2.

Given these severe differences I think it is kinda important to tag your own benchmark results for reproducibility with the hardware (mainly CPU) and OS used.

interesting. i haven't thought about cpu differences much. all my benchmarks are with:

ProductName:    macOS
ProductVersion: 12.6.5
BuildVersion:   21G531
MacBook Pro (15-inch, 2018)
2.2 GHz 6-Core Intel Core i7

yamt avatar May 30 '23 13:05 yamt

What just crossed my mind about cell sizes and SIMD support is the following: maybe it is practical and efficient to have 2 different stacks, e.g. one stack with 64-bit cells and another stack with 128-bit cells. Both stacks are used simultaneously (push, pop) but exclusively for non-SIMD and SIMD instructions respectively. Due to the Wasm validation phase and type checks, it should probably be possible to support SIMD without touching the already existing stack and without using this 128-bit cell stack at all (and thus not affecting non-SIMD code) when no SIMD instructions are used.

Maybe I am overlooking something here. Although if this were efficient, I assume it might introduce less complexity than different cell sizes or having SIMD instructions use 2 cells instead of 1. I am way into speculation here. Implementation/time is needed to confirm, haha.

Robbepop avatar May 30 '23 14:05 Robbepop

What just crossed my mind about cell sizes and SIMD support is the following: maybe it is practical and efficient to have 2 different stacks, e.g. one stack with 64-bit cells and another stack with 128-bit cells. Both stacks are used simultaneously (push, pop) but exclusively for non-SIMD and SIMD instructions respectively. Due to the Wasm validation phase and type checks, it should probably be possible to support SIMD without touching the already existing stack and without using this 128-bit cell stack at all (and thus not affecting non-SIMD code) when no SIMD instructions are used.

Maybe I am overlooking something here. Although if this were efficient, I assume it might introduce less complexity than different cell sizes or having SIMD instructions use 2 cells instead of 1. I am way into speculation here. Implementation/time is needed to confirm, haha.

it's an interesting idea. random thoughts:

  • i guess you could even have separate stacks for 32-bit and 64-bit values.
  • 128-bit alignment for v128 values is a nice property for at least certain cpus.
  • function parameters/results might be a bit tricky to implement with this approach.
  • the heights of both stacks need to be tracked for possible unwinding. (e.g. br)
  • i'm not sure if it's less or more complex as a whole.

yamt avatar May 31 '23 00:05 yamt

noted for the next run of the benchmark.

looking forward :)

i reran the benchmarks: https://github.com/yamt/toywasm/blob/master/benchmark/ffmpeg.md https://github.com/yamt/toywasm/blob/master/benchmark/startup.md

wasmi has improved a lot since the last time. (0.27.0) good work!

yamt avatar May 31 '23 14:05 yamt

Awesome work @yamt and thanks a ton for those benchmarks! 🚀

I am especially fond of the fact that there is nearly no difference between toywasm (SIMD) and toywasm (no SIMD), so maybe fixed 128-bit cells are the way to go and not at all super terrible? 🤔 Obviously they consume a bit more memory, but even that difference isn't all too significant imo.

Looks like a very successful research conclusion to me for your SIMD implementation in toywasm! :)

Concerning wasmi performance: the optimizations I have implemented lately cannot explain this extreme difference, so I rather think that the wasmi 0.27.0 version may have been released without proper optimizations enabled. Unfortunately there is a bug in Cargo (the build tool) that requires manual handling for this, and sometimes I forget about it when releasing. 🙈 But still, the startup time improvement is quite nice. :)

Robbepop avatar May 31 '23 15:05 Robbepop

Awesome work @yamt and thanks a ton for those benchmarks! 🚀

I am especially fond of the fact that there is nearly no difference between toywasm (SIMD) and toywasm (no SIMD), so maybe fixed 128-bit cells are the way to go and not at all super terrible? 🤔 Obviously they consume a bit more memory, but even that difference isn't all too significant imo.

Looks like a very successful research conclusion to me for your SIMD implementation in toywasm! :)

i guess ffmpeg.wasm (or probably any C program) is linear-memory intensive rather than value-stack intensive.

Concerning wasmi performance: the optimizations I have implemented lately cannot explain this extreme difference, so I rather think that the wasmi 0.27.0 version may have been released without proper optimizations enabled. Unfortunately there is a bug in Cargo (the build tool) that requires manual handling for this, and sometimes I forget about it when releasing. 🙈 But still, the startup time improvement is quite nice. :)

hmm. wrt 0.27.0, it might be an error on my side. i manually built both versions of wasmi locally as: https://github.com/yamt/toywasm/blob/master/benchmark/notes.md#wasmi

yamt avatar May 31 '23 15:05 yamt

Ah, I thought you were simply installing wasmi via cargo install wasmi_cli. The Cargo bug where not all optimizations are properly applied mostly affects certain binaries installed via cargo install. For wasmi version 0.30.0 I made sure the optimizations are applied when installing via cargo install. :)

wasmi heavily depends on the lto="fat" and codegen-units=1 optimization configs. Without them, wasmi's performance easily drops by 100% (i.e. a 2x slowdown) in some cases, or even more in others. I just checked the wasmi Cargo.toml, and it seems that if you are building wasmi like this, then these optimizations are almost certainly not applied. I should probably change the default --release build here, but I was not expecting people to build wasmi from source. My fault. The default is without those optimizations enabled since they significantly increase wasmi's build time, so I usually only enable them for benchmarks or releases.
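A Cargo profile carrying those two settings looks roughly like this (an illustrative sketch; wasmi's own Cargo.toml is authoritative for the exact profile it uses):

```toml
# Illustrative Cargo profile enabling the optimizations mentioned above.
[profile.bench]
lto = "fat"          # whole-program link-time optimization
codegen-units = 1    # single codegen unit for better inlining
```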

Robbepop avatar May 31 '23 15:05 Robbepop

Ah, I thought you were simply installing wasmi via cargo install wasmi_cli. The Cargo bug where not all optimizations are properly applied mostly affects certain binaries installed via cargo install. For wasmi version 0.30.0 I made sure the optimizations are applied when installing via cargo install. :)

things like cargo install, go install, etc. scare me a bit. :-)

wasmi heavily depends on the lto="fat" and codegen-units=1 optimization configs. Without them, wasmi's performance easily drops by 100% (i.e. a 2x slowdown) in some cases, or even more in others. I just checked the wasmi Cargo.toml, and it seems that if you are building wasmi like this, then these optimizations are almost certainly not applied. I should probably change the default --release build here, but I was not expecting people to build wasmi from source. My fault. The default is without those optimizations enabled since they significantly increase wasmi's build time, so I usually only enable them for benchmarks or releases.

it reminded me that, while i wanted to use lto=full for toywasm, cmake insisted on using lto=thin. https://github.com/yamt/toywasm/blob/9c88f24924b8249ad259dbaf62239855dabd219f/lib/CMakeLists.txt#L83-L87

in the meantime, i added a warning about this: https://github.com/yamt/toywasm/commit/9ee47bf2a2cb5d19122b4944d476f08f38f1a531

yamt avatar May 31 '23 16:05 yamt

If you want to benchmark wasmi with full optimizations and build it from sources you can build it via:

cargo build --profile bench

So instead of --release you do --profile bench. :) Made a pull request to your benchmark docs so it is documented: https://github.com/yamt/toywasm/pull/39

Ever thought of writing a blog post with all your benchmarks about Wasm runtimes? :D Seems like you could pull off quite a bit of information there.

Robbepop avatar Jun 01 '23 06:06 Robbepop

If you want to benchmark wasmi with full optimizations and build it from sources you can build it via:

cargo build --profile bench

So instead of --release you do --profile bench. :) Made a pull request to your benchmark docs so it is documented: #39

thank you. i commented in the PR.

Ever thought of writing a blog post with all your benchmarks about Wasm runtimes? :D Seems like you could pull off quite a bit of information there.

i have no interest in blogging right now.

yamt avatar Jun 01 '23 11:06 yamt

If you want to benchmark wasmi with full optimizations and build it from sources you can build it via:

cargo build --profile bench

i reran with it and pushed the results.

i also updated the procedure for wasmer. (it was not clearing the cache as i intended.)

yamt avatar Jul 23 '23 12:07 yamt

Hi @yamt , thanks a lot for updating me about this!

The new wasmi results look much more as I would expect, being roughly twice as slow as Wasm3.

It is interesting that your toywasm has similar startup performance to the WAMR classic interpreter but performs way better than it at runtime. What are your plans going forward with toywasm?

Btw.: I am currently working on a new engine for wasmi, making it more similar to how Wasm3 and the WAMR fast interpreter work internally. Looking forward to seeing how it performs when it is done in a few weeks/months. :)

Robbepop avatar Jul 23 '23 13:07 Robbepop

Hi @yamt , thanks a lot for updating me about this!

The new wasmi results look much more as I would expect, being roughly twice as slow as Wasm3.

good.

It is interesting that your toywasm has similar startup performance to the WAMR classic interpreter but performs way better than it at runtime. What are your plans going forward with toywasm?

actually, it seems that whether toywasm or iwasm classic is faster depends on the specific app being run. i haven't investigated further. probably i should, sooner or later.

Btw.: I am currently working on a new engine for wasmi, making it more similar to how Wasm3 and the WAMR fast interpreter work internally. Looking forward to seeing how it performs when it is done in a few weeks/months. :)

interesting. is it this PR? https://github.com/paritytech/wasmi/pull/729

yamt avatar Jul 24 '23 12:07 yamt

actually, it seems that whether toywasm or iwasm classic is faster depends on the specific app being run.

No runtime can be efficient for all use cases - at least that's what I learned from working on wasmi. The only hope for a general-purpose runtime is to fix all the potential weak spots so that it at least isn't terrible for any potential use case.

Due to your awesome benchmarks I see huge potential in lazy Wasm compilation for fixing one such weak spot, startup time, since a few of the benchmarked runtimes profit quite a bit from their lazy compilation and/or Wasm validation.

is it this PR? https://github.com/paritytech/wasmi/pull/729

Yes it is. Although it is still super WIP at this point. Everything is subject to change. I am still trying to figure out the best designs for tackling certain problems/challenges. Trade-offs here and there; I just hope all the work will be worth it in the end. I was trying to read code from Wasm3 and the WAMR fast interpreter for inspiration on certain problems, but in all honesty I find both rather hard to read. I am not used to reading high-density C code.

Robbepop avatar Jul 24 '23 13:07 Robbepop

actually, it seems that whether toywasm or iwasm classic is faster depends on the specific app being run.

No runtime can be efficient for all use cases - at least that's what I learned from working on wasmi. The only hope for a general-purpose runtime is to fix all the potential weak spots so that it at least isn't terrible for any potential use case.

sure.

being considerably slower than a similar engine (in my case iwasm classic) for a specific app is likely a sign of weak spots, or even a bug.

Due to your awesome benchmarks I see huge potential in lazy Wasm compilation for fixing one such weak spot, startup time, since a few of the benchmarked runtimes profit quite a bit from their lazy compilation and/or Wasm validation.

i wonder how common sparsely-used wasm modules like ffmpeg.wasm are.

is it this PR? paritytech/wasmi#729

Yes it is. Although it is still super WIP at this point. Everything is subject to change. I am still trying to figure out the best designs for tackling certain problems/challenges. Trade-offs here and there; I just hope all the work will be worth it in the end. I was trying to read code from Wasm3 and the WAMR fast interpreter for inspiration on certain problems, but in all honesty I find both rather hard to read. I am not used to reading high-density C code.

a lot of interesting ideas in the PR. i'm looking forward to seeing how it performs.

yamt avatar Jul 25 '23 15:07 yamt

i wonder how common sparsely-used wasm modules like ffmpeg.wasm are.

I can only talk for the use cases of my employer. We use Wasm in two different ways:

  • Executing the hot-patchable, plugin-like runtime of the entire system. This is to allow our users to customize their frameworks. This is usually a longer-running process, and it is likely that a good chunk of the available functionality will eventually be executed. We use Wasmtime as the executor.
  • Executing smart contracts. Due to the associated cost model, developers of smart contracts are rewarded for doing as little as possible upon a call to a smart contract. Therefore a single smart contract execution usually only uses a fraction of the entire Wasm blob, mostly executing a single function. Persistent data is loaded from and stored to the blockchain. Here we use wasmi.

Robbepop avatar Jul 26 '23 10:07 Robbepop

very interesting. thank you for sharing use cases.

Executing smart contracts. Due to the associated cost model, developers of smart contracts are rewarded for doing as little as possible upon a call to a smart contract. Therefore a single smart contract execution usually only uses a fraction of the entire Wasm blob, mostly executing a single function. Persistent data is loaded from and stored to the blockchain. Here we use wasmi.

the blob size itself doesn't incur a cost there unless it's actually executed?

yamt avatar Jul 27 '23 12:07 yamt