wasmi
wasmi copied to clipboard
Implement fused instructions
Note: We should be tackling https://github.com/paritytech/wasmi/issues/361 before implementing this issue.
Currently wasmi more or less directly translates from the given Wasm to a one to one representation of wasmi bytecode.
So a Wasm local.get will be translated one to one to a wasmi local.get instruction.
This old but timeless paper tells how the majority of runtime overhead of efficient interpreters (what wasmi is according to the paper) is dictated by the instruction dispatch routines.
Instruction dispatch performance can be improved by many different means.
For example, using direct threaded code architecture instead of a switch based interpreters such as used by wasmi's current instruction dispatch.
However, threaded code architecture requires features not available to Rust at the moment of writing this issue. (2022-01-11)
However, it is also possible to reduce the instruction dispatch overhead simply by reducing the amount of instructions required to model the same program behavior. This is possible by fusing instructions together and forming so-called super instructions.
For wasmi the main challenge is going to be how we can implement this efficiently so that it does not introduce too much overhead during Wasm compilation. This is because we mainly use wasmi for smart contract execution where we expect many small executions and want to execute them as quickly as possible. We are required to compile a Wasm based smart contract before we can execute it, therefore Wasm compilation is important to be kept as efficient as possible. It belongs to the total runtime.
Example
For example if we take the following sequence of Wasm instructions:
local.get 0
i32.const 1
i32.add
local.set 0
We can see that this basically represents what would be written in Rust like:
x += 1;
Where x is represented by the Wasm local variable with index 0.
Ideally we could fuse sequences like these into one super instruction.
For the above sequence of instructions we could for example introduce the following super instruction for wasmi bytecode:
local.i32_inc_const ($local_id 0) ($amount 1)
Which takes a $local_id for the local variable to increment and a constant $amount of type i32.
If we were to successfully replace the above sequence of 4 instructions to this single instruction we reduced the amount of instructions in this example by 75% and therefore we can expect a significant runtime performance improvement.
Problems
Since we want to fuse instructions efficiently we should try to use an architecture with a constant peek.
A peek of 1 would be able to look ahead one Wasm instruction at a time before outputting a wasmi instruction.
The higher the peek amount the more overhead for this routine we can expect. Theoretically for the above instruction that fused 4 instructions together we'd needed peek(3).
I'd argue that using an efficient peek(1) is probably enough if we very carefully build a pyramid of fused instructions.
For example we could define the following super instructions to the wasmi bytecode set:
local.get_and_const $local_id $const_value: That pushes a local variable and a constant to the value stack.i32.add_local_const $local_id $const_value: That pushes the addition of the local variable and a constant value to the value stack.local.i32_inc_const ($local_id 0) ($amount 1): As mentioned earlier.
With those super instruction set in place we could afford a peek(1) architecture that is still able to optimize the sequence of instructions into the single last instruction by fusing instructions eagerly and keeping them still around for potential additional fuses. It is instruction fuses all the way.
Expected Outcomes
The aforementioned paper says that performance improvements of 2x-3x are likely with proper super instruction optimization for efficient interpreters. We should carefully implement and benchmark this feature to test if we can get close to numbers like these. Also we should definitely make sure that our Wasm compilation time stays as efficient as it was.
Found the architecture description of the Wasmtime WAMR (Wasm Micro Runtime) interpreter and its optimizations. It also has fused instructions although a bit different that stated in the first post. They do a transformation of the bytecode that is efficient and produces a register-based bytecode from the original stack based Wasm bytecode. The performance improvements seem to be very promising.
Link: https://www.intel.com/content/www/us/en/developer/articles/technical/webassembly-interpreter-design-wasm-micro-runtime.html
I have conducted some research on instruction dispatch here: https://github.com/Robbepop/interpreter-dispatch-research/tree/main/src/fused
So far it is extremely frustrating (as usual with research) since the most efficient technique seems to be the goo'old switch-loop dispatch that we are currently using in wasmi. Only the tail-call-dispatch can be as efficient with proper optimizations, however, tail call elimination is not guaranteed in Rust so we can screw that.
With this information in head, I tried to go further with my research as to how to best fuse instruction sinks and sources. Sink's can be globals, registers and eventually even linear memory regions and sources can be sinks or constants (immediates). For example:
add <res> <lhs> <rhs>
Here <res> is the sink and it can be either a register, a global variable or a linear memory region.
And both <lhs> and <rhs> are sources and can be either a register, a global variable, a linear memory region or a plain old constant.
Even if we left out the linear memory regions as options we end up with 2_3_3 = 18 combinations of how the add instruction can work.
Why are we doing this? This allows us to fuse instructions. E.g.
local.get 0
i32.const 1
i32.add
global.set 1
Could be represented as a single instruction:
add (global 1) (local 0) (const 1)
We can either what to do at instruction execution OR we offload this at module compile time. I tested both as well as a middle-ground solution between both.
Result is that the variant where we do the decision at instruction execution is a lot faster than the other two. Especially the middle-ground solution was super slow. However, even the fastest solution that decides upon instruction dispatch is approx 25% slower than just loading everything from registers and therefore not fusing instructions.
This issue got superseeded by https://github.com/paritytech/wasmi/issues/361 since https://github.com/paritytech/wasmi/issues/361 already indirectly fuses Wasm instructions by translating them to register machine bytecode. Also https://github.com/paritytech/wasmi/issues/361 has some neat optimizations during translation from Wasm to wasmi bytecode which also may fuse instructions.