
Optimization: Bytecode dispatching via function pointers

Open Robbepop opened this issue 3 years ago • 2 comments

Currently wasmi bytecode is organized as a simple enum with one variant per wasmi bytecode operator. A subset of those operators looks like the following:

pub enum Instruction {
    GetLocal { local_depth: u32 },
    Br(Target),
    BrTable { len_targets: usize },
    Unreachable,
    Return(DropKeep),
    Call(FuncIdx),
    I32Load(Offset),
    Const(UntypedValue),
    I32Add,
    // etc ...
}

When executing a wasmi bytecode function we simply traverse its internal slice of Instruction and have a giant match over all Instruction variants that dispatches to the respective execution handler. This match procedure could be removed by embedding the execution handler directly into the wasmi bytecode via function pointers. For this to work we need to split Instruction into a handler part and a data part, where handler is a function pointer to the handler function and data is a generic data enum.
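For illustration, the loop+match dispatch described above can be sketched as follows. This is a minimal, self-contained sketch with made-up instruction variants and a plain `i64` value stack, not the actual wasmi executor:

```rust
/// A tiny stand-in for the real wasmi `Instruction` enum.
#[derive(Copy, Clone)]
enum Instruction {
    Const(i64),
    I32Add,
    Return,
}

/// Executes instructions via the classic loop+match dispatch:
/// every iteration performs one indirect branch through the `match`.
fn execute(instrs: &[Instruction]) -> Option<i64> {
    let mut stack: Vec<i64> = Vec::new();
    let mut ip = 0;
    loop {
        match instrs[ip] {
            Instruction::Const(value) => stack.push(value),
            Instruction::I32Add => {
                let rhs = stack.pop()?;
                let lhs = stack.pop()?;
                stack.push(lhs.wrapping_add(rhs));
            }
            // `Return` is the only way out of the hot loop in this sketch.
            Instruction::Return => return stack.pop(),
        }
        ip += 1;
    }
}
```

The per-instruction cost here is the bounds check, the discriminant load, and the jump through the `match` jump table, which is exactly the dispatch overhead this issue proposes to remove.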

The problem with such a generic data enum is that we'd end up with just another match and would win nothing in the end. Therefore we actually need a generic data union. While reading from a union is unsafe, each handler function knows which variant its instruction carries and can therefore interpret the generic data union as the correct data.

Therefore such a new wasmi bytecode Instruction type would look something like this:

/// Generic set of parameters for `wasmi` bytecode instructions.
///
/// The respective instruction dispatch handler is responsible to
/// select the correct instruction parameter variant out of the `union`.
#[derive(Copy, Clone)]
union InstructionParams {
    /// Used by any instruction that does not have inputs.
    none: (),
    /// Used by `local.get`, `local.set` and `local.tee`.
    local_depth: u32,
    /// Used by `br`, `br_if_eqz` and `br_if_nez`.
    target: Target,
    /// Used by `return`.
    drop_keep: DropKeep,
    /// Used by `br_table`.
    len_targets: usize,
    /// Used by `call`.
    func_idx: FuncIdx,
    /// Used by `call_indirect`.
    signature_idx: SignatureIdx,
    /// Used by `global.get` and `global.set`.
    global_idx: GlobalIdx,
    /// Used by all the linear memory `load` and `store` operations.
    offset: Offset,
    /// Used by `{i32,i64,f32,f64}.const`.
    value: UntypedValue,
}

/// Handles instruction dispatch.
type InstructionHandler = fn(&InstructionParams) -> Result<(), TrapCode>;

/// A `wasmi` instruction consists of its dispatch handler and its parameters.
pub struct Instruction {
    /// The instruction dispatch handler.
    handler: InstructionHandler,
    /// The parameters of the instruction if any.
    params: InstructionParams,
}
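A handler under this scheme might look like the following. This is a hedged, self-contained sketch: the types are simplified stand-ins for the real wasmi types (e.g. `i64` in place of `UntypedValue`), and `handle_local_get` is a hypothetical handler name:

```rust
/// Simplified stand-in for the `InstructionParams` union above.
#[derive(Copy, Clone)]
union InstructionParams {
    none: (),
    local_depth: u32,
    value: i64, // stand-in for `UntypedValue`
}

#[derive(Debug, PartialEq)]
enum TrapCode {
    Unreachable,
}

type InstructionHandler = fn(&InstructionParams) -> Result<(), TrapCode>;

struct Instruction {
    handler: InstructionHandler,
    params: InstructionParams,
}

/// Hypothetical handler for `local.get`: only this handler knows that
/// its instruction's `params` holds the `local_depth` variant.
fn handle_local_get(params: &InstructionParams) -> Result<(), TrapCode> {
    // SAFETY: the bytecode translator only ever pairs this handler
    // with params initialized as `local_depth`.
    let depth = unsafe { params.local_depth };
    // ... a real implementation would read the local at `depth` here ...
    let _ = depth;
    Ok(())
}

fn handle_unreachable(_params: &InstructionParams) -> Result<(), TrapCode> {
    Err(TrapCode::Unreachable)
}

/// Dispatch is a direct call through the stored function pointer:
/// no `match` over instruction variants is needed anymore.
fn dispatch(instr: &Instruction) -> Result<(), TrapCode> {
    (instr.handler)(&instr.params)
}
```

Note how the `unsafe` is confined to the single union field read inside each handler, which is the encapsulation argument made in the drawbacks below.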

Drawbacks

  • This will cause the Instruction type to require more stack space, which could in the worst case even degrade performance. This approach also introduces plenty of unsafe Rust code; however, we can encapsulate this unsafe code pretty well inside the handler functions, since it is mainly required for selecting the respective InstructionParams variant out of the union's set of possibilities.
  • It is not yet clear to me how we can handle very fast dispatch for instructions that may bail out of the hot instruction scheduling, such as call, call_indirect and return. Currently they are inlined into the big match expression, which leads to very efficient codegen; however, due to the generic interface enforced by the InstructionHandler function pointer we can no longer easily return arbitrary values from these handlers, which makes this pattern impossible. More elaborate thinking is needed here. If we had guaranteed tail call optimization in Rust we could solve this problem using tail call based instruction scheduling.

Advantages

  • Once Rust supports guaranteed tail call optimizations this new wasmi bytecode format would make it very simple to use a tail call based dispatching strategy which is known to be superior to the current dispatching routines.
  • This paper from the WAMR authors shows a 7% performance improvement using this dispatching technique.

Robbepop avatar Sep 05 '22 08:09 Robbepop

This GitHub Gist shows an example of how tail calls could be achieved using this dispatch architecture.

Copying this code snippet over to Godbolt Compiler Explorer shows that Rust compiles this down to tail calls when using --release.
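A minimal sketch of that tail-call style dispatch, independent of the linked Gist, could look like this. Names and the accumulator-based design are illustrative; Rust does not guarantee tail calls, but in `--release` builds LLVM typically compiles the trailing calls below into jumps:

```rust
/// An instruction is a handler pointer plus an immediate operand.
#[derive(Copy, Clone)]
struct Instr {
    handler: fn(&[Instr], usize, i64) -> i64,
    imm: i64,
}

/// Adds the immediate to the accumulator, then dispatches to the next
/// instruction. The trailing call is in tail position, so the optimizer
/// can turn it into a jump instead of growing the call stack.
fn op_add(code: &[Instr], ip: usize, acc: i64) -> i64 {
    let acc = acc.wrapping_add(code[ip].imm);
    let next = ip + 1;
    (code[next].handler)(code, next, acc)
}

/// Terminates execution by simply returning the accumulator.
fn op_halt(_code: &[Instr], _ip: usize, acc: i64) -> i64 {
    acc
}

fn run(code: &[Instr]) -> i64 {
    (code[0].handler)(code, 0, 0)
}
```

Without a language-level guarantee, a sufficiently long instruction sequence could still overflow the stack in debug builds, which is exactly why guaranteed tail calls matter here.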

Robbepop avatar Sep 05 '22 14:09 Robbepop

Another reason why this optimization might yield useful results is that our current enum based bytecode definition is 16 bytes in size, where 8 of those 16 bytes are taken up by the enum discriminant (including alignment padding) even though it would ideally fit in just a single byte. Therefore we have a wasteful overhead of 7 bytes per wasmi instruction.

If instead of enum discriminants we store function pointers directly to the handlers of the respective wasmi instructions, we'd still use the same 16 bytes per instruction but could remove one indirect branch, namely the giant match in the interpreter's hot path, since the handler pointer is already given directly.
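The size argument can be checked with `size_of` on simplified stand-in types (these are illustrative, not the real wasmi definitions). On a 64-bit target, both layouts occupy 16 bytes, but the function-pointer layout spends its first 8 bytes on something directly callable rather than on a tag to `match` on:

```rust
use core::mem::size_of;

/// Stand-in for the enum based bytecode: an 8-byte payload forces
/// 8-byte alignment, so the discriminant plus padding costs a full
/// extra 8 bytes even though the tag would fit into one byte.
#[allow(dead_code)]
enum EnumInstr {
    Const(u64),
    I32Add,
}

/// Stand-in for the function-pointer layout: same 16 bytes total,
/// but the first 8 bytes are the handler pointer itself.
#[derive(Copy, Clone)]
union Params {
    none: (),
    value: u64,
}

#[allow(dead_code)]
struct PtrInstr {
    handler: fn(&Params),
    params: Params,
}
```

So the function-pointer encoding does not shrink the bytecode; its benefit is purely the removed dispatch indirection described above.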

This WAMR paper talks about this optimization in particular for the WAMR Wasm fast interpreter and reports significant speed-ups.

Robbepop avatar Oct 02 '22 14:10 Robbepop

I created this proc macro crate that generates code according to the idea of this issue: https://github.com/Robbepop/union-fn There are also some interpreter based benchmarks to compare performance: https://github.com/Robbepop/union-fn/tree/main/benches

The benchmark results are sad: https://github.com/Robbepop/union-fn/pull/4

The clear indication is that this trick will only work (if at all) once Rust supports guaranteed tail calls. Without guaranteed tail calls we will always end up with better performance from our current simple loop+match instruction dispatch technique.

Robbepop avatar Jan 16 '23 19:01 Robbepop

I usually do not link to YouTube videos but this one is really foundational to the optimization of interpreter logic and I want to preserve it by referencing the talk in this issue: https://www.youtube.com/watch?v=EIVZwHrJLLk

The presenter clearly demonstrates why we ideally want to have guaranteed tail calls in Rust.

Robbepop avatar Jan 31 '23 19:01 Robbepop

The approach proposed by this issue is also called a direct-threaded interpreter architecture. My own research as well as some (old) scientific papers suggest that this approach is not superior to the so-called indirect-threaded interpreter dispatch architecture. Therefore I think it is okay to close this issue. If new findings indicate that this conclusion might be incorrect, we can simply reopen.

Robbepop avatar May 23 '23 13:05 Robbepop