wasmtime Out-of-band fuel metering

Out-of-band fuel metering

Open pepyakin opened this issue 2 years ago • 12 comments

This is a prototype of the solution for https://github.com/bytecodealliance/wasmtime/issues/4109 [^1]. This is very rough and is not intended to be landed as is. Rather, this PR here is to validate that this approach is sensible and to source some preliminary feedback.

[^1]: I decided to change the name from slacked metering. First, I wanted to avoid mentioning async, so that's why it's not async fuel metering, since that async is orthogonal to the async currently used in wasmtime (as in Func::call_async). At the same time I don't know if slacked conveys the meaning (English is not my mother tongue). So I figured that that "out-of-band metering" is a better name. I contract it to outband in code, I assume it's fine since I saw people using it elsewhere. Please let me know if you have a better name!

Introduction

The regular fuel metering holds the fuel in the vmctx->rumtime_limits->fuel_consumed. At the beginning of each function, the value is loaded into a local variable. Roughly every basic block the value is increased with the cost of that basic block. The value is checked for overflow at function entries and loop headers. If fuel overflowed, then a certain libcall handles it. Before leaving a function (normally, through a call or before a trap), the fuel is dumped into the VMRuntimeLimits.

WIth the out-of-band fuel metering, the fuel is now promoted to a dedicated register tapping to the pinned_reg cranelift feature. The value is still increased every basic block, but the value does not leave the pinned register within wasm. Only at the wasm-host boundaries, i.e. trampolines & libcalls (not implemented as of this PR, coming later), is the fuel value loaded in or flushed from the pinned register into the VMRuntimeLimits.

Also, no checks are performed in the wasm. The checks are meant to be performed either when crossing the wasm-host boundary or asynchronously. Specifically, on Linux, the check is performed by sending a signal each, e.g., 1ms. The signal handler checks if the signal came from the wasmtime (on a best effort basis) and if the program counter points at some the JIT code. If it does, then that means the pinned register holds the currently consumed fuel value. If the fuel value is overflown, we bail out unwinding the wasm stack.

This kind of mechanism showed a great improvement in performance on our tests while still being deterministic as long as the in-wasm state is irrelevant in the case of the OOG.

Now, the prototype here right now targets x86_64 Linux. There is a plan to support aarch64 and macOS. Windows should also be possible to implement. The prototype does not support async. It would be great to support it, but additional work is required.

Implementation Notes and Rationale

Mutex

Right now, before entering we save the tid of the calling thread. This is because theoretically the store can be called from different threads. I also wanted to prepare for the async: potentially the future can be polled on any thread, with each fiber switch we can find ourselves on a new thread.

The problem is with Linux, it turns out that sending signals is a bit of a hassle. The signal's sender cannot know if the destination thread is dead or alive. Moreover, the tid can theoretically be reused and thus a signal could be sent to the wrong thread.

At first, I thought it might be a problem performance-wise, but now I don't think so. The reason is: that the mutex does not get too contended. The mutex is taken on wasm entry & exit and also during the out-of-band fuel check. The latter also uses try_lock. The interesting case is when the wasm tries to exit to host but the mutex is held: in that case, the exit will be delayed until the out-of-band check request is finished.

`rt_tgsigqueueinfo`

I resorted to using a raw syscall rt_tgsigqueueinfo on Linux to send the signal.

I thought about using pthread_sigqueue (in constrast to just pthread_kill) because it allows to send a sival. This is helpful to tell if the signal is coming from wasmtime or not. However, turns out that at least glibc does a bunch of syscalls that we probably don't want to have inside of the out-of-band fuel check request. So I decided to go straight for rt_tgsigqueueinfo. It takes the siginfo_t but it seems like the kernel does not use that and passes it as is, so I used this opportunity pass dummy values.

Another potential problem that I am not sure needs to be tackled: the pid is cached during the creation of the out-of-band check handle. This is not entirely correct since theoretically, it can change, but I figured it does not warrant worrying.

Future Work

If this gets a green light, then several things will need to be done in the future:

As I mentioned above it should work on other platforms, namely aarch64, macOS, and possibly Windows.

Then, make it work with async. That is actually a bit tricky. The main problem revolves around handling the yields. Specifically with Linux, if a signal handler interrupted the wasm code and figured it was OOG, it should yield the execution. Not sure how good is the idea to switch fibers from inside a signal handler. With macOS/Windows it's not any better: the check thread should manipulate the target thread so that it's possible to switch the fiber.

In case, that works, we can think of applying the same technique we use here for a high-performance substitution for the epoch interrupts.

Big thanks to @alexcrichton & @cfallin for their great support !

Jul 19 '22 17:07 pepyakin

Subscribe to Label Action

cc @peterhuene

This issue or pull request has been labeled: "wasmtime:api", "wasmtime:config"

Thus the following users have been cc'd because of the following labels:

peterhuene: wasmtime:api

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

Learn more.

Jul 19 '22 18:07 github-actions[bot]

Label Messager: wasmtime:config

It looks like you are changing Wasmtime's configuration options. Make sure to complete this check list:

[ ] If you added a new Config method, you wrote extensive documentation for it.

Our documentation should be of the following form:

Short, simple summary sentence.

More details. These details can be multiple paragraphs. There should be
information about not just the method, but its parameters and results as
well.

Is this method fallible? If so, when can it return an error?

Can this method panic? If so, when does it panic?

# Example

Optional example here.

[ ] If you added a new Config method, or modified an existing one, you ensured that this configuration is exercised by the fuzz targets.

For example, if you expose a new strategy for allocating the next instance slot inside the pooling allocator, you should ensure that at least one of our fuzz targets exercises that new strategy.

Often, all that is required of you is to ensure that there is a knob for this configuration option in wasmtime_fuzzing::Config (or one of its nested structs).

Rarely, this may require authoring a new fuzz target to specifically test this configuration. See our docs on fuzzing for more details.
[ ] If you are enabling a configuration option by default, make sure that it has been fuzzed for at least two weeks before turning it on by default.

To modify this label's message, edit the .github/label-messager/wasmtime-config.md file.

To add new label messages or remove existing label messages, edit the .github/label-messager.json configuration file.

Learn more.

Jul 19 '22 18:07 github-actions[bot]

Thank you for taking a look!

Yes, @alexcrichton, this PR is very raw right, way much less that I prefer right now. Sorry about that. I was eager to create it because of two reasons:

I wanted to check if the general approach is what we want to,
the actual trigger, was that I discovered Nick's PR #4431.

Specifically, to give the context for this question https://github.com/bytecodealliance/wasmtime/pull/4431#discussion_r924804070. I was scared that it made this PR impossible or very hard to implement.

Right now, I am rebasing this PR onto Nick's and trying to figure out what to do with the trampolines. The best I came up with so far is under the spoiler below. I hope my programming license won't be revoked.

struct VMRuntimeLimits {
    // ...

    /// A flag that indicates if the out-of-band fuel metering is enabled.
    pub outband_fuel: UnsafeCell<bool>,

    /// host-to-wasm return address.
    pub orig_entry_trampoline_return_address: UnsafeCell<usize>,

    /// wasm-to-host return address.
    pub orig_exit_trampoline_return_address: UnsafeCell<usize>,
}

// ...

asm_func!(
    "host_to_wasm_trampoline",
    "
        .cfi_startproc simple
        .cfi_def_cfa_offset 0

        // Load the pointer to `VMRuntimeLimits` in `r10`.
        mov r10, 8[rsi]

        // Check to see if this is a core `VMContext` (MAGIC == 'core').
        cmp DWORD PTR [rdi], 0x65726f63

        // Store the last Wasm SP into the `last_wasm_entry_sp` in the limits, if this
        // was core Wasm, otherwise store an invalid sentinal value.
        mov r11, -1
        cmove r11, rsp
        mov 40[r10], r11

        // Check if the out-of-band fuel metering is enabled. If it is not enabled, skip to the
        // tail call directly.
        cmp QWORD PTR 48[r10], 0x00
        je host_to_wasm_trampoline__perform_tail_call

        // OK, here we know that the out-of-band fuel metering is enabled. 
        //
        // Dump the `fuel_consumed` into the fuel register.
        mov r15, 8[r10]

        // Next, we have to do some shenanigans. Below we call into the wasm with a tail call, to
        // not mess with the stack layout. But at the same time, after returning from wasm, we
        // have to our fuel register r15 back into `fuel_consumed` of `VMRuntimeLimits`. To do that
        // we stash the original return address (which is currently located at `[rsp]`) that points
        // back to the caller from the host side and change it with the continuation defined 
        // below. The continuation will dump the r15 back into the memory and unstash the original
        // return address and jump at it.

        mov r11, [rsp]
        mov 56[r10], r11
        lea r11, [rip+host_to_wasm_trampoline__cont]
        mov [rsp], r11

        host_to_wasm_trampoline__perform_tail_call:

        // Tail call to the callee function pointer in the vmctx.
        jmp 16[rsi]

        host_to_wasm_trampoline__cont:

        // Here we're in a inconvenient situation, where the callee's frame is popped and all the
        // arguments are gone. We have to improvise our way out of this situation.

        // Open a temporary stack frame.
        //
        // The frame layout:
        //
        // rsp +  0: The original return address. Written by `wasmtime_outband_fuel_on_entry`.
        // rsp +  8: r15
        // rsp + 16: rax
        // rsp + 24: padding
        // rsp + 32: xmm0
        sub rsp, 48
        mov 8[rsp], r15
        mov 16[rsp], rax
        movdqu XMMWORD PTR 32[rsp], xmm0

        // Finally, move the current rsp into the first argument.
        mov rdi, rsp
        call wasmtime_outband_fuel_on_entry

        // Load the original return address into r10
        mov r10, 0[rsp]

        // Restore the initial values of rax and xmm0.
        mov rax, 16[rsp]
        movdqu xmm0, XMMWORD PTR 32[rsp]

        // Release the allocated stack area.
        add rsp, 48

        // Jump at the original return address.
        jmp r10
        
        .cfi_endproc
    ",
);


/// Dumps fuel into the `VMRuntimeLimits->fuel_consumed` and unstashes the original return address
/// for the trampoline.
///
/// Only used if out-of-band fuel is used.
#[no_mangle]
pub unsafe extern "C" fn wasmtime_outband_fuel_on_entry(sp: *mut usize) {
    use crate::traphandlers::tls;
    tls::with(|info| {
        let info = match info {
            Some(info) => info,
            None => {
                // Entering trampoline must set the TLS.
                std::hint::unreachable_unchecked();
            }
        };

        #[repr(C)]
        struct StackArgs {
            return_address: usize,
            fuel: i64,
        }

        let args = sp as *mut StackArgs;

        (*args).return_address = *(*info.runtime_limits())
            .orig_entry_trampoline_return_address
            .get();
        *(*info.runtime_limits()).fuel_consumed.get() = (*args).fuel;

        if (*args).fuel > 0 {
            // Fuel ran out and we bail through unwinding.
            crate::traphandlers::raise_lib_trap(wasmtime_environ::TrapCode::Interrupt);
        }
    });
}

This seems to be working and I hope I am not missing anything. I am in the process of replicating the same construction for the wasm-to-host and libcall trampolines.

You can see there it should address the concerns you expressed wrt not checking the fuel. To be clear, the trampoline generation in compiler.rs will also be removed, since the common variadic asm trampoline (what's a better name to distingiush it from the ones generated in compiler.rs?) takes care of that.

I ack the other notes. I hope the next push will feature not as raw code as this one. Also will run the benchmarks on it.

Jul 22 '22 16:07 pepyakin

Oh no worries! I mostly want to make sure I'm clear on expectations and such, it's totally fine to push up work-in-progress PRs.

Also wow that is some gnarly assembly. I... can see how it might work though! I suspect one day we're going to need to severly improve our trampoline story, even with "just maintain frame pointers" it was already complicated enough... In any case though not a blocker for this work, just something for us to ponder.

Jul 22 '22 19:07 alexcrichton

One thing I just thought of I want to note before I forget about, right now "is this a wasm pc" is pretty crude where it just checks if the instruction pointer is in the code section of a loaded module (this is preexisting in Wasmtime) but I think that we'll want to enhance that because if an interrupt happens during an entry trampoline we can't guarantee that the fuel register is configured yet, so we'll want to make sure that the "is this a wasm pc" check may exclude trampolines.

Jul 25 '22 15:07 alexcrichton

Oh, yes, definitely. On top of that, the query should be a bit more smarter. It should only return true if pc is in a wasm module that defines r15 as fuel, not just any module. That should come later.

I've been battling with the trampolines for quite some time now, and it does not seem good.

The problem is I am not sure how to make the system unwinder work through this trampoline. So at first, I was getting the SIGABRT from the unwinder since it couldn't find the CFA. I fixed that by replacing cfi_def_cfa_offset 0 with .cfi_def_cfa rsp, 0. That kind of worked, but then the unwinder started getting stuck in the infinite loop.

I fixed that by slapping .cfi_undefined 16 before the tail call. That worked but, expectedly, now the unwinder can't get past the trampolines, and the stack traces (e.g. created by anyhow) are chopped.

We need to annotate where we get the original return address to fix that. The return address is stored behind TLS. There is no way, we can come up with a DWARF expression that fetches it from the VMRuntimeLimits.

Jul 25 '22 15:07 pepyakin

Ah, regarding the fuel register being initialized, I think it should already work. Here we ask for the DefinedFunctionIndex which is available only for functions defined by a wasm module. If pc falls on a trampoline that would return None.

Jul 25 '22 16:07 pepyakin

I'm pretty sure our dwarf unwind information is not "async correct" where every single ip in a function guarantees a correct backtrace. With the fast-stack-walking branch and fp-based unwinds that may get better though.

Jul 25 '22 21:07 alexcrichton

I'm pretty sure our dwarf unwind information is not "async correct" where every single ip in a function guarantees a correct backtrace. With the fast-stack-walking branch and fp-based unwinds that may get better though.

Oh, sure, I remember that. It should be fine though since I was not planning on capturing the backtraces in the fuel interrupts, and I still keep in mind the landing pads idea.

The problem I am dealing with now is not about async traps. I should've been clearer on this. That SIGABRT would happen during capturing the stack trace iff the trampoline on the stack, and it went through the return address swizzling of the continuation.

I came up with a standalone minimized version of the problem which can be found under the spoiler:

// extern crate backtrace;

use std::arch;
use std::mem;

arch::global_asm!(
    "
    .p2align 4
    .global trampoline
    .type trampoline, @function

    trampoline:

    .cfi_startproc simple
    .cfi_def_cfa rsp, 0

    // move RDI, the original function, into r11
    mov r11, rdi

    // the return address pushed by the `call` instruction calling this trampoline
    // is located at *rsp. Move the value to rdi.
    mov rdi, [rsp]

    // Then, replace the return value with the continuation at the label `cont`.
    lea r10, cont[rip]
    mov [rsp], r10
    
    //
    // Important!! 
    //
    // Mark the register 16, aka the return address, as undefined. That stops the unwinder here.
    // If that was not here, the unwinder would fall into an infinite loop.
    //
    // An unfortunate side-effect is the stack traces are also chopped here.
    .cfi_undefined 16

    // Finally, perform the jump.
    jmp r11

    cont:

    jmp rax

    .cfi_endproc

    .size trampoline, .-trampoline
    "
);

extern "C" {
    fn trampoline();
}

type Func = extern "C" fn(*const (), usize) -> *const ();

#[inline]
unsafe fn prepare_trampoline(dest: Func) -> (*const (), Func) {
    let dest = dest as *const Func as *const ();
    let func = mem::transmute_copy(&(trampoline as usize));

    (dest, func)
}

fn main() {
    unsafe {
        let (dest, f) = prepare_trampoline(host_func);
        f(dest, 48);
    }
}

extern "C" fn host_func(rdi: *const (), rsi: usize) -> *const () {
    println!("{}", rsi);
    let mut count = 0;
    backtrace::trace(|frame| {
        let ip = frame.ip();
        println!("{:X}", ip as usize);

        if count == 100 {
            // likely we stuck in a loop. Break into the debugger.
            unsafe {
                arch::asm!("ud2");
            }
        }
        count += 1;

        true
    });

    // we don't have such luxury in the real thing.
    rdi
}

If you remove the .cfi_undefined 16 directive, the unwinder would get stuck in an infinite loop. If you remove the .cfi_def_cfa rsp, 0 directive, or even change to .cfi_def_cfa_offset 0 (Does this require attention in https://github.com/bytecodealliance/wasmtime/pull/4431?) found in the original version, then you would get SIGABRT:

* thread #1, name = 'cfi-dwarf', stop reason = signal SIGABRT
  * frame #0: 0x00007ffff7e3abaa libc.so.6`raise + 202
    frame #1: 0x00007ffff7e25523 libc.so.6`abort + 255
    frame #2: 0x00007ffff7dd5b40 libgcc_s.so.1`uw_update_context_1 + 1008
    frame #3: 0x00007ffff7dd5cd1 libgcc_s.so.1`uw_update_context + 17
    frame #4: 0x00007ffff7dd68bd libgcc_s.so.1`_Unwind_Backtrace + 93
    frame #5: 0x000055555555cfa1 cfi-dwarf`backtrace::backtrace::trace_unsynchronized::hf1601be64f8e51c8 [inlined] backtrace::backtrace::libunwind::trace::h80afc20a98e10b83(cb=&mut dyn core::ops::function::FnMut<(&backtrace::backtrace::Frame), Output=bool> @ 0x00007fffffff9d58) at libunwind.rs:93:5
    frame #6: 0x000055555555cf8c cfi-dwarf`backtrace::backtrace::trace_unsynchronized::hf1601be64f8e51c8(cb={closure_env#0} @ 0x00007fffffff9d40) at mod.rs:66:5
    frame #7: 0x000055555555d038 cfi-dwarf`backtrace::backtrace::trace::h4ba2f0e531a2363e(cb={closure_env#0} @ 0x00007fffffff9da0) at mod.rs:53:14
    frame #8: 0x000055555555d1d3 cfi-dwarf`cfi_dwarf::host_func::h89c8978f2a6c15a6(rdi=0x000055555555d148, rsi=48) at main.rs:74:5

Jul 26 '22 09:07 pepyakin

Oh right yeah with the trampoline you gisted above I was wondering how backtrace info would work out and it seems like it doesn't. I don't know how to solve that myself, the trampoline there is quite gnarly.

Jul 26 '22 16:07 alexcrichton

Oh right yeah with the trampoline you gisted above I was wondering how backtrace info would work out and it seems like it doesn't. I don't know how to solve that myself, the trampoline there is quite gnarly.

If wasmtime-flavoured ABIs enshrined the caller vmctx and guaranteed it so that it is still usable after the callee returns, it would've made things simpler: no need for reaching to tls for the original return address. The dwarf expression would require a few dereferences but otherwise seems to be doable.

If you (or anyone from BA) think that's acceptable, I can try to research this implementation strategy.

Otherwise, I think I ran out of ideas how to make this work.

Jul 26 '22 16:07 pepyakin

Yeah unfortunately I don't have a great answer myself on that. I feel like with this and the stacks PR our trampoline story is pretty lacking, but I don't know how best to rectify it. Short of that it may be that a pinned register won't work at all until we figure out a better story for trampolines. Ideally all the pinned register/fuel stuff would be part of cranelift-compiled trampolines and we wouldn't have to tinker with global_asm! blobs at all. We're definitely not in a place to easily be able to do that right now though.

Jul 26 '22 19:07 alexcrichton

wasmtime wasmtime copied to clipboard

Out-of-band fuel metering

Introduction

Implementation Notes and Rationale

Mutex

rt_tgsigqueueinfo

Future Work

Subscribe to Label Action

Label Messager: wasmtime:config

wasmtime
wasmtime copied to clipboard

`rt_tgsigqueueinfo`