lucet Exposing a fast version of Instance::run that reduces number of checks

Context This is part of a series of bugs that I spoke to @tyler @pchickey about. We are currently using Lucet to sandbox libraries in C++ applications. The idea behind this is that using a wasm sandboxed version of the library allows ensuring that a memory safety issue in the library does not automatically result in a memory safety vulnerability in the full application. One of the consumers of this work is the Firefox web browser.

Problem Currently the function invocation API in Lucet Instance::run performs a lot of runtime checks that add a lot of overhead each time a WASM module function is invoked from the application. While the cost of isolation is the most obvious part of the overhead, we found that applications are often more sensitive to the transition cost of an invoke (consider the example of an application with a tight loop invoking a tiny function in the wasm library). Thus I wanted to start a conversation on optimizing the function invocation API.

Currently here are some of the sources of overhead in function invocation, in no particular order

Symbol resolution via get_export_func
Null pointer checks for the function pointer
State checks to see if the module has faulted
Function signature lookup and checking against argument lengths
Loops and branches in Context::init to set the correct register/spill according to the calling x64 convention
Retrieving/copying both integer (rax) and floating (xmm0) registers on every function call.

For the library sandboxing case, the above overheads can be removed without sacrificing safety. More descriptions on how/why are included below. To some extent I am trying to figure out

which of these techniques are useful outside the library sandboxing use case, and thus which would be useful to include in Lucet as the default approach
which of these are reasonable to expose via a set of unsafe API.

Here is a more full description of how and why we can eliminate the above overheads

Symbol resolution - In the library sandboxing case, there are some functions in the WASM module that are called in a loop. Therefore, exposing a version of Instance::run that takes a function pointer and skips symbol resolution (essentially similar to the private func Instance::run_func) would extremely useful, as we can perform the symbol resolution once and not pay the cost for each invocation
Function null pointer checks - for the library sandboxing case, we statically know the list of functions in the WASM module that will be invoked by the application. Thus the chances of the application performing a symbol resolution resulting in a null is close to 0. Even if this happens, it seems alright to allow the signal handler to account for this use case instead of adding an extra check inline.
State checks to see if the module has faulted - At least in the library sandboxing use case, the application and wasm sandboxed library effectively share a state of being "faulted" i.e. a fault in the wasm sandboxed effectively calls abort on the application. Thus there would never be a case where we call Instance::run when the module is faulted. It would be good to expose an API that optimizes for this use case as well.
Function signature lookup - for the library sandboxing case, we statically know the function being invoked. Additionally, since we are focusing on C++ applications, we have headers with function signatures. This effectively means, we can perform all the signature and type checking for function invocation completely statically via C++ templates and don't have to pay any runtime costs (if you are curious this is implemented here). In the future, if we want to extend this to rust, a similar approach is also feasible.
Correct register assign/spill (x64 convention), retrieving/copying both integer (rax) and floating (xmm0) registers on every function call - for the library sandboxing case, we statically know the function being invoked. Thus with C++ templates (or rust macros), we can generate optimized stubs for each function directly at the call site. I do have some hacky prototypes demonstrating some of this, which I can share if you are curious. This should eliminate a bunch of code from Context::init, the use of UntypedRetVal struct (which given that this is a struct with a float and int has to be transferred via the stack in the emitted code). Critically, this should also set the stage for generating an optimized context switch (Context::swap) per function. Note that an optimized Context::swap is probably a longer discussion. An optimized Context::swap is more sensitive to Cranelift bugs. Thus we may want to expose the current version of Context::swap just as a defense in depth measure. I'm happy to expand on this if needed.

Actions I think the first step is probably to figure out what would be a good design of a low level optimized invocation API. So if you have any thoughts/wanted to discuss some of the above points, that is perfect! If we can figure out a good design for the invoke API, I would happy to contribute a patch for implementing the same.

Sep 05 '19 02:09 shravanrn

Providing an alternate interface that doesn't do argument type checking has been on our list for a while, so getting an external request for it as well convinces me we should do it. Likewise, amortizing the cost of symbol lookup also would be straightforward by, as you say, exporting run_func.

However, I'm very skeptical that removing the null check and the state checks would make an appreciable performance difference to outweigh the peace of mind that these checks provide. I'm open to being proved wrong here.

Correct register assign/spill (x64 convention)

Does this mean we're doing something wrong with the ABI currently?

Thus with C++ templates (or rust macros), we can generate optimized stubs for each function directly at the call site. I do have some hacky prototypes demonstrating some of this, which I can share if you are curious.

I'd be curious to take a look at this, because I don't think I'm really following how this would work in a general-purpose runtime interface. This sounds like code would need to be generated for each specific function type that we call, which in our use case can't be known ahead of time.

Sep 05 '19 20:09 acfoltzer

However, I'm very skeptical that removing the null check and the state checks would make an appreciable performance difference to outweigh the peace of mind that these checks provide. I'm open to being proved wrong here.

Yup, I agree, I was mostly mentioning this to be thorough... In our performance testing, this is definitely something that is not that significant

Does this mean we're doing something wrong with the ABI currently? ... specific function type ... in our use case can't be known ahead of time.

Sorry, my language there was confusing. Everything with the ABI looks good. I just meant, since there is exactly one function that does this for all possible signatures, this function has to account for every possible thing - clear all registers, copy all parameter registers, copy all return registers etc. independent of what is used. However, from your description, it seems that you genuinely rely on this as you need the ability invoke functions whose signature is known only at runtime?

To be clear, I do agree that the generation of specialized stubs would definitely be outside the scope of lucet (I have a C++ template library to help with this), but I'm hoping that lucet can expose enough of the internals so that this becomes an option via external tools such as the aforementioned template library?

This sounds like code would need to be generated for each specific function type that we call

In short, yup! Let me make the case for why this is something we care about for library sandboxing (and perhaps is an additional use case that lucet should consider for the longer term). The downside to the one size fits all function stub both is overhead... Consider invocation of the following function

int getVal() { return 5; }

the runtime still has to copy all 6 integer parameter + floating point parameter registers, copy the floating point return registers. For small functions like the above, the relative overhead starts to get huge. This overhead can get even worse, if the function ABI from cranelift starts allowing parameters in the various SIMD registers as well, which means run_func now has to copy/clear each of these on every invocation. An observation on SFI tooling that I've seen repeatedly during the course of research into tools such as Native Client is that the overhead of one size fits all function invocation stubs increases over time, as the tooling starts to support more complex features of the underlying processor.

However, I believe these 2 approaches, general and specialized, can co-exist... In scenarios where the signature is known, the specialized version could be preferred as it significantly reduces invocation overhead, while the general purpose version can be used if the signature is known only at runtime.

Please let me know your thoughts!

Sep 06 '19 16:09 shravanrn

Gotcha, thanks for the extra context! It all makes sense now, and I can understand why you'd want that capability.

I'll do some more thinking on it, but at first blush I think supporting this would require quite a bit of surgery on the runtime in a way that I don't think we'd even be able to take advantage of from Rust. We could try to implement a macro to generate a version of run_func specialized to the number and types of arguments, but without Rust supporting inline assembly I don't think we'd actually be able to generate an optimized Context::swap() without really going deep into proc macros.

If we wanted to support your use case from C++, what kind of API do you think would be needed? I'm guessing you'd basically want one call to do everything in run_func up to the point where swap is called, and then another call to do everything after the swap. Splitting up the functions like this unfortunately either would require duplicating code or making is easier to misuse from Rust, but I would be open to proposed designs.

Sep 06 '19 23:09 acfoltzer

Thanks for the response!

At a high level, I think my goal here is

Identify pieces that can be customized for a function signature - those parts will be done from the C++ side, via a library that I provide
Identify low overhead primitives low level that can be provided by the lucet-runtime written in rust that both support point 1) above and abstract away those details that the lucet-runtime reserves the right to change

To this end, I figure one way for us to figure out a design here is for me to identify the lowest level of primitives I could want... If you could then have a look and let me know which of these primitives break an abstraction that the lucet-runtime wants to maintain, that would be great! I figure a couple rounds of this process would help us land on something useful? So here are some of the primitives that could work

API for "setup before"/"teardown after" an invoke

API to attach/detach signal handlers - This will allow the embedding process to figure out when it wants to do this... rather than on each function call
API to set lucet internals for executing functions - this include setting the Instance state and to running or stopped and the thread_local current_instance
API for fault cleanup - An API that basically exposes the code under State::Fault

API for the actual "function invocation" (i.e. setting params, getting returns, optimized swap)

So the starting point here is that I could actually do this completely in C++ with a combination of templates, some inline assembly and some assembly in .asm files.
This requires me to make the following assumption about lucet - "the first parameter is reserved and is the address of the heap"
While I am happy to go with this approach, here are the 2 concerns I have
1. It could break if the lucet ABI changes here - for instance a change to what is being sent as the first parameter, or maybe someday a change where the contents of 2 parameters are now fixed (instead of just the first). For the former style of change, should we consider an API such as get_invocation_fixed_param, so the C++ library can be transparent to any changes in the first parameter? Another simple non technical alternative could be if you drop a line my way when making this change to lucet, I'm happy to keep my C++ library up to date :smile:
2. The changes here won't benefit Rust code - Seems a bit of a shame to make this specialized to C++ only... So if we can figure out ways to share some of this code, that would probably be great. However, an alternate viewpoint could be to use this C++ only implementation as a prototype - and then once built, we could revisit how to bring this to rust

Splitting up the functions like this unfortunately either would require duplicating code

Yeah, that makes sense.. I am hoping that some of the APIs I proposed should only involve splitting up some existing functions and not duplication... Hopefully the rust inliner would remove any performance overhead of splitting the invoke into separate functions

Splitting up the functions like this ... is easier to misuse from Rust

I did want to clarify a potential miscommunication here... Or maybe we are already on the same page, in which case, please ignore the below :smile:

For the rust (and C++) use case, I do want clearly separate the tasks of

providing low level unsafe primitives
and a library that uses this primitives and compile time checks/codegen to have a safe optimized invocation stub

If we do add such primitives, these low level API primitives will not preserve safety if misused. Thus, I don't believe it should ever be "recommended best practice" when using the lucet runtime, to use these primitives.

Rather, in both C++ and rust, I think these API primitives are meant to be used by an intermediate library that using C++ templates/rust macros to perform the required checks and perform codegen to invoke these primitives correctly. Once these libraries exist, the "recommended best practice" depending could be as mentioned earlier

In scenarios where the signature is known, use the library that uses static checks and the low level primitives as it significantly reduces invocation overhead,
In scenarios where the signature is known only at runtime, use the safe general purpose version

The goal is that neither of this interfaces should be possible to misuse :smiley:

While I know this is possible in C++, I think you've raised some great points regarding the challenges of such a library in rust, ... It is on my radar to spend some time fully investigating this, but this is a few months away. I am optimistic that something should be possible here though.

As always, thoughts/comments welcome!

Sep 08 '19 00:09 shravanrn

Sorry this is all taking so long, but this part is implemented in #454

API to attach/detach signal handlers - This will allow the embedding process to figure out when it wants to do this... rather than on each function call

Mar 18 '20 01:03 acfoltzer

Sounds good!

Mar 20 '20 03:03 shravanrn

lucet lucet copied to clipboard

Exposing a fast version of Instance::run that reduces number of checks

lucet
lucet copied to clipboard