wasmi icon indicating copy to clipboard operation
wasmi copied to clipboard

Allow resumable calls to recover after running out of fuel

Open tomaka opened this issue 2 years ago • 10 comments

As we all know, because of the halting problem, it isn't possible to know whether the execution of a Wasm function will finish in a finite time. While a Wasm function is being executed, maybe everything is okay and it is simply taking a long time, or maybe the Wasm execution is stuck in an infinite loop. It isn't possible to know in which of the two scenario we currently are.

Because the program that is interpreting the Wasm doesn't want, itself, to be stuck infinitely, it needs some sort of protection against the Wasm being stuck in an infinite loop. While the program interpreting the Wasm can't know whether the Wasm is stuck in an infinite loop, it can use heuristics. For example, if the Wasm has been executing for 10 seconds while normally it should take around 2ms, then it is likely to be stuck.

Wasmi provides a fuel system, making it possible to interrupt the execution of the Wasm after a certain number of instructions have been executed. This system is however not very flexible. It's all-or-nothing: when all the fuel has been consumed, the execution completely stops and can never be resumed.

Furthermore, it is difficult to map "fuel quantities" with real world time/CPU usage. A precise fuel number means that whether the execution will run out of fuel or not doesn't depend on the speed of the machine, which has the advantage to be deterministic. However, this is also a drawback: it is not possible to stop the execution after an approximate number of seconds. If you pick a certain fuel quantity, on some platforms this fuel will expire quickly and on some others it will expire very slowly. Fuel quantities are very arbitrary anyway, and I think that it is legitimate to want to stop after a certain number of real life seconds.

In order to solve that problem, I would suggest to make it possible to combine the call_resumable feature with the fuel system. When the fuel expires, call_resumable should return a Ok(ResumableCall::Resumable) (or something similar) instead of an Err. This would make it possible for the API user to choose what to do: either stop the execution, or add more fuel and resume the execution. This provides maximum flexibility to this feature. The API user could do things such as print a warning after a certain time, ask the human whether to continue or not, de-prioritize the thread executing the Wasm after a certain time, etc.

tomaka avatar Feb 27 '23 13:02 tomaka

Being able to use the fuel system together with call_resumable would be perfect. Would this be a difficult feature to implement? Perhaps I can find some time tonight to fork wasmi and implement this.

Also, if this feature is not possible for technical reasons, would it still be possible to let wasmi insert a function call to an imported function every n instructions? This way, we could at least have a system to yield from execution. This could help with using wasmi in an async context.

Lucky4Luuk avatar Feb 27 '23 13:02 Lucky4Luuk

@tomaka Thanks for writing the issue!

I have a few questions ...

  1. In order to implement this feature we need to adjust the current API fro resumable function calls as it won't make sense to query the host function that caused the call to be interrupted in the first place. Also for efficiency reasons we might choose to diverge. What is the information that users of resumable function calls actually require? Is function type of the failed function enough? (Minimal information.)

  2. I suppose resumable calls should still return error for normal Wasm traps and this issue is basically about implementing a special rule for OutOfFuel traps, right?

To answer some questions:

At the moment fuel is somewhat arbitrary. However, there are plans to adjust our fuel costs to a similar policy as to what Wasmtime does in where basically one instruction roughly costs 1 fuel to execute. Currently we are trying to do a more precise job but since this can only be a heuristic at best it is better to make this even simpler just like Wasmtime did.

@Lucky4Luuk I think it should be technically feasible to implement this feature. The only actual question is if it is possible to implement this feature without massive performance regressions. We have been dealing with lots of those performance regressions for various reasons lately. Still trying to figure out why. At the moment most of the currently open PRs cannot be merged due to weird performance regressions.

For the same reason we probably won't implement a feature to call a host function every X instructions. First this is heavily underspecified and secondly host function calls are among the most expensive things you can do in wasmi. So we usually try to minimize those if possible.

Robbepop avatar Feb 27 '23 22:02 Robbepop

In order to implement this feature we need to adjust the current API fro resumable function calls as it won't make sense to query the host function that caused the call to be interrupted in the first place. Also for efficiency reasons we might choose to diverge. What is the information that users of resumable function calls actually require? Is function type of the failed function enough? (Minimal information.)

Well, in smoldot I also need the parameters that were passed to the host function. Again, the idea is simply to implement the body of the host function "externally" (which makes it possible for example to perform asynchronous operations) and not within a callback.

I suppose resumable calls should still return error for normal Wasm traps and this issue is basically about implementing a special rule for OutOfFuel traps, right?

That's your choice. I suppose that every error that can be resumed from could lead to Ok(ResumableCall::Resumable) instead of Err. Of course, it would then need to be explained which ones are and which ones aren't, or you could change Trap to be enum Trap { Resumable(ResumableTrap), NonResumable(NonResumableTrap) }.

At the moment fuel is somewhat arbitrary. However, there are plans to adjust our fuel costs to a similar policy as to what Wasmtime does in where basically one instruction roughly costs 1 fuel to execute. Currently we are trying to do a more precise job but since this can only be a heuristic at best it is better to make this even simpler just like Wasmtime did.

Note that this is not really related to my explanation. "One instruction == one fuel" is just as arbitrary, because I assume that nobody is going to calculate the amount of fuel to pass to wasmi by precisely counting how many instructions are necessary (it would be a relatively bad idea since that number depends on the goodwill of the compiler). Instead, you just pass an arbitrary fuel value and increase it if it's not enough.

tomaka avatar Feb 28 '23 08:02 tomaka

Well, in smoldot I also need the parameters that were passed to the host function.

Technically you could embed those parameters into the host error that caused the interruption of the resumable call, right? The current API does not offer this functionality (yet) and I am not sure it will because it would introduce serious overhead for resumable call handles. It might still be a good idea to implement this if the majority of use cases require this information.

or you could change Trap to be enum Trap { Resumable(ResumableTrap), NonResumable(NonResumableTrap) }.

That's partially what already happens in the Engine not visible to users. However, to me it is important that the resumable call API does not crawl on top of every other abstraction in wasmi but is kept isolated.

I will think about a possible API to extend resumable calls with OutOfFuel traps and post it here once I am done. However, there is other work to be done beforehand so it will probably take a little while. Until then you are free to be posting your idea of what you'd expect of a resumable call API that would satisfy all your needs as of now. :)

Robbepop avatar Mar 01 '23 08:03 Robbepop

@tomaka How pressing is fixing this issue for smoldot?

Robbepop avatar Mar 11 '23 11:03 Robbepop

It's not pressing.

The objective is simply for smoldot to be able to resist to malicious chain. At the moment, if you connect smoldot to a malicious chain, this chain can freeze smoldot. However, I don't see a legitimate use case where you want to connect to a chain without knowing whether it's malicious. Typically, you connect a light client to a chain that you know, not an unknown chain. Even if it exists, all the malicious chain can do is a be a small nuisance by freezing smoldot, and it's doubtful why someone would bother with that kind of attack in the first place.

tomaka avatar Mar 11 '23 12:03 tomaka

It's not pressing.

The objective is simply for smoldot to be able to resist to malicious chain. At the moment, if you connect smoldot to a malicious chain, this chain can freeze smoldot. However, I don't see a legitimate use case where you want to connect to a chain without knowing whether it's malicious. Typically, you connect a light client to a chain that you know, not an unknown chain. Even if it exists, all the malicious chain can do is a be a small nuisance by freezing smoldot, and it's doubtful why someone would bother with that kind of attack in the first place.

Okay thanks a lot for this info! I am going to fix this eventually and properly but it is good to know that you are not blocked by this issue for now.

Robbepop avatar Mar 11 '23 12:03 Robbepop

Hello. Not to necro this thread but I have a viable use case for this in my operating system AbleOS. I'd like to have this functionality.

AbleTheAbove avatar Apr 05 '23 01:04 AbleTheAbove

This is currently blocked mostly because of two things:

  1. I am not yet super confident that I have a decent enough understanding of the technical requirements of users (you). What I'd need here is the exact semantics and what behaviors you'd expect for what you need and in the best case even an example code for your personal use case.
  2. I don't yet know how this can be implemented as zero-cost abstraction or with minimal overhead in general. I am sure it can be done though.

Robbepop avatar Apr 05 '23 08:04 Robbepop

The goal: I'd like for the wasm engine to be pauseable so I may do task switching.

I have two main ways that I think my goal could be accomplished.

I think there are two ways this can be done:

  • Fuel metering with resuming
  • A tick based system

And if both are implemented I think there is enough variation that all use cases can be covered.

My usecase would require a pause button that halts the engine at the next instruction, stores state and can later be unpaused.

Thank you for the quick response!

AbleTheAbove avatar Apr 05 '23 14:04 AbleTheAbove

I'm currently also looking for a resumable runtime, as described here.

I'd like to allow AI agents to do work in terms of an embedded language, but due to the halting problem I'd want to be able to ensure that code output by an agent can't get stuck indefinitely and I'd like to be able to give feedback about running code, without blocking the agent runtime.

Being able to run with some notion of a "tick" or iteratively via fuel metering would be great.

For now I'm currently leaning towards using https://github.com/kyren/piccolo which supports this kind of resumability but ideally I would much rather support a WASM runtime instead of Lua. Maybe it's possible to take some inspiration from the "stackless" design of Piccolo in case that could be applied here.

rib avatar Apr 23 '25 15:04 rib

Hi @rib, it is totally possible to change Wasmi's resumable function semantics to also include resumption upon running out of gas. This whole issue is mostly about API design and shouldn't be a huge technical issue once this is figured out. The largest problem I see is with current users of the resumable call API because of the changed semantics.

Robbepop avatar Apr 23 '25 18:04 Robbepop

I gave this some thought today and I think design wise this is quite a simple task. Basically we only have to extend the Resumable variant of the ResumableCall type.

One way is to change the ResumableInvocation struct to a Rust enum with 2 variants, where one represents the failure of a host function call (as is today) and the other represent that the Wasm call ran out of fuel. We need two different types to represent those cases since both cases need to be handled differently.

For the host function call failure we need to provide the result of the host function call that we are going to replace (as is today) and for the case where the call ran out of fuel we need to be able to refill the Store and simply resume the call ordinarily.

The new ResumableInvocation enum could look like this:

pub enum ResumableInvocation {
    HostTrap(ResumableInvocationHostTrap),
    OutOfFuel(ResumableInvocationOutOfFuel),
}

pub struct ResumableInvocationHostTrap {
    // same as `ResumableInvocation` today
}

pub struct ResumableInvocationOutOfFuel {
    // todo
}

Another design is to flatten the ResumableCall enum and add yet another variant:

pub enum ResumableCall {
    Finished,
    HostTrap(ResumableCallHostTrap),
    OutOfFuel(ResumableCallOutOfFuel),
}

I am not sure which design is more user friendly and future proof.

cc @rib @tomaka @AbleTheAbove feel free to share your ideas as (potential) users.

Robbepop avatar Apr 24 '25 20:04 Robbepop

This has been implemented and will be available in the next Wasmi release.

cc @rib @tomaka @AbleTheAbove

Robbepop avatar May 04 '25 10:05 Robbepop