
Typed continuations to model stacks

Open rossberg opened this issue 5 years ago • 68 comments

Motivation

Wasm currently lacks any support for multiple stacks of execution and switching between them. That prevents it from efficiently representing most control flow abstractions that appear in many modern programming languages, such as

  • lightweight (“green”) threads
  • coroutines
  • generators
  • async/await
  • effect handlers
  • call/cc

In the spirit of Wasm as a low-level engine, it's highly undesirable to add a multitude of ad-hoc features for each of these mechanisms individually. Instead, it's preferable to identify a sufficiently general mechanism for managing execution stacks that allows expressing all of them. More concretely, Wasm needs a way to represent, create, and switch between stacks.

Challenges

  1. In order to be able to express features like green threads and others, switching stacks must be possible at any stack depth, not just at the bottom of the stack (“deep coroutines”).

  2. In order to be able to express features like generators, it must be possible to pass values (back and forth!) when switching to a different stack.

  3. In order to be able to validate the usage of stacks, both stacks and the values being passed when switching must be properly typed.

  4. In order to inform the semantics of the interaction with other control flow, especially exceptions, it is desirable to have a semantic framework that can explain all such constructs.

  5. In order for multiple of the above features to be usable at the same time, they must be expressible in a composable fashion that allows arbitrary nesting.

  6. The mechanism must not require copying or moving stack frames, since some engines could not easily do that due to the heterogeneous nature of their stacks.

  7. It must not be possible to create cyclic references to stacks, since that would require GC.

Proposal

One known way to address the above challenges is to view stacks slightly more abstractly, through the notion of a delimited continuation. Broadly speaking, that represents “the rest of” a computation (a continuation) “up to” a certain point (delimited). Concretely, that certain point is the bottom of a specific stack, while the stack frames on the stack essentially describe the remaining tasks to be completed.

In particular, we can apply known ways to type such continuations, a problem that otherwise has no immediately obvious solution. One specific way of typing continuations and the values communicated back and forth is to follow the approach taken by so-called effect handlers, a modern way of representing delimited continuations: essentially, they are exception handlers with resumption, where both throw and resume can carry arguments.

As with asymmetric coroutines, switching between stacks then occurs through a pair of resume and suspend instructions. The suspend instruction “throws” a value akin to an exception — we call it an event below — which the resume handles. The event's tag, like a regular exception tag, determines the types of values passed from suspend to resume point, but also those passed back.

Details

The first central addition is a new form of resumable exception, a.k.a. event. Such an event is declared with a respective new form of exception definition (in the binary format this is an exception definition with a certain flag bit):

(event $evt (param tp*) (result tr*))

Unlike a regular exception, events can have return values.

The other central addition is a new (reference) type of continuations:

(cont $ft)

where $ft is a type index denoting a function type [t1*] -> [t2*]. Intuitively, this describes a suspended stack that is resumable with values of types t1* and will ultimately terminate with values of types t2*.

A new stack is created with the instruction

cont.new : [(ref $ft)] -> [(cont $ft)]

which takes a function reference to run on the new stack. Execution does not start until the continuation is first resumed.

The instruction

cont.resume (event $evt $handler)* : [t1* (cont $ft)] -> [t2*]

resumes such a continuation, by passing it the expected arguments t1*. Once the computation terminates, it returns with t2*. If the computation suspends before termination (see next instruction), with one of the event tags listed, then execution branches to the respective label $handler. This happens without unwinding the stack. The label receives the event arguments tp* and a continuation of type (cont $ft'), where $ft' is the function type [tr*] -> [t2*] of the next continuation.

Continuations are single-shot, that is, they cannot be resumed more than once. An attempt to resume an already resumed continuation traps.

A computation running on its own stack (i.e., initiated as a continuation), can suspend itself with the following instruction:

cont.suspend $evt : [tp*] -> [tr*]

where $evt is an event of respective type [tp*] -> [tr*]. Essentially, this passes control back to the parent stack. That is, like a throw, the suspension takes arguments tp* and transfers control to the innermost matching cont.resume, or rather, to its handler label. But unlike with regular exceptions, execution can also be resumed later, which returns control here with values of type tr*.

As described above, the innermost active handler matching the event tag receives the event's tp* and a new continuation. Resuming this continuation (with cont.resume) will therefore pass back tr* to the originating cont.suspend.

Resumption can nest, i.e., a continuation may itself resume another continuation. Consequently, an event may generally pass through multiple parent stacks before hitting a matching resume handler. The same chain of stacks is maintained when resuming.

Note how the exception tag ties together the types of the values passed in both direction between a matching suspend and resume. Different suspension points within the same computation can use different exception tags and thereby pass different types of values back and forth between parent and child stack.
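
Since no engine implements these instructions yet, the handshake can be sketched with an executable model. The following Python sketch is purely illustrative (the names `run`, `child`, and the string tags are invented here): a Python generator stands in for the child stack, `yield` for cont.suspend, and `send` for cont.resume, with the tag determining what flows in each direction.

```python
# Illustrative model only: a Python generator stands in for a child stack.
# A suspension is a (tag, payload) pair; the value passed to send() is
# what cont.resume passes back in at the suspension point.

def child():
    reply = yield ("ping", 1)             # suspend with tag "ping": int out, str back
    reply2 = yield ("pong", reply + "!")  # a different tag with different types
    return len(reply2)                    # the continuation's final result

def run(cont, handlers):
    """Drive a continuation to completion, dispatching each suspension
    to the handler registered for its tag (the cont.resume handler list)."""
    value = None                          # the first resume starts the body
    while True:
        try:
            tag, payload = cont.send(value)
        except StopIteration as done:
            return done.value             # normal termination
        value = handlers[tag](payload)    # handler computes the reply

result = run(child(), {
    "ping": lambda n: "got " + str(n),
    "pong": lambda s: s.upper(),
})
print(result)  # 6 (the length of "GOT 1!")
```

Note how each tag fixes both the payload type flowing out and the reply type flowing back, just as an event declaration's param and result types do.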

Continuations are first-class values that are allowed to escape a handler. That way, a stack can be kept live to be resumed at a later point. The stack’s lifetime ends when the final resume terminates regularly (or with an exception, see below). Managing the lifetime of continuations/stacks and evtrefs generally requires reference counting in the VM. However, it’s impossible for them to form cycles, so general garbage collection is not needed.

Finally,

cont.throw $exn : [te* (cont $ft)] -> [t2*]

aborts a continuation by injecting the (regular) exception $exn with arguments of type te* at the suspension point. This is a way to explicitly unwind the stack associated with the continuation. In practice, a compiler should make sure that every continuation is used linearly, i.e., either resumed or aborted with a throw. However, there is no simple way to enforce this in a type system, so Wasm validation won’t be able to check this constraint.
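
If one models a suspended stack with a Python generator, `generator.throw` gives a close analogue of this instruction: it injects an exception at the suspension point, and cleanup code on the suspended frames runs as they unwind. A hypothetical sketch:

```python
# Illustrative analogy: gen.throw() plays the role of cont.throw,
# injecting an exception at the suspension point of a suspended "stack".

cleanup_ran = []

def worker():
    try:
        yield ("request", 42)           # suspended here, waiting to be resumed
    finally:
        cleanup_ran.append(True)        # unwinding runs this cleanup code

cont = worker()
cont.send(None)                         # run up to the suspension point
try:
    cont.throw(RuntimeError("aborted")) # cont.throw analogue: unwind the stack
except RuntimeError:
    pass                                # the exception propagates back out
assert cleanup_ran == [True]            # the suspended frame was unwound
```

As discussed further down the thread, the continuation could in principle also catch the injected exception and keep running; in this model, the generator would simply yield again instead of propagating.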

Example: A simple generator

To see how these mechanisms compose, consider encoding a generator enumerating i64 integers from 0 up until a certain condition is met. The consumer can signal back this condition by returning a bool (i32) to the generator. This can be expressed by the following event.

(event $enum-yield (param i64) (result i32)) 

Here is the actual generator:

(func $enum-until (param $b i32)
  (local $n i64)
  (local.set $n (i64.const -1))
  (br_if 0 (local.get $b))  ;; initial condition already met?
  (loop $l
    (local.set $n (i64.add (local.get $n) (i64.const 1)))
    ;; yield $n; keep looping while the consumer's answer is false
    (br_if $l (i32.eqz (cont.suspend $enum-yield (local.get $n))))
  )
)

Here is one possible consumer, which runs the generator until a given max value is reached:

(func $run-upto (param $max i64)
  (local $n i64)
  (local $cont (cont (param i32)))
  (local.set $cont (cont.new (ref.func $enum-until)))
  (loop $l
    (block $h (result i64 (cont (param i32)))
      (cont.resume (event $enum-yield $h)
        (i64.ge_u (local.get $n) (local.get $max))
        (local.get $cont)
      )
      (return)  ;; generator terminated
    )
    ;; on $enum-yield: the yielded i64 and the next continuation are on the stack
    (local.set $cont)
    (local.set $n)
    ;; ...process $n...
    (br $l)
  )
)

Note how the handler $h takes both the i64 argument produced by the generator and the next continuation.
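
To cross-check the control flow, here is a hypothetical Python rendering of the same producer/consumer pair, with a generator standing in for the child stack (all names are invented for illustration; `yield` plays the role of cont.suspend $enum-yield and `send` of cont.resume):

```python
# Illustrative analogue of $enum-until / $run-upto above.

def enum_until(b):
    """Yield 0, 1, 2, ... until the consumer passes back a truthy condition."""
    n = -1
    if b:                         # initial condition already met?
        return
    while True:
        n += 1
        stop = yield n            # cont.suspend $enum-yield
        if stop:
            return

def run_upto(maximum):
    seen = []
    cont = enum_until(False)      # cont.new; the body has not started yet
    value = None                  # the first resume starts the body
    try:
        while True:
            n = cont.send(value)  # cont.resume, passing the condition back in
            seen.append(n)        # ...process $n...
            value = n >= maximum
    except StopIteration:
        return seen               # the generator terminated

print(run_upto(3))  # [0, 1, 2, 3]
```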

Example: A simple thread scheduler

For a more interesting example, consider a simple scheduler for green threads. We define two events with which a thread can signal the scheduler to either yield or spawn a new thread:

(event $yield)
(event $spawn (param (ref $proc)))

where

(type $proc (func))

We further assume that there is a global variable encoding a thread queue in the form of a list of continuations:

(global $queue (list-of (cont $proc)) ...)

The exact representation does not matter, so list-of is just pseudo-code. All we need are three obvious operations on this queue, ignoring their details:

(func $enqueue (param (cont $proc)) …)
(func $dequeue (result (cont $proc)) …)
(func $queue_empty (result i32) …)

Given these prerequisites, a simple scheduler loop can be implemented as follows:

(func $scheduler (param $main (ref $proc))
  (cont.new (local.get $main)) (call $enqueue)
  (loop $l
    (if (call $queue_empty) (then (return)))   ;; program is done
    (block $on_yield (result (cont $proc))
      (block $on_spawn (result (ref $proc) (cont $proc))
        (call $dequeue)
        (cont.resume (event $yield $on_yield) (event $spawn $on_spawn))
        (br $l)                                ;; thread terminated
      )
      ;; on $spawn: proc and cont are on the stack
      (call $enqueue)                          ;; continuation of old thread
      (cont.new) (call $enqueue)               ;; new thread
      (br $l)
    )
    ;; on $yield: cont is on the stack
    (call $enqueue)
    (br $l)
  )
)
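
For comparison, a hypothetical Python model of the same scheduler, with generators standing in for green threads (the names and the (tag, payload) encoding of events are invented for illustration):

```python
# Illustrative Python model of $scheduler: generators stand in for green
# threads; yielding ("yield", None) or ("spawn", proc) plays the role of
# cont.suspend $yield / $spawn.
from collections import deque

queue = deque()                    # the global $queue of continuations

def scheduler(main):
    queue.append(main())           # cont.new + $enqueue
    while queue:                   # until $queue_empty
        cont = queue.popleft()     # $dequeue
        try:
            event, payload = cont.send(None)   # cont.resume
        except StopIteration:
            continue                           # thread terminated
        if event == "spawn":
            queue.append(cont)                 # continuation of old thread
            queue.append(payload())            # new thread
        elif event == "yield":
            queue.append(cont)                 # old thread goes to the back

log = []

def child_thread():
    log.append("child:1")
    yield ("yield", None)
    log.append("child:2")

def main_thread():
    log.append("main:1")
    yield ("spawn", child_thread)
    log.append("main:2")
    yield ("yield", None)
    log.append("main:3")

scheduler(main_thread)
print(log)  # ['main:1', 'main:2', 'child:1', 'main:3', 'child:2']
```

The interleaving shows the round-robin behaviour: each yield or spawn suspends the running thread back to the scheduler loop, which is exactly the control transfer that cont.resume's handler labels express in the wat version.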

TODO: More examples.

Interaction with regular exceptions

Regular exceptions and events are disjoint. An exception handler will not catch events and vice versa.

If resuming a continuation throws a regular exception that is uncaught on its stack then it propagates through the active resume, unwinding the stack encapsulated in the continuation and ending its lifetime.

Yielding an event, on the other hand, does not trigger exception handlers between the suspend and the resume point. They become active again when the continuation is resumed and execution thereby switches back to their stack.

Note that this dichotomy could, in principle, be used to implement exceptions with two-phase unwind: the first phase is represented by an event, whose handler then injects back a proper exception to unwind. However, such an implementation would be fairly expensive, since every exception handler would have to run on its own stack. It is still useful as a semantic explanation of what ought to happen, and of how two-phase unwind would interact with single-phase exceptions.
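
The two-phase pattern can be made concrete in a generator-based model (an illustrative sketch, not actual Wasm): phase 1 raises an event while all frames stay intact; the handler then either resumes with a corrective value, or injects a real exception (the cont.throw analogue) to unwind.

```python
# Illustrative: two-phase unwind on top of events and continuations.

def computation():
    decision = yield ("error", "disk full")   # phase 1: ask; frames stay intact
    return "recovered via " + decision

# Handler judges the condition recoverable: resume with a corrective value.
cont = computation()
assert cont.send(None) == ("error", "disk full")  # the event reaches the handler
try:
    cont.send("retry")                        # phase 2a: resume
    resumed = None
except StopIteration as done:
    resumed = done.value                      # computation completed normally

# Handler judges it fatal: inject a real exception (the cont.throw analogue).
cont = computation()
cont.send(None)
try:
    cont.throw(RuntimeError("fatal"))         # phase 2b: unwind the stack
    aborted = False
except RuntimeError:
    aborted = True

print(resumed, aborted)  # recovered via retry True
```

This also makes the cost argument visible: in real Wasm, each such handler would need its own stack, which is why the paragraph above treats this encoding as a semantic explanation rather than an implementation strategy.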

rossberg avatar Jul 29 '20 18:07 rossberg

Is this identical to what you presented in the past @rossberg ? (If not a summary of the changes might be useful.)

kripken avatar Jul 29 '20 18:07 kripken

@kripken, yes, up to some renaming and minor tweaks.

rossberg avatar Jul 29 '20 19:07 rossberg

Got it, thanks @rossberg .

kripken avatar Jul 29 '20 20:07 kripken

Is there another proposal on the way on how to support applications of stack inspection, e.g. two-phase exception handling, resumable exceptions, C#/Python-style generators, stack tracing, linear-memory garbage collection, stack duplication?

RossTate avatar Jul 29 '20 20:07 RossTate

cont.throw doesn't take an evtref, so it's not possible to respond to an unknown suspension with an exception. Is that intentional?

The new function type $ft' : [tr*] -> [t2*] isn't a type that the module specifies, but one that is synthesized from other types. Doesn't that make for significantly more work at validation-time, as the synthesized type has to be checked for compatibility with a (later) specified function type?

taralx avatar Jul 29 '20 21:07 taralx

@RossTate Doesn't this support those, if indirectly, since the exceptions are resumable?

taralx avatar Jul 29 '20 22:07 taralx

@taralx I believe you're right that all of those features can be implemented in terms of this proposal. However, for many of those features, allocating an entire new stack would be overkill. Proposals for stack inspection, two-phase unwinding, etc. are essentially special cases of the feature proposed here that can be implemented more efficiently.

tlively avatar Jul 29 '20 22:07 tlively

Yes, and on top of that, this proposal requires running code just to find out if an inspection site is relevant to the inspection at hand. So I'm wondering if interaction with (more efficient) stack inspection has been taken into account in this design.

RossTate avatar Jul 29 '20 22:07 RossTate

Putting stack inspection aside, the lightweight-threads example seems to be a relatively inefficient implementation of lightweight threads. Ideally, switching between threads just involves a few instructions to switch contexts. But in this proposal, switching threads involves a stack walk to find the relevant br_on_evt, which is not necessarily cheap.

RossTate avatar Jul 29 '20 23:07 RossTate

@taralx:

cont.throw doesn't take an evtref, so it's not possible to respond to an unknown suspension with an exception. Is that intentional?

That's a good point. But you cannot actually get an evtref for an unknown suspension either, since each handler is guarded by a list of known events. The idea here was that a handler has no business interacting with an unknown suspension -- that would destroy composition. But maybe there are use cases for a handle-all.

The new function type $ft' : [tr*] -> [t2*] isn't a type that the module specifies, but one that is synthesized from other types. Doesn't that make for significantly more work at validation-time, as the synthesized type has to be checked for compatibility with a (later) specified function type?

I don't think so, since validation already has to be able to treat multiple definitions of the same function type as compatible. Function types are structural, even in the MVP.

FWIW, there is a similar question regarding func.bind and other such instructions. Currently, it takes a target type as an immediate to side-step the issue, but it would actually be enough (and cheaper to check) if it just had a numeric immediate indicating the number of parameters to bind, synthesising the resulting function type.

So generally speaking, we have to decide whether we want to avoid synthesising, at the price of more type annotations and checking on instructions like these, or if we're fine with synthesising such types (which may require slightly more work in the spec primarily).

@RossTate Doesn't this support those, if indirectly, since the exceptions are resumable?

It does, and more importantly, it does so in a composable fashion -- an important point I forgot to include (added). I.e., you can use multiple of these features together at the same time, even if they are implemented separately.

@tlively is correct however that there is one special case where creating a new stack can be overkill, namely when the continuation does not escape the handler. In practice, though, handling some of these cases without a stack would require the introduction of either fragmented stack frames, display chains, or scope restrictions that essentially make it equivalent to just calling another function. It is worth investigating if adding those mechanisms is a relevant win over implementing them in user space.

@RossTate:

Putting stack inspection aside, the lightweight-threads example seems to be a relatively inefficient implementation of lightweight threads. Ideally, switching between threads just involves a few instructions to switch contexts. But in this proposal, switching threads involves a stack walk to find the relevant br_on_evt, which is not necessarily cheap.

That is not correct. Handlers are associated with resume points, not branches. Consequently, a handler is only present "at the bottom" of each non-suspended stack, and there is no need to walk an actual stack. Moreover, each resume declares a list of tags it handles. So transferring control to a handler can e.g. be implemented by an indirect jump through a (thread-local) side table associating each event tag with its active handler.
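
The side-table strategy can be sketched concretely (an illustrative pseudo-implementation; all names invented): each resume installs its handled tags in a per-thread table on entry and restores the previous entries on exit, so dispatching a suspension is a single lookup rather than a stack walk.

```python
# Illustrative sketch of the side-table strategy: each cont.resume
# pushes its handled tags into a per-thread table on entry and restores
# the previous entries on exit, so a suspend dispatches in O(1) by a
# single table lookup instead of a stack walk.

handler_table = {}   # tag -> innermost active handler (thread-local)

def with_handlers(tags_to_handlers, body):
    saved = {t: handler_table.get(t) for t in tags_to_handlers}
    handler_table.update(tags_to_handlers)     # install on resume
    try:
        return body()
    finally:
        for t, prev in saved.items():          # restore on suspend/return
            if prev is None:
                handler_table.pop(t, None)
            else:
                handler_table[t] = prev

def dispatch(tag, payload):
    return handler_table[tag](payload)         # O(1): no stack walk

out = with_handlers(
    {"yield": lambda p: "outer"},
    lambda: with_handlers(
        {"yield": lambda p: "inner"},          # shadows the outer handler
        lambda: dispatch("yield", None)))
assert out == "inner"
assert "yield" not in handler_table            # table fully restored on exit
```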

rossberg avatar Jul 30 '20 09:07 rossberg

It does, and more importantly, it does so in a composable fashion -- an important point I forgot to include (added). I.e., you can use multiple of these features together at the same time, even if they are implemented separately.

Yes, and the same is true for every other suggestion that has been made for directly supporting these features. Only the suggestion of implementing shadow stacks within WebAssembly without direct support is not composable.

however that there is one special case

Let's work through the first such "special case" that was discussed four months ago: linear-memory garbage collection. Runtimes and programs with their own (linear-)memory management generally have some way to inspect the stack to find and later update the GC roots on demand. These programs now want to compile to WebAssembly and want some way to inspect the stack on demand to find and update these roots (albeit in a safer manner per WebAssembly's security requirements).

Your proposal is to allocate a new stack at every point someone would want to perform such inspection. For these applications, nearly every stack frame has such a root and therefore such a need. So your proposal is essentially to have these applications allocate every stack frame on its own separate stack. @aardappel, as someone who has been particularly advocating for this feature, do you think this would serve your needs?

Now the second "special case" that was discussed was stack tracing. All the most-used garbage-collected languages have some means for collecting/examining the stack trace (sometimes even lazily using an interactive stack walk) during execution even in production mode (not just debug mode). Again, nearly every frame would need to provide stack-trace information, so again your proposal would have all these languages allocate every stack frame on its own separate stack.

Moreover, each resume declares a list of tags it handles.

Ah, missed this change. There's a bug though. cont.forward also needs to specify a (handler) label and a list of events.

Oh, and another (unrelated) bug: cont.throw needs to consider that the continuation might catch the thrown exception. So it has an output type, and it needs to specify a (handler) label and a list of events.

Consequently, a handler is only present "at the bottom" of each non-suspended stack, and there is no need to walk an actual stack. Moreover, each resume declares a list of tags it handles. So transferring control to a handler can e.g. be implemented by an indirect jump through a (thread-local) side table associating each event tag with its active handler.

This assumes a particular implementation strategy and infrastructure, essentially baking in decisions that would often be made at the application level (e.g. should this dynamically scoped variable be implemented using a stack walk or using thread-local storage). If we're going to assume there is always the infrastructure for thread-local storage, it might be a good idea to incorporate the feature directly into WebAssembly, since it's useful for a lot more than just continuations. Regardless, this means that in the examples above every call and return (in the surface language) would have to update thread-local storage. In general, every suspend/resume, forward, and throw will have to update thread-local storage, which common implementations of lightweight threads do not need.

RossTate avatar Jul 30 '20 15:07 RossTate

When you presented this proposal six months ago, it was really great. You provided a lot of great ideas and garnered a lot of interest for an unconventional yet useful feature. That sparked a lot of productive conversations within the CG. Those conversations explored the design space, identifying weaknesses in the specifics you presented and then searching for alterations and extensions that would address those weaknesses. That is exactly what the effect of a good presentation should be. But this design has barely changed to incorporate those many related conversations within the group over the months since.

RossTate avatar Jul 30 '20 15:07 RossTate

This design seems especially well suited to implement first-class stack-full asymmetric co-routines, as it maps relatively 1:1 onto its implementation needs.

One place where it is more general than such a co-routine feature is the use of event tags, which essentially allow one to yield not just from the current co-routine, but from any of the parents, presumably suspending a chain of co-routines in the process. It is not quite clear to me when this is useful, and I could imagine it having an implementation cost. I guess this gives us the flexibility of continuations where each stack frame has its own lifetime.

The machinery also seems quite heavy-weight for typical uses of "generators", which tend to function more like fancy for-loops than anything else. There would have to be some kind of guarantee that it is easy to collapse this functionality (no dynamic allocations, re-use of the same stack, etc.) for this to be feasible. Leaving this up to the cleverness of the engine makes it useless.

That holds even more if this is to be useful for stack inspection like @RossTate mentions. Even if we can specify that in simple cases these instructions translate to something close to no-ops, it seems like it would generate much bigger binaries than the simpler stack inspection proposal. To me, the whole point of having something like a stack inspection feature is that it is more efficient than a shadow stack, so if there's even a small chance that it isn't (e.g., it requires a dynamic allocation), then it will not be used.

So as it stands, this feature seems limited in its applicability, unless its generality can be subdivided into semantically more restrictive uses with different implementation requirements. That is perhaps counter-intuitive: the more general a feature is, the more things it can represent, but potentially the fewer things it can represent efficiently, which is what matters most for Wasm, I feel.

aardappel avatar Jul 30 '20 16:07 aardappel

Another consideration is how to inspect (efficiency aside) the delimited continuations. For example, a garbage-collecting lightweight-thread manager needs to collect (and later update) the roots across all the lightweight threads. Given a particular cont representing a yielded lightweight thread, it's not clear how the thread manager can reasonably collect the cont's roots using this approach.

RossTate avatar Jul 30 '20 17:07 RossTate

I think, for fundamental performance/implementation reasons, we should expect that there are two features we need:

  1. the ability for wasm to cheaply create, suspend and resume a lightweight stack (green thread)
  2. the ability for wasm to iterate over and inspect (and possibly mutate) opt-in locals of frames on the stack without having had to significantly change the representation of the stack

I think this proposal speaks to 1, whereas stack iteration proposals speak to 2. They should obviously be designed to compose, but as far as I can tell from a perf/impl POV, I don't think it's practical to assume that one proposal is going to satisfy all our use cases and achieve all of our goals here. I also think the relative priorities of these features are different: there are several high-priority use cases for 1, whereas at least one wasm engine has specifically voiced an intention to delay 2 until we have wasm GC in place.

lukewagner avatar Jul 30 '20 18:07 lukewagner

Yeah, a common thread across feedback we got for #1340 was to prioritize first-class stacks over stack inspection. My presentation last week was to gauge whether stack inspection should still be a feature in the pipeline, and based on online and offline feedback the answer seems to be likely yes (provided it comes out after GC). This is important to know for designing first-class stacks because, just as stack inspection can be combined with unwinding to support two-phase exception handling, stack inspection can be combined with (restricted) detaching or redirecting to support (one-shot) delimited continuations. So if you want a "unifying view on stack-related mechanisms", you need to take stack inspection into account, which this proposal has not.

RossTate avatar Jul 30 '20 19:07 RossTate

@RossTate:

It does, and more importantly, it does so in a composable fashion -- an important point I forgot to include (added). I.e., you can use multiple of these features together at the same time, even if they are implemented separately.

Yes, and the same is true for every other suggestion that has been made for directly supporting these features.

No, it is not. It is one of the main innovations of effect handlers, and the reason why PL folks got excited about them, that they are a compositional mechanism for defining effects, including control effects.

Your proposal is to allocate a new stack at every point someone would want to perform such inspection.

I have not been proposing that. The only thing I have mentioned is that effect handlers can express this form of stack inspection (and they immediately explain how that would interact with other control features). But obviously, they would only be suitable as an actual strategy if we were able to reliably optimise the special case I mentioned -- which may or may not be the case in this setup. If not, there needs to be a more explicit alternative that can be.

There's a bug though. cont.forward also needs to specify a (handler) label and a list of events.

Why? I don't think it does.

Oh, and another (unrelated) bug: cont.throw needs to consider that the continuation might catch the thrown exception. So it has an output type

Ah thanks, you're right, I mixed up signatures. Fixed.

and it needs to specify a (handler) label and a list of events.

We could do that. But it's not clear whether it would be all that useful. The idea is that cont.throw is used to "destroy" a stack, when it is not supposed to be completed anymore. At least with the use cases listed, it shouldn't throw back to the handler anymore in that situation. But maybe there are some odd use cases? This is one of the details that needs figuring out.

This assumes a particular implementation strategy and infrastructure, essentially baking in decisions that would often be made at the application level

It doesn't assume, it just points out one possible implementation strategy. It may not even be the best one, as fast implementations have used other strategies.

When you presented this proposal six months ago, it was really great. You provided a lot of great ideas and garnered a lot of interest for an unconventional yet useful feature. That sparked a lot of productive conversations within the CG. Those conversations explored the design space, identifying weaknesses in the specifics you presented and then searching for alterations and extensions that would address those weaknesses. That is exactly what the effect of a good presentation should be. But this design has barely changed to incorporate those many related conversations within the group over the months since.

I honestly don't know what you mean. AFAICT, there hasn't been much discussion of this proposal. There have been various discussions about various ideas, and occasionally I pointed out the connection. Correct me if I'm wrong, but none of these recent ideas has been presented as a substitute for this one, so its utility still stands. Whether it is sufficient to also subsume some of the others (efficiently) is a separate discussion to be had.

So if you want a "unifying view on stack-related mechanisms", you need to take stack inspection into account, which this proposal has not.

Yes, well, the nice thing about this proposal is that it can express it, even if not efficiently, but thereby at least suggesting a canonical semantics for integrating such a feature.

@aardappel:

This design seems especially well suited to implement first class stack-full asymmetric co-routines, as it maps relative 1:1 on its implementation needs.

Ah, it may look like it at first glance, but it's not at all specific to or optimised for co-routines. Its heritage is (algebraic) effect handlers, which are a principled way of expressing and composing almost any sort of state or control effect.

One place where it is more general than such a co-routine feature is the use of event tags, which essentially allow one to yield not just from the current co-routine, but from any of the parents, presumably suspending a chain of co-routines in the process. It is not quite clear to me when this is useful, and I could imagine it having an implementation cost. I guess this gives us the flexibility of continuations where each stack frame has its own lifetime.

Ah, the ability to nest resumption is what makes this mechanism composable. It is what allows each control abstraction to be implemented separately, e.g., you can run an async function in a green thread and still yield to the scheduler without worrying about intermediate handlers. That significantly simplifies implementations, as well as enabling the use of control abstractions in a heterogeneous setting with multiple interacting languages.

The machinery also seems quite heavy-weight for typical uses of "generators", which tend to function more like fancy for-loops than anything else. There would have to be some kind of guarantee that it is easy to collapse this functionality (no dynamic allocations, re-use of the same stack etc etc) for this to be feasible. Leaving this up to the cleverness of the engine makes it useless.

It is much less heavyweight than you might think. We have reason to believe that implementing generators this way can be just as or even more efficient on average than the common approach of constructing a state machine, which often involves additional local control flow to reenter the previous state and requires copying live locals to the heap at every yield, or allocating them there in the first place. Plus, it immediately allows deep generators (yielding from a nested call) without any additional overhead.

rossberg avatar Jul 31 '20 08:07 rossberg

The design outlined above is similar to that of effect handlers in Multicore OCaml. In practice, the overhead of implementing a range of abstractions from generators to lightweight threads to async/await on top of effect handlers in Multicore OCaml is close to zero.

slindley avatar Jul 31 '20 10:07 slindley

Another consideration is how to inspect (efficiency aside) the delimited continuations. For example, a garbage-collecting lightweight-thread manager needs to collect (and later update) the roots across all the lightweight threads. Given a particular cont representing a yielded lightweight thread, it's not clear how the thread manager can reasonably collect the cont's roots using this approach.

There is a standard way of implementing this, used by Multicore OCaml as well as Go, GHC Haskell, Fibers in OpenJDK, and other languages which use lightweight threads. Essentially, the continuations (blocked and ready-to-run goroutines in Go, blocked lightweight threads in GHC) become a distinguished object in the GC, and the GC knows how to scan the stack associated with the continuation. The suspended continuations are marked as regular objects would be. For any suspended continuation reachable from the GC roots, the GC scans the stack associated with the continuation in the same way it would scan the current program stack.

kayceesrk avatar Jul 31 '20 12:07 kayceesrk

@kayceesrk That comment was meant to be in the context of modules that implement their own garbage collector in linear memory, though it's my fault for not making that clear. So my point was that a module-implemented garbage collector would not be able to perform the scans you describe.

RossTate avatar Jul 31 '20 13:07 RossTate

It is one of the main innovations of effect handlers, and the reason why PL folks got excited about them, that they are a compositional mechanism for defining effects, including control effects.

Effect handlers are a surface-level language feature. Depending on specifics of the handlers, they can be compiled in a variety of ways. While continuations are always a viable option for any handler, they are generally by far the least efficient option. Even the body of work you are basing this on has significant discussion of analyses and optimizations for identifying common special cases and eliminating their overhead, but many of those optimized forms are not expressible in this proposal. So what's problematic about this proposal is that it's based on high-level concepts but does not provide the corresponding low-level primitives that those concepts are often implemented with.

Correct me if I'm wrong, but none of these recent ideas has been presented as a substitute for this one, so its utility still stands. Whether it is sufficient to also subsume some of the others (efficiently) is a separate discussion to be had.

WebAssembly/exception-handling#105 was written up to describe these low-level primitives. You were explicitly informed here that it could express algebraic effects, but you never engaged in that conversation. #1340 was a Phase 0 proposal to explore those ideas. But in that conversation concerns were raised about security, prioritization, efficiency, and complexity. The presentation I gave last week was following up on these concerns regarding stack inspection, which we needed to gauge in order to determine how best to address the concerns regarding first-class stacks.

The only thing I have mentioned is that effect handlers can express this form of stack inspection (and they immediately explain how that would interact with other control features). But obviously, they would only be suitable as an actual strategy if we were able to reliably optimise the special case I mentioned -- which may or may not be the case in this setup. If not, there needs to be a more explicit alternative that can be.

This more explicit alternative has already been discussed a fair amount. It is stack inspection, e.g. #1340 and #1356. Notice that your cont.suspend instruction has the same type as call_stack in #1356. The difference is that call_stack does not require an answer to perform stack allocation just to find out basic information or make basic updates to the current stack. But if you combine it with stack-allocation and stack-redirection primitives, an answer can choose to use continuations if it needs to in order to support the surface-level feature at hand, as we show in #1360. That is, cont.suspend breaks down into a bunch of smaller primitive operations and is not a primitive operation itself.

So this proposal is two iterations behind others that were heavily inspired by it, but which have since broken its concepts down into smaller parts that can be combined more flexibly, and which have already developed answers to many of the open questions in this proposal.

RossTate avatar Jul 31 '20 15:07 RossTate

Will this proposal be compatible with the changes I proposed in WebAssembly/exception-handling#123, namely, removal of exnref, introduction of catch_br, and split of catch and unwind?

aheejin avatar Aug 01 '20 04:08 aheejin

@aheejin From what I can tell, it should be compatible with those changes.

RossTate avatar Aug 01 '20 04:08 RossTate

soil-initiative/stacks#10 now has two translations of this proposal in terms of #1360, one using stack inspection and one using thread-local storage (assuming such a thing exists in the host and is made accessible to WebAssembly). Each implementation strategy has its strengths and weaknesses, and different effects might be better expressed with one or with the other (the translations compose so that different effects can in fact make different choices even within the same application). One of the problems with the proposal here is that it does not let the application pick its own implementation strategy; instead it bakes in a high-level feature and leaves it to the engine to decide/guess which strategy to use, much like how the JVM bakes in interface-method dispatch.

RossTate avatar Aug 04 '20 05:08 RossTate

The green threads scheduler example shown by @rossberg at the 2020-08-04 meeting appears, with the addition of exception handling to catch exceptions from the cont.resume, to be sufficient to implement a scheduler for Lumen (an Erlang compiler/runtime).

On x86_64, where we have native stack switching, we add yield points (here encoded as calls to @__lumen_builtin_yield()) whenever an Erlang process (green thread) has used up its time slice (encoded as @CURRENT_REDUCTION_COUNT reaching a static limit).

init.erl

-module(init).
-export([start/0]).
-import(erlang, [display/1]).

start() ->
  display(atom).

init.llvm.mlir

module @init {
  llvm.func @"erlang:display/1"(!llvm.i64) -> !llvm.i64
  llvm.func @__lumen_builtin_yield()
  llvm.mlir.global external local_exec @CURRENT_REDUCTION_COUNT() : !llvm.i32
  llvm.func @lumen_eh_personality(...) -> !llvm.i32
  llvm.func @"init:start/0"() -> !llvm.i64 attributes {personality = @lumen_eh_personality} {
    %0 = llvm.mlir.constant(20 : i32) : !llvm.i32
    %1 = llvm.mlir.addressof @CURRENT_REDUCTION_COUNT : !llvm<"i32*">
    %2 = llvm.load %1 : !llvm<"i32*">
    %3 = llvm.icmp "uge" %2, %0 : !llvm.i32
    llvm.cond_br %3, ^bb1, ^bb2
  ^bb1:	// pred: ^bb0
    llvm.call @__lumen_builtin_yield() : () -> ()
    llvm.br ^bb2
  ^bb2:	// 2 preds: ^bb0, ^bb1
    %4 = llvm.mlir.constant(562949953421378 : i64) : !llvm.i64
    %5 = llvm.mlir.addressof @CURRENT_REDUCTION_COUNT : !llvm<"i32*">
    %6 = llvm.mlir.constant(1 : i32) : !llvm.i32
    %7 = llvm.atomicrmw add %5, %6 monotonic  : !llvm.i32
    %8 = llvm.call @"erlang:display/1"(%4) {tail} : (!llvm.i64) -> !llvm.i64
    llvm.return %8 : !llvm.i64
  }
}

On the BEAM for Erlang, the schedulers can do work stealing. How continuations are stored was hand-waved in the presentation. @rossberg will the continuations be stealable/transferable between Web Workers to allow for a scheduler per Web Worker (and one on the main UI thread) to work steal?

KronicDeth avatar Aug 04 '20 17:08 KronicDeth

Updated OP to match the version I presented today, removing separate evtref type and br_on_evt instruction. Also removed cont.forward instruction, since it may not be needed and its semantics seems to be confusing.

rossberg avatar Aug 04 '20 17:08 rossberg

@kayceesrk Thanks for the talk! It was very clear and well presented.

In your talk, you showed some assembly code on how you implemented algebraic effects in Multicore OCaml. I feel like a goal of a WebAssembly design for stacks should be to enable WebAssembly programs to essentially write those assembly-level operations themselves, provided the functionality can be provided in a way that is similarly composable. Based on your experience with Multicore OCaml, what would be the key primitives for enabling that?

In the previous CG meeting it was suggested that these continuations would be used for effects that would have a handler in roughly every stack frame. The implementation strategy you presented involved a linear walk through effect handlers, which suggests that having frequent effect handlers would affect performance. Have you run experiments in the context of such effects? If not, and you would like a concrete example to try, make a pass through the source code so that every function call has a handler indicating what file and line number the function call is on (or was on in the original source code), so that effect handlers can be used to implement stack tracing (turning off any optimizations you have to eliminate continuation allocations). I'd be interested to hear what the relative performance of the original and the modified programs are.
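For concreteness, here is a rough Python model of the suggested experiment (all names -- `handlers`, `with_location`, `stack_trace` -- are hypothetical and invented for this sketch): every call is wrapped in a handler frame recording its source location, and the "trace" effect performs the linear walk through the enclosing handlers.

```python
# Sketch of stack tracing via per-call handler frames.

handlers = []                           # handler chain, innermost last

def with_location(filename, line, thunk):
    """Wrap a call in a handler frame recording its source location."""
    handlers.append((filename, line))
    try:
        return thunk()
    finally:
        handlers.pop()                  # unwind the handler on return

def stack_trace():
    """The 'effect': linearly walk every enclosing handler, outermost first."""
    return list(handlers)

def inner():
    return stack_trace()

def outer():
    return with_location("app.py", 7, inner)

trace = with_location("main.py", 3, outer)
assert trace == [("main.py", 3), ("app.py", 7)]
assert handlers == []                   # all handler frames popped on return
```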

RossTate avatar Aug 04 '20 17:08 RossTate

@rossberg Those were much appreciated improvements!

RossTate avatar Aug 04 '20 17:08 RossTate

@rossberg will the continuations be stealable/transferable between Web Workers to allow for a scheduler per Web Worker (and one on the main UI thread) to work steal?

I'm guessing that continuations will not be accessible across threads in the first version, regardless of which stack switching proposal we go with.

binji avatar Aug 04 '20 17:08 binji

Yeah, I suspect that we won't even be able to pass simple data references between threads for quite some time to come, let alone functions or stacks. The hurdle isn't so much semantics but implementations: most web engines will have a very tough time retrofitting any of that. Hope we get there eventually, though.

rossberg avatar Aug 04 '20 18:08 rossberg