wazero icon indicating copy to clipboard operation
wazero copied to clipboard

feat: tail-call proposal

Open evacchi opened this issue 4 months ago • 1 comments

Implements the tail call extension proposal. Besides the included test cases, I have verified that this works with a simple factorial compiled with GHC, and also made an in-depth analysis on SQLite. Thanks to @ncruces's invaluable help, as usual <3, we discovered a lot of edge cases that were not covered in the spec test suite.

I understand we're all pretty busy these days, so I am obviously not calling for an urgent review :)


The PR adds Feature Flags modeled after those for Threads:

  • Added a new experimental feature flag CoreFeaturesTailCall to enable tail call support
  • Users need to explicitly enable this feature to use tail call instructions

The PR adds support in both the interpreter and compiler backends, I have also added a few e2e examples, some adhoc tests, and the spec testsuite.

Caveats and Notable Changes

  • Notably, the implementation in the compiler for both backends restores the stack as it would normally do in the epilogue (in fact, it invokes setupEpilogueAfter() in postRegAlloc()), then it jumps to the beginning of the function, thus it is not "just" a jump: it required adding corresponding pseudo-instructions to treat them specially.

  • In particular, on amd64, indirect calls required to mov the function pointer to a safe (caller-saved) register (I picked r11) to make sure it was not overwritten in the epilogue (which occurred to me in my testing).

Edge Cases

The following edge cases were only discovered through testing SQLite, and I have added e2e and ad-hoc test to reproduce them. SQLite (well, LLVM with tail calls enabled) does not necessarily make use of tail calls for recursion, but most often to optimize a call to a private function usually from a public, exported function. So this exposes a number of interesting situations that the spec tests do not account for (even though they are definitely not forbidden, according to the spec), like tail calling host functions.

  1. In the interpreter, I am pretty much just resetting the program counter and pointing the function body to the function that needs to be called; this does not work very well with imported functions (i.e. functions defined in a different module) and in particular this won't flight with host functions! However, since the tail call is mostly an optimization, I have decided that for now it is not worth optimizing these cases --this is experimental, after all; the interpreter will fall back to a plain call in these cases.

  2. In the compiler there are fewer limitations, with a bigger constraint: it is hard to perform tail calls in the presence of stack arguments (i.e. given the types have to match because of validation, this occurs when the number of args is above a given threshold, different in the ARM and AMD calling conventions [*]) the reasons are a. if the sizes of the required stack space do not match, then we might be writing garbage b. even if the sizes match, we would need special attention not to clear the stack completely before we jump into the callee c. even if we managed to do that, the spill slots are not necessarily matching so, we would still need some special care (notice that other runtimes have completely revolutionized their internal calling conventions for all cals to accommodate tail calls)

So, again, in these cases I have decided to fall back to plain calls.

Now, this could be addressed in two ways:

  • in the front-end by adding a fixed constant maxArgs that would be set to min(regArgsAmd, regArgsArm) == 8 and calling it a day

    this has the only downside of surfacing into the back-end a constant that is effectively arch-specific. If you want to see how that looks, take a look at this branch --- this lacks a lot of polish, but you'll get the idea

  • in the back-end with proper arch-specific handling, which is the solution I went with in the end

    unfortunately, because of the way and point in time we are emitting Returns, I think the best way to do this in the back-end is always emitting returns, basically treating tail calls effectively as plain calls almost until we get to the encoding phase. This way, we get the regalloc phase to work as usual; then:

    • if the tail call cannot be handled, then we just flip it to a traditional call, and then all the return handling will be reachable code;
    • if the tail call CAN be handled, THEN we just remove the useless instructions between the tail-call and the ret instruction during Finalize (postRegAlloc()), as it's dead code anyway.

[*] fun aside: this made isolating the issue particularly frustrating because the same set of test was failing on ARM and magically passing on AMD... because (surprisingly, to me) in AMD the calling convention allows to pass more args via registers than ARM before falling back to the stack

evacchi avatar Jun 14 '25 12:06 evacchi