
Optimization idea for the VM: tail calls

Open haberman opened this issue 4 years ago • 7 comments

I just came across Wren today and it looks very cool. I have been a Lua fan for a while, but I always wished for statically defined classes.

I notice that Wren has a small implementation and also has a focus on speed. For this reason I wanted to point you towards an idea that I think would likely deliver significant performance benefits, and would be reasonably easy to experiment with: https://blog.reverberate.org/2021/04/21/musttail-efficient-interpreters.html

With this approach I think it is possible to implement an interpreter that can match the tight assembly sequences of LuaJIT 2's interpreter without actually having to write in assembly (an example of this is given in the article). This approach is also taken by the wasm3 WebAssembly interpreter.

You'll need __attribute__((musttail)) (just released in Clang 13) if you want to avoid blowing the stack in debug mode. But with some macros, I think you could potentially structure the code such that you can switch between this and computed goto easily.

Just an idea, feel free to ignore, but I wanted to send it your way. :)

haberman avatar Oct 11 '21 07:10 haberman

I have been thinking about it for a while, but with the current implementation it has quite a few implications.

  1. In the VM, it would mean we lose ~32 opcode values (probably more, to adjust stack attributes?) and would need to branch the call code or duplicate it.
  2. In the compiler, it would mean adding support for "tail" and "musttail" attributes (or whatever solution we adopt) in contexts where it is safe to ignore them.
  3. It can cut off stack layers, making stack traces harder to follow when enabled, leading to fragmentation in usage.

Because of the first point, I don't think it is a good idea to accept such a change without good benchmark numbers. In addition, wren does not really have a stack size limit (as of now), so one should not really hit a stack overflow unless they mess up and trigger an infinite recursion. But enabling it would reduce the stress on stack memory allocation.

It is a nice feature that would definitely have great potential to speed up the VM a lot, but the 3rd point still bothers me too much unless we can prove some solid speed improvements, say more than 2 times faster.

mhermier avatar Oct 11 '21 07:10 mhermier

I think there may be a misunderstanding. I'm not talking about tail calls for Wren functions, or any change to the Wren language. I'm talking about using tail calls in the Wren interpreter only (i.e. only in runInterpreter() in wren_vm.c).

Instead of:

  #define DISPATCH()                                                           \
      do                                                                       \
      {                                                                        \
        DEBUG_TRACE_INSTRUCTIONS();                                            \
        goto *dispatchTable[instruction = (Code)READ_BYTE()];                  \
      } while (false)

You can do:

  #define DISPATCH()                                                           \
      do                                                                       \
      {                                                                        \
        DEBUG_TRACE_INSTRUCTIONS();                                            \
        __attribute__((musttail)) return                                       \
            dispatchTable[(Code)READ_BYTE()](vm, fiber, frame, stackStart, ip, fn); \
      } while (false)

Each VM opcode would be implemented in a separate function that takes the interpreter state (vm, fiber, etc) as parameters. This will help vm, fiber, frame, etc. stay in registers, and make the interpreter faster with no changes to the Wren language.

You can see an example of how wasm3 applied this approach: https://github.com/wasm3/wasm3/blob/main/docs/Interpreter.md#m3-massey-meta-machine

haberman avatar Oct 11 '21 18:10 haberman

This would change the structure of the dispatch function by creating a plethora of functions. While there would be no functional change, what I understand of the current situation is that the code fits nicely in a few cache pages, making the code fast, to the point that even tweaking some opcodes can trigger some dramatic performance changes (at least on my system). So I suspect this change would have unpredictable performance implications.

In addition, it would rely on the presence of a non-standard extension that would not degrade gracefully if ignored, since it would overflow the host's stack.

While I understand the motivation for larger projects, I think relying on this kind of extension is not the smartest move, since it reduces the list of supported compilers. And there are cases, in particular if your project is embedded, where one has to use an exotic compiler with limited language support, and usually poor extension support.

mhermier avatar Oct 12 '21 06:10 mhermier

While I understand the motivation for larger projects, I think relying on this kind of extension is not the smartest move, since it reduces the list of supported compilers.

I think you might be able to use musttail conditionally, just as you conditionally use computed goto now.

You already have your ops broken apart into quasi-functions, e.g.:

    CASE_CODE(LOOP):
    {
      // Jump back to the top of the loop.
      uint16_t offset = READ_SHORT();
      ip -= offset;
      DISPATCH();
    }

You could make CASE_CODE() something like this when musttail is available:

#define CASE_CODE(opcode) \
    int op_##opcode(WrenVM* vm, ObjFiber* fiber, CallFrame* frame, \
        Value* stackStart, uint8_t* ip, ObjFn* fn)

Then you can define the surrounding code so that CASE_CODE() either occurs in global context (when musttail is available) or inside a switch() statement in the runInterpreter() function (when musttail is not available).

I won't say it's trivial, but I think the resulting code could be quite portable, elegant, and fast.

While there would be no functional change, what I understand of the current situation is that the code fits nicely in a few cache pages, making the code fast

I expect that overall code size would stay about the same or even shrink.

As an example, take your LOOP op. Compiled as a function in this style, it compiles to only 11 instructions. Putting the code into a separate function has not introduced any size overhead: https://godbolt.org/z/EGv7hq8ov

to the point that even tweaking some opcodes can trigger some dramatic performance changes (at least on my system). So I suspect this change would have unpredictable performance implications.

I expect that performance (when musttail is available) would get more predictable, because each op would get optimized independently. When all the ops are in one big function, any small change to one op can change the compiler's register allocation for the entire function. If each op is in its own function, each op is optimized independently, so tweaking one op cannot affect the code generated for other ops.

I totally understand if this idea doesn't appeal to you. If I find some time, I may try this out on a branch to fulfill my own curiosity. If I get any good results I'll send you a PR, you can do with it what you will. :)

haberman avatar Oct 12 '21 08:10 haberman

It is not a question of appeal. Unless you prove my arguments wrong, the trade-off does not seem worth it as is. If the impact on code organization is low or nil and performance is good enough (say, no more than ~10% worse in the worst-case scenarios), we can start considering supporting it.

As it stands, fully supporting it would rely on 2 compiler extensions (inner functions and forced tail calls), and that is my biggest point against adoption. But if performance is better than my expectations, it is really worth it, in particular if it also brings a little bit of stability when tweaking opcodes.

mhermier avatar Oct 12 '21 08:10 mhermier

Yes, it seems to me that if the patch were simple and non-invasive (if it would somehow redefine existing macros, as mentioned above), made Wren measurably faster on some platforms, and made it more stable (in terms of performance from patch to patch), then having it behind an optional compiler flag would be pretty sweet.

Looking forward to seeing a PR with some benchmarks if you find the time, @haberman

joshgoebel avatar Oct 12 '21 11:10 joshgoebel

Thanks for the info @haberman, it sounds pretty interesting for sure. Despite the compiler-specific extensions (which we already rely on, e.g. computed goto, specifically because of perf), I would imagine it's worth investigating.

The interpreter loop is quite small and a messy quick version would probably be able to answer what kind of benefits we'd see from the idea, without it being done neatly or in any formal way. That would be enough to know whether it's worth the neater abstractions/optional flags later.

As maintainer of Wren I would encourage you/others to try it out and let us know (sometimes comments from the community may be misinterpreted as speaking for the project itself, which is not the case).

ruby0x1 avatar Oct 12 '21 19:10 ruby0x1