
[DO NOT LAND] Experiment with using pops in the current interpreter

Open · kripken opened this issue 5 months ago · 2 comments

Uploading this just for future reference and discussion.

Our current interpreter is pretty efficient, it turns out: it interprets the IR in place, without constructing any new IR, and at each expression it does something like the following, e.g. on Binary:

visitBinary(curr) {
  auto left = interpret(curr->left);
  auto right = interpret(curr->right);
  return left + right; // For Binary we need all the other ops too: - * / etc.
}

The point of this simplified pseudocode is that each execution of an expression ends up doing three things:

  • Call the right visit*() method, e.g., for a Binary we call visitBinary. (this does a switch on the expression id)
  • Get the children. (here this is done by doing curr->left etc.)
  • Execute the instruction on the children. (here this is done with the +)

What is efficient here is that we do a single switch on the expression id, and once we are in the proper visit*() method, we have all the specific logic there: both finding the children and executing the operation on them.

This PR experiments with splitting these operations: it uses ChildVisitor to find the children - this does one switch - and then, once we have them, calls the right visit*() - this does another switch - which pops those values (via getChild() in this code). That is, this uses a value stack, like the WIP new interpreter.
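For concreteness, a minimal sketch of what the popping side might look like - the names here (valueStack, applyBinary) are my illustration, not code from the PR:

// Illustrative only: operands are computed onto a value stack first, and
// each visit*() pops what it needs instead of recursing into its children.
std::vector<Literal> valueStack;

Literal getChild() {
  auto value = valueStack.back();
  valueStack.pop_back();
  return value;
}

void visitBinary(Binary* curr) {
  auto right = getChild(); // children were pushed left to right,
  auto left = getChild();  //   so they come off in reverse order
  valueStack.push_back(applyBinary(curr->op, left, right)); // hypothetical helper for + - * / etc.
}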

The benefit of this separation is that we can handle control flow in one generic place - no more if (breaking()) return in each visitor - and this organization of the code is a step toward adding stack switching support (we can handle switching in that single generic place, save and restore the value stack, etc.).
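Roughly, that single generic place could be the driver loop itself - again just a sketch of the idea with made-up helpers (runNextTask, unwindToBreakTarget), not the PR's actual code:

// The driver loop owns dispatch and control flow, so individual visitors no
// longer need their own if (breaking()) return boilerplate.
while (!taskStack.empty()) {
  runNextTask();           // either evaluates a child or calls a visit*()
  if (breaking()) {
    unwindToBreakTarget(); // breaks/returns handled once, right here; this is
  }                        //   also where stack switching could save and
}                          //   restore the value stack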

However, this makes the interpreter 2x slower. I tried various ways to optimize it, but it does seem that fundamentally the current interpreter is just very efficient, and doing two switches on the expression class is simply worse, in addition to the cost of the stack itself (though the stack seems less of an issue). And I believe this experiment is a good model for the performance of the WIP new interpreter, @tlively, unless I am missing something.

Perhaps we can still avoid doing two switches. The first switch could find not only the children but also push the proper visit*() method onto the stack, to be fetched and executed at the proper time. But that will still be an unpredictable indirect call, so I am not optimistic.
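In code, that idea might look something like this (the std::function task stack and schedule() helper are illustrative assumptions on my part, not anything in the PR):

// Illustrative only: the first switch stores which visit*() to run later, so
// the later dispatch becomes an indirect call instead of another switch on
// the expression id - but that call is still hard for the CPU to predict.
std::vector<std::function<void()>> taskStack;

void scheduleBinary(Binary* curr) {
  taskStack.push_back([=] { visitBinary(curr); }); // runs after the children
  schedule(curr->right); // children are scheduled on top, so they evaluate
  schedule(curr->left);  //   first and leave their values on the value stack
}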

Anyhow, this is not the end of the world - while this would make Precompute 2x slower, the effect on total execution time may not be that bad, since it is just one pass among many - but I am still looking into further options here.

kripken · Jul 28 '25 19:07

It's nice to see how much boilerplate can be removed by using an explicit stack, even in this minimal change to the existing interpreter. I think the 2x performance difference is more or less a worst-case bound, right? There's still some baggage here that a finished stack-based interpreter would not use, such as Flow and Literals. We could also experiment with different iterator designs and with tail calls, etc. to try to eke out more performance. I'm not looking at the profiles you're looking at, but I hope there's room for improvement.

tlively · Jul 28 '25 20:07

I'm not sure if it's a worst-case bound, but it is what I saw on a handful of large/realistic wasm files.

Yes, maybe we can optimize a little more here, but as I said, the big slowdown is just the two switches, as best I can measure (for example, I measured the cost of adding a value stack on its own, without dispatching through another switch). It's double the indirect calls compared to the current interpreter, and for an interpreter, which does little real work in each iteration, an indirect call is just expensive.

kripken · Jul 28 '25 20:07

This was implemented in an Asyncify-like manner elsewhere.

kripken · Dec 11 '25 18:12