Tail-call dispatch
I'd like to suggest implementing tail-call dispatch in QuickJS. Here's a quick demo of what it is.
This was recently done in CPython with great success (a 5-10% performance gain). I've slapped together a WIP patch for QuickJS, and here are some preliminary results (Debian 13 arm64 VM on a Mac M4 with clang 19.1.7, median of 10 runs):
| Test | b2268561 | 2dcc05b1 | % | p_welch |
|---|---|---|---|---|
| Richards | 1799.5 | 1968 | +9.36% | 0.0000* |
| DeltaBlue | 1872.5 | 1836 | -1.95% | 0.0000* |
| Crypto | 2033 | 2472 | +21.59% | 0.0288* |
| RayTrace | 3408.5 | 3652 | +7.14% | 0.0000* |
| EarleyBoyer | 4052 | 4257.5 | +5.07% | 0.5588 |
| RegExp | 1012 | 1023 | +1.09% | 0.5063 |
| Splay | 5693.5 | 5813 | +2.10% | 0.8515 |
| SplayLatency | 19752 | 20219 | +2.36% | 0.0019* |
| NavierStokes | 3730 | 5150.5 | +38.08% | 0.0000* |
| Geomean | 3247 | 3534 | +8.84% | |
The diff is somewhat large (914 insertions, 659 deletions), but the bulk of it is harmless formatting changes that make the CASE blocks less interdependent and splittable into separate functions. Besides tail-call dispatch, being able to split them up like that could also be useful for experimenting with adding a JIT.
Nice! Question: is the enlargement of the stack frame structure not a concern?
I added 7 fields to JS_StackFrame; it was simply the easiest way to pass them through to code inside the CASE blocks, which no longer has access to JS_CallInternal's local variables, but there's likely room to optimize there. Note that I also eliminated 4 of those variables, which should help somewhat.
Oh, and there would of course be some additional stack usage at any transition out of tail-callers, where they need to spill state to the stack, but I haven't measured it. This is a more fundamental cost of the approach, but the performance gains probably justify it, especially since it's easy to turn off at compile time if needed.
Interesting. On x86_64 I measured a (small) speedup of 3.5% after removing one parameter from the opcode functions (otherwise there are not enough saved registers). The main benefit seems to be that the generated code is less susceptible to performance regressions caused by varying compiler optimizations, which are difficult to predict for large functions.