
Tail-call dispatch

Open ivankra opened this issue 1 month ago • 4 comments

I'd like to suggest implementing tail-call dispatch in QuickJS. Quick demo of what it is.

This was recently done in CPython with great success, +5-10% performance. I've slapped together a WIP patch for QuickJS; here are some preliminary results (Debian 13 arm64 VM on a Mac M4 with clang 19.1.7, median of 10 runs):

| Test | b2268561 | 2dcc05b1 | % | p_welch |
| --- | ---: | ---: | ---: | ---: |
| Richards | 1799.5 | 1968 | +9.36% | 0.0000* |
| DeltaBlue | 1872.5 | 1836 | -1.95% | 0.0000* |
| Crypto | 2033 | 2472 | +21.59% | 0.0288* |
| RayTrace | 3408.5 | 3652 | +7.14% | 0.0000* |
| EarleyBoyer | 4052 | 4257.5 | +5.07% | 0.5588 |
| RegExp | 1012 | 1023 | +1.09% | 0.5063 |
| Splay | 5693.5 | 5813 | +2.10% | 0.8515 |
| SplayLatency | 19752 | 20219 | +2.36% | 0.0019* |
| NavierStokes | 3730 | 5150.5 | +38.08% | 0.0000* |
| **Geomean** | 3247 | 3534 | +8.84% | |

The diff is somewhat large (914+ 659-), but the bulk of it is harmless formatting changes that make the CASE blocks less interdependent and splittable into separate functions. Besides tail-call dispatch, being able to split them up like that could also be useful for experimenting with adding a JIT.

ivankra · Dec 04 '25

Nice! Question: is the enlargement of the stack frame structure not a concern?

saghul · Dec 04 '25

I added 7 fields to JS_StackFrame; it was just the easiest way for me to pass them through to code inside the CASE blocks without access to JS_CallInternal's local variables, but there's likely room to optimize there. Note that I also eliminated 4 of those variables, which should help somewhat.

ivankra · Dec 04 '25

Oh, and there would of course be some additional stack usage at any transition out of a tail-calling handler, where state needs to be spilled to the stack, but I haven't measured it. That is a more fundamental cost of this approach, but the performance gains probably justify it, especially since it's easy to turn off at compile time if needed.

ivankra · Dec 04 '25

Interesting. On x86_64 I measured a (small) speedup of 3.5% after removing one parameter from the opcode handler functions (otherwise there are not enough callee-saved registers). The main benefit seems to be that the generated code is less susceptible to performance regressions caused by varying compiler optimizations, which are difficult to predict in large functions.

bellard · Dec 22 '25