tracing jit
The jit will be a tracing jit, not a method jit. A tracing jit is slower, much more complex and needs more profiling state, but it needs much less memory, especially on such dynamic apps with a lot of dead code that is never executed. We want to trace calls and loops, similar to v8. For us memory is more important than performance. A perl5 jit does not have many benefits, as the ops are way too dynamic, so we mostly just win on the icache by having the op calls laid out one after another in memory instead of jumping to random heap locations (see the runloop sketch below). With more and more type information a real jit, going into the ops, would be worthwhile, e.g. for typed native arrays or native arithmetic.
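For contrast, this is roughly what perl's standard runloop does (simplified from run.c; the exact body differs between versions): every op is dispatched through an indirect call via op_ppaddr, so execution keeps jumping to whatever heap address the optree happens to point at.

/* simplified sketch of Perl_runops_standard from run.c; every op is an
   indirect call, which is what hurts the icache and branch predictor */
int
Perl_runops_standard(pTHX)
{
    OP *op = PL_op;
    while ((PL_op = op = op->op_ppaddr(aTHX))) {
        /* each pp_* function returns the next op to execute */
    }
    TAINT_NOT;
    return 0;
}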
dynasm is currently the easiest, as it supports more archs; moar already uses it. But you have to write your insns manually, not abstractly as in libjit or asmjit, although it supports other, more important abstractions, like types and slots, ... One nice thing would be to replace the dynasm.lua preprocessor with a simple perl script; this would need at most 2 days.
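A minimal sketch of how a DynASM-based emitter is usually driven, just to show the moving parts; the file has to be run through the dynasm.lua preprocessor (or the proposed perl replacement) before it is compiled as C. Everything here (the file name, the trivial constant-returning function) is an illustrative assumption, not cperl code.

/* jit_sketch.dasc -- hypothetical example; preprocess with dynasm.lua first */
#include <sys/mman.h>
#include "dasm_proto.h"
#include "dasm_x86.h"

|.arch x64
|.section code
|.actionlist actions

#define Dst &state

/* emit "int f(void) { return value; }" at runtime and return its address */
static void *jit_const_int(int value)
{
    dasm_State *state;
    size_t size;
    void *code;

    dasm_init(&state, DASM_MAXSECTION);
    dasm_setup(&state, actions);

    |.code
    /* the immediate comes from the C variable `value` */
    | mov eax, value
    | ret

    dasm_link(&state, &size);
    code = mmap(NULL, size, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    dasm_encode(&state, code);
    mprotect(code, size, PROT_READ | PROT_EXEC);
    dasm_free(&state);
    return code;    /* cast to int (*)(void) and call */
}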
See e.g. https://github.com/imasahiro/rujit/ for a memory-hungry tracing jit which is 2-3x faster.
We will always have the fallback of using the huge llvm jit, but I'm sceptical that it is fast enough with its overhead. We will test it first, as libcperl.bc can be imported and used for LTO and inlining.
The experimental guile tracing jit nash looks better: https://github.com/8c6794b6/guile-tjit-documentation/blob/master/nash.pdf
We also need the jit for the ffi, so we can omit libffi and just go with the jit (see the sketch below).
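For reference, this is roughly what a libffi-based call looks like today: the signature is described at runtime in an ffi_cif and every call goes through generic argument marshalling, whereas a jit can emit a direct call stub with the arguments already in the right registers. (Hypothetical standalone example; cos is just a stand-in target.)

/* hypothetical standalone libffi example: call double cos(double) through
   a runtime signature description instead of a jitted call stub */
#include <ffi.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    ffi_cif cif;
    ffi_type *argtypes[1] = { &ffi_type_double };
    double arg = 1.0, result;
    void *argvalues[1] = { &arg };

    if (ffi_prep_cif(&cif, FFI_DEFAULT_ABI, 1,
                     &ffi_type_double, argtypes) == FFI_OK) {
        ffi_call(&cif, FFI_FN(cos), &result, argvalues);
        printf("cos(1.0) = %f\n", result);
    }
    return 0;
}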
But first we will start with a very simple method jit in LLVM, to benchmark the cost/benefit ratio for the simple
PL_op = Perl_pp_enter(aTHX);      /* direct calls, no op_ppaddr indirection */
PL_op = Perl_pp_nextstate(aTHX);
...
PL_op = Perl_pp_leave(aTHX);
linearization, and do the simplest and easiest op optimizations first, especially nextstate, which is currently the most costly op, mostly because of an unneeded stack reset on every single line. The jit knows the stack depth for most simple ops and can easily bypass that (#18). The jit also knows about locals and tainted vars.
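To see why nextstate is so costly, this is roughly its body (simplified; the exact code differs between perl versions): the stack pointer is reset and temporaries are freed on every single statement, even when the previous statement provably left the stack balanced.

/* roughly what perl's pp_nextstate does at every statement boundary
   (simplified; exact body differs between perl versions) */
PP(pp_nextstate)
{
    PL_curcop = (COP*)PL_op;    /* remember the current statement */
    TAINT_NOT;                  /* each statement is presumed innocent */
    /* the costly part: unconditional stack reset on every line */
    PL_stack_sp = PL_stack_base + cxstack[cxstack_ix].blk_oldsp;
    FREETMPS;                   /* free mortal temporaries */
    PERL_ASYNC_CHECK();         /* handle pending signals */
    return NORMAL;
}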
Then we can start counting calls and loops, and switch between the jit and the bytecode runloop if beneficial (see the sketch below). The question is whether the LLVM optimizer can inline the ops, or whether it needs their IR. E.g. unladen_swallow needed to compile a complete libpython.bc runtime, and still needed a huge and slow LLVM abstraction library to emit the IR.
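A minimal sketch of what the call/loop counting and runloop switch could look like; the threshold, the counter struct and jit_compile_cv are illustrative assumptions, not cperl code.

/* hypothetical sketch of call/loop counting and jit dispatch; the names
   JIT_HOT_THRESHOLD, jit_counter_t and jit_compile_cv are illustrative only */
#include "EXTERN.h"
#include "perl.h"

#define JIT_HOT_THRESHOLD 100

typedef struct {
    CV  *cv;                 /* the sub being profiled */
    U32  hits;               /* call / loop back-edge counter */
    void (*entry)(pTHX);     /* jitted entry point, NULL until compiled */
} jit_counter_t;

/* hypothetical: compiles the sub's linearized op calls to native code */
void (*jit_compile_cv(pTHX_ CV *cv))(pTHX);

static void
maybe_jit_and_run(pTHX_ jit_counter_t *c)
{
    if (!c->entry && ++c->hits > JIT_HOT_THRESHOLD)
        c->entry = jit_compile_cv(aTHX_ c->cv);

    if (c->entry)
        c->entry(aTHX);                /* run the jitted, linearized code */
    else
        Perl_runops_standard(aTHX);    /* fall back to the bytecode runloop */
}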
See the feature/gh220-llvmjit branch.