Reducing the number of AST nodes per method
I'll look into reducing the number of AST nodes we produce for Ruby code. Fewer nodes mean a smaller footprint, shallower (Java) stacks, a faster interpreter, and maybe slightly faster PE.
We can likely merge much of the method prelude into a RubyMethodRootNode, etc., and have specific RubyRootNode subclasses for method/proc/lambda. => done
Potentially, reading arguments could be done with a single node instead of 2+ nodes per argument. We'd still need a WriteFrameSlotNode if we write the argument to a frame variable (which represents the Ruby local variable). Maybe we don't even need to write arguments to frame variables, and could instead mutate the frame arguments array when the argument variable is reassigned. We'd need to be careful in case something depends on the frame arguments array not changing. The trade-off is that writing to Ruby variables corresponding to arguments would box (unless the caller inlines us).
There is currently a pattern of having a SelfNode and storing+reading self from a frame variable; this seems worth reconsidering. Unboxing self is probably of limited value (Graal should be able to see it needs to unbox only once anyway), and reading self from the frame arguments could avoid needing an explicit SelfNode and would remove the ReadFrameSlotNode/WriteFrameSlotNode for it.
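The self-in-arguments idea can be sketched as follows. The class names here are hypothetical stand-ins, not TruffleRuby's real nodes, and the calling convention (self at index 0 of the frame arguments array) is an assumption for illustration:

```java
// Hypothetical sketch: read self and positional arguments straight out of
// the caller-provided frame arguments array, so the method prelude needs
// no SelfNode and no frame-slot write/read for them.
abstract class ArgSketchNode {
    abstract Object execute(Object[] frameArguments);
}

final class ReadSelfFromArgumentsNode extends ArgSketchNode {
    @Override
    Object execute(Object[] frameArguments) {
        return frameArguments[0]; // self, by this sketch's convention
    }
}

final class ReadPreArgumentFromArgumentsNode extends ArgSketchNode {
    private final int index;

    ReadPreArgumentFromArgumentsNode(int index) {
        this.index = index;
    }

    @Override
    Object execute(Object[] frameArguments) {
        return frameArguments[1 + index]; // skip the self slot
    }
}
```

Reassigning an argument variable would then mutate the arguments array in place, which is exactly the caveat above about code that assumes the array never changes.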
We could try to remove GetDefaultDefineeNode and replace it with null or some marker, and similarly the SingletonClassNode for def obj.method, where we could just use an uncached one. Method definitions (with def) should need very few nodes.
Recent change related to that: https://github.com/oracle/truffleruby/commit/b1e450651be0b2401f55b6a2053ad8bfd77334d1
Can you notify me about any PRs you have for this? I can easily do a before-and-after of the stack depth and total node count in the system for our application.
https://github.com/oracle/truffleruby/commit/b6c91eccb46a0b03f9f4cf04f6fb5fbbbb591ae6 goes in this direction and always inlines generated accessors, Module#attr_* themselves, visibility methods (private/protected/public), and #using. That probably does not affect stack depth much, as they are all leaf methods. It should reduce the number of nodes to execute/PE quite a bit though, and avoid walking the stack in quite a few cases.
I'm making progress on this, removing the CatchFor*Node and ExceptionTranslatingNode for method/lambda/proc and moving that logic inline into subclasses of RubyRootNode: 4e900526025b8116600329be2b9f1a2503ae90dc
For lambdas we have a slightly unfortunate arrangement where we have 2 SequenceNodes, because CheckArityNode executes its child:

```
RubyLambdaRootNode
  body = SequenceNode
    body[0] = CheckArityNode
      body = SequenceNode
        body[0] = WriteLocalVariableNode
          valueNode = ProfileArgumentNodeGen
            child_ = ReadSelfNode
        body[1] = WriteLocalVariableNode
          valueNode = ProfileArgumentNodeGen
            child_ = ReadPreArgumentNode
        body[2] = NilLiteralNode
    body[1] = ReadLocalVariableNode
```
So CheckArityNode should probably avoid having a child node, which would also reduce stack usage. Potentially we can even inline CheckArityNode into the RubyRootNode subclasses, at least for the simple arity cases.
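A rough sketch of what inlining the arity check into the root node could look like. SketchLambdaRootNode and SketchBodyNode are made-up stand-ins, and real Ruby arity handling (optional/rest/keyword arguments) is far richer than this fixed-arity case:

```java
// Hypothetical sketch: the arity check runs directly in the root node's
// execute(), so there is no CheckArityNode wrapping the body. That saves
// one AST node and one Java stack frame per call, and avoids the nested
// SequenceNode arrangement shown above.
abstract class SketchBodyNode {
    abstract Object execute(Object[] frameArguments);
}

final class SketchLambdaRootNode {
    private final int expectedArity; // simple fixed-arity case only
    private final SketchBodyNode body;

    SketchLambdaRootNode(int expectedArity, SketchBodyNode body) {
        this.expectedArity = expectedArity;
        this.body = body;
    }

    Object execute(Object[] frameArguments) {
        // Inlined arity check: no child node, no extra execute() frame.
        if (frameArguments.length != expectedArity) {
            throw new IllegalArgumentException("wrong number of arguments (given "
                    + frameArguments.length + ", expected " + expectedArity + ")");
        }
        return body.execute(frameArguments);
    }
}
```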
Let me know if you want a before-and-after measurement for a given PR.
There were 2 PRs here: 4e90052 and 4abc6b6e68a62a4956f7bd58a52b57741cf7b254 (complete diff: 14322c1778b03279ea1b4efb71bb1dfa8a8d0e53...4abc6b6e68a62a4956f7bd58a52b57741cf7b254) So it would be interesting to compare before them at 14322c1778b03279ea1b4efb71bb1dfa8a8d0e53 and after them at 4abc6b6e68a62a4956f7bd58a52b57741cf7b254.
Those PRs should reduce the total number of nodes for method & block preludes, as well as decrease stack depth, since CatchFor*Node, ExceptionTranslatingNode and CheckArityNode are now handled directly in RootNode#execute.
I'll reopen because only part of the ideas here were done.
We noticed 20-25% improvements on the deltablue and richards benchmarks in interpreter-only mode (Native CE & EE, --engine.Compilation=false) for 2e3bf7d50321a5ffc48906829e7bb1e2a4bece53...5474e163faf1aff6cea1935450b198536e2d8b19.
That's quite a few commits, but it notably includes b1e4506 and b6c91ec.
cc @smarr
It's actually even more visible on JVM, but with the caveat that the JVM might inline some virtual calls in the interpreter, since these benchmarks are relatively small. So generally we recommend comparing on Native for interpreter performance.
On 21.0.0 JVM CE I get ~600ms per iteration for richards.
On master JVM CE I get ~350ms per iteration for richards.
(I'm using jt -u CONFIG ruby --experimental-options --engine.Compilation=false bench/polybench.rb ../graal/vm/benchmarks/interpreter/richards.rb, will add the harness soon)
Thanks for the CC.
Similar to https://github.com/oracle/truffleruby/commit/b6c91eccb46a0b03f9f4cf04f6fb5fbbbb591ae6 I did recently experiment with inlining getters/setters (like HotSpot does, I believe) and a few other basic things, like methods returning constants or globals, and object instantiation. Inlining here means, I essentially integrate these trivial nodes into the dispatch chain directly, instead of having to go through a Truffle method call.
https://github.com/SOM-st/TruffleSOM/pull/57
The gains on DeltaBlue are not as big, just 10%. On Richards it's 28% and NBody 36%.
Performance results: https://rebench.stefan-marr.de/compare/SOMns/d53f2b50adf3705fc19bd9bde50c2e8ea81e7924/a2ff3f7704d2fa110e6622e68d6126f4cd98685b
The largest program I have in these benchmarks is the SomSom interpreter (SOM written in SOM), which sees an 8-10% benefit.
> Inlining here means, I essentially integrate these trivial nodes into the dispatch chain directly, instead of having to go through a Truffle method call.
Right, I call that "AST inlining", not sure if there is a better term.
Interesting, are those results on Native (best, most representative) or JVM? And if JVM, is it with --engine.Compilation=false (better, more representative) or by not including /compiler (not representative of interpreter perf in releases)?
FWIW we AST-inline quite a few calls in TruffleRuby, including new as well:
https://github.com/oracle/truffleruby/blob/master/src/main/java/org/truffleruby/core/inlined/CoreMethodAssumptions.java
Those are detected at parse/translate time, if it's a literal call to one of these.
And there is also AlwaysInlinedMethodNode: methods that are AST-inlined into the "call method node". This is more dynamic, and is notably used for getters/setters/send and a couple of things that would otherwise need to get the caller frame, but by being AST-inlined have it right at hand. Methods in this category are always AST-inlined, which is nice as we don't need to handle the case of them not being AST-inlined (e.g., we never need to walk the stack to get the caller frame for those).
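As a rough illustration of AST inlining for a getter (made-up classes, not TruffleRuby's actual dispatch machinery): once the call site has resolved the callee to a trivial attribute reader, it performs the field read itself instead of making a Truffle call into the getter's root node:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of AST-inlining a getter into the call site.
final class SketchRubyObject {
    final Map<String, Object> instanceVariables = new HashMap<>();
}

abstract class SketchDispatchNode {
    abstract Object dispatch(SketchRubyObject receiver);
}

// A normal call would push a frame and execute the getter's body as a
// separate root node. The AST-inlined dispatch node instead reads the
// instance variable directly: no call, no new frame.
final class InlinedGetterDispatchNode extends SketchDispatchNode {
    private final String ivarName; // recorded when the getter was resolved

    InlinedGetterDispatchNode(String ivarName) {
        this.ivarName = ivarName;
    }

    @Override
    Object dispatch(SketchRubyObject receiver) {
        return receiver.instanceVariables.get(ivarName);
    }
}
```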
> Right, I call that "AST inlining", not sure if there is a better term.
Yepp, AST inlining seems suitable. The linked PR actually does it for a bytecode interpreter too, but yes, same idea.
> are those results on Native (best, most representative) or JVM?
Yes ;)
This info is hidden in the executor names:
- native-interp-ast: a native image, with engine.Compilation=false, for the AST interpreter
- native-interp-bc: a native image, with engine.Compilation=false, for the BC interpreter
- -interp: on top of HotSpot, with the default Truffle runtime, AST interpreter
- -graal: on top of HotSpot, with Graal enabled as normal, BC interpreter

The numbers I referred to were for the AST interpreter in a native image, with engine.Compilation=false.
Did you want me to run this again?
@chrisseaton No, there was no change in this area since https://github.com/oracle/truffleruby/issues/2261#issuecomment-812049762
This blog post details Multi-Tier and also shows the effect of the interpreter optimizations in this issue:
https://medium.com/graalvm/multi-tier-compilation-in-graalvm-5fbc65f92402#a16c (under "Interpreter impact" if the anchor doesn't work)
With Truffle Node inlining there is less need to reduce the number of nodes, and we did some of this, so closing.