
Reducing the number of AST nodes per method

Open eregon opened this issue 3 years ago • 14 comments

I'll look into reducing the number of AST nodes we produce for Ruby code. Fewer nodes mean a smaller footprint, shallower (Java) stacks, a faster interpreter, and maybe slightly faster PE (partial evaluation).

We can likely merge much of the method prelude into a RubyMethodRootNode, etc., and have specific RubyRootNode subclasses for method/proc/lambda. => done

Potentially, reading arguments could be done in a single node instead of 2+ nodes per argument. We'd still need a WriteFrameSlotNode if we write the argument to a frame variable (which represents the Ruby local variable). Maybe we don't even need to write arguments to frame variables, and could instead mutate the frame arguments array when the argument variable is reassigned. We'd need to be careful if something depends on the frame arguments array not changing. The trade-off is that writing to Ruby variables corresponding to arguments would box (unless the caller inlines us).
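A minimal sketch of the single-node idea, using hypothetical names (Frame, LoadArgumentsNode are illustrative models, not TruffleRuby's actual classes): one node copies all pre-arguments from the frame's arguments array into local variable slots in a loop, replacing a ProfileArgumentNode/ReadPreArgumentNode + WriteFrameSlotNode pair per argument.

```java
// Illustrative model only -- not TruffleRuby's actual classes.
final class Frame {
    final Object[] arguments; // by convention here: [self, arg0, arg1, ...]
    final Object[] slots;     // storage for Ruby local variables

    Frame(Object[] arguments, int slotCount) {
        this.arguments = arguments;
        this.slots = new Object[slotCount];
    }
}

final class LoadArgumentsNode {
    private final int argumentCount;

    LoadArgumentsNode(int argumentCount) {
        this.argumentCount = argumentCount;
    }

    // One loop in one node, instead of 2+ nodes per argument.
    void execute(Frame frame) {
        for (int i = 0; i < argumentCount; i++) {
            frame.slots[i] = frame.arguments[i + 1]; // index 0 holds self
        }
    }
}
```

Note that keeping self at a fixed index of the arguments array, as in this sketch, is also what would let a reader of self index the array directly instead of going through a frame slot.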

There is currently a pattern of having SelfNode and storing+reading self from a frame variable; this seems worth reconsidering. Unboxing self is probably of limited value (Graal should be able to see it needs to unbox only once anyway), and reading self from the frame arguments could avoid needing an explicit SelfNode and would remove the ReadFrameSlotNode*/WriteFrameSlotNode for it.

We could try to remove GetDefaultDefineeNode and replace it with null or some marker, and similarly the SingletonClassNode for def obj.method, where we could just use an uncached one. Method definitions (with def) should need very few nodes.

Recent change related to that: https://github.com/oracle/truffleruby/commit/b1e450651be0b2401f55b6a2053ad8bfd77334d1

eregon avatar Feb 19 '21 15:02 eregon

Can you notify me about any PRs you have for this and I can easily do a before-and-after of the stack depth and total nodes in the system for our application.

chrisseaton avatar Feb 20 '21 00:02 chrisseaton

https://github.com/oracle/truffleruby/commit/b6c91eccb46a0b03f9f4cf04f6fb5fbbbb591ae6 goes in this direction and always inlines generated accessors, Module#attr_* themselves, visibility methods (private/protected/public) and #using. That probably does not affect stack depth much as they are all leaf methods. It should reduce the number of nodes to execute/PE quite a bit though, and avoid walking the stack for quite a few cases.

eregon avatar Feb 22 '21 17:02 eregon

I'm making progress on this, removing the CatchFor*Node and ExceptionTranslatingNode for method/lambda/proc and moving that logic inline in subclasses of RubyRootNode: 4e900526025b8116600329be2b9f1a2503ae90dc

For lambdas we currently have a slightly unfortunate arrangement with two SequenceNodes, because CheckArityNode executes its child:

      RubyLambdaRootNode
        body = SequenceNode
          body[0] = CheckArityNode
            body = SequenceNode
              body[0] = WriteLocalVariableNode
                valueNode = ProfileArgumentNodeGen
                  child_ = ReadSelfNode
              body[1] = WriteLocalVariableNode
                valueNode = ProfileArgumentNodeGen
                  child_ = ReadPreArgumentNode
              body[2] = NilLiteralNode
          body[1] = ReadLocalVariableNode

So probably CheckArityNode should avoid having a child node, which would also reduce how much stack is used. Potentially we can even inline CheckArityNode in RubyRootNode subclasses, at least for the simple arity cases.
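A minimal sketch of inlining the arity check, with hypothetical names (ArityError, LambdaRoot are illustrative, not the actual classes): the check moves from a CheckArityNode wrapping the body into the root node's execute itself, removing one node level, one Java stack frame, and the extra SequenceNode per call.

```java
// Illustrative model only -- not TruffleRuby's actual classes.
final class ArityError extends RuntimeException {
    ArityError(int expected, int actual) {
        super("wrong number of arguments (given " + actual + ", expected " + expected + ")");
    }
}

final class LambdaRoot {
    private final int requiredArguments;

    LambdaRoot(int requiredArguments) {
        this.requiredArguments = requiredArguments;
    }

    Object execute(Object[] arguments) {
        // Inlined arity check: previously a separate CheckArityNode child
        // which executed the body, forcing a second SequenceNode level.
        if (arguments.length != requiredArguments) {
            throw new ArityError(requiredArguments, arguments.length);
        }
        return executeBody(arguments);
    }

    // Stand-in for executing the lambda body; returns the last argument.
    private Object executeBody(Object[] arguments) {
        return arguments.length == 0 ? null : arguments[arguments.length - 1];
    }
}
```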

eregon avatar Mar 29 '21 14:03 eregon

Let me know if you want a before-and-after measurement for a given PR.

chrisseaton avatar Mar 29 '21 22:03 chrisseaton

There were 2 PRs here: 4e90052 and 4abc6b6e68a62a4956f7bd58a52b57741cf7b254 (complete diff: 14322c1778b03279ea1b4efb71bb1dfa8a8d0e53...4abc6b6e68a62a4956f7bd58a52b57741cf7b254). So it would be interesting to compare before them at 14322c1778b03279ea1b4efb71bb1dfa8a8d0e53 and after them at 4abc6b6e68a62a4956f7bd58a52b57741cf7b254.

Those PRs should reduce the total number of nodes for method & block preludes, as well as decrease stack depth, since CatchFor*Node, ExceptionTranslatingNode and CheckArityNode are now handled directly in the RootNode#execute.

eregon avatar Apr 01 '21 17:04 eregon

I'll reopen because only part of the ideas here were done.

eregon avatar Apr 14 '21 13:04 eregon

We noticed 20-25% improvements on the deltablue and richards benchmarks in interpreter-only mode (Native CE & EE, --engine.Compilation=false) for 2e3bf7d50321a5ffc48906829e7bb1e2a4bece53...5474e163faf1aff6cea1935450b198536e2d8b19. That's quite a few commits, but it notably includes b1e4506 and b6c91ec.

cc @smarr

eregon avatar Apr 15 '21 12:04 eregon

It's actually even more visible on JVM, but that has the caveat that HotSpot might inline some virtual calls in the interpreter because these benchmarks are relatively small. So generally we recommend comparing on Native for interpreter performance. On 21.0.0 JVM CE I get ~600ms per iteration for richards. On master JVM CE I get ~350ms per iteration for richards. (I'm using jt -u CONFIG ruby --experimental-options --engine.Compilation=false bench/polybench.rb ../graal/vm/benchmarks/interpreter/richards.rb, will add the harness soon)

eregon avatar Apr 15 '21 12:04 eregon

Thanks for the CC.

Similar to https://github.com/oracle/truffleruby/commit/b6c91eccb46a0b03f9f4cf04f6fb5fbbbb591ae6, I recently experimented with inlining getters/setters (like HotSpot does, I believe) and a few other basic things, like methods returning constants or globals, and object instantiation. Inlining here means I essentially integrate these trivial nodes into the dispatch chain directly, instead of having to go through a Truffle method call.

https://github.com/SOM-st/TruffleSOM/pull/57
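A rough model of integrating a trivial node into the dispatch chain, with hypothetical names (Obj, GetterDispatchNode, shapeId are illustrative, not TruffleSOM's or TruffleRuby's actual classes): the inline-cache entry for a getter stores the field index and performs the field read directly, so a cache hit never creates a separate method activation.

```java
// Illustrative model only -- not the actual interpreter classes.
final class Obj {
    final Object[] fields;
    final int shapeId; // stands in for the object's shape/class identity

    Obj(int shapeId, Object... fields) {
        this.shapeId = shapeId;
        this.fields = fields;
    }
}

final class GetterDispatchNode {
    private final int cachedShapeId;
    private final int fieldIndex;

    GetterDispatchNode(int cachedShapeId, int fieldIndex) {
        this.cachedShapeId = cachedShapeId;
        this.fieldIndex = fieldIndex;
    }

    Object dispatch(Obj receiver) {
        if (receiver.shapeId == cachedShapeId) {
            // Getter body integrated into the dispatch chain: a plain field
            // read, no Truffle call and no extra stack frame.
            return receiver.fields[fieldIndex];
        }
        // A real implementation would fall back to a generic call here.
        throw new IllegalStateException("cache miss; generic dispatch elided in this sketch");
    }
}
```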

The gains on DeltaBlue are not as big, just 10%. On Richards it's 28%, and on NBody 36%.

Performance results: https://rebench.stefan-marr.de/compare/SOMns/d53f2b50adf3705fc19bd9bde50c2e8ea81e7924/a2ff3f7704d2fa110e6622e68d6126f4cd98685b

The largest program I have in these benchmarks is the SomSom interpreter (SOM written in SOM), which sees an 8-10% benefit.

smarr avatar Apr 15 '21 13:04 smarr

Inlining here means I essentially integrate these trivial nodes into the dispatch chain directly, instead of having to go through a Truffle method call.

Right, I call that "AST inlining", not sure if there is a better term.

Interesting, are those results on Native (best, most representative) or JVM? And if JVM is it with --engine.Compilation=false (better, more representative) or by not including /compiler (not representative of interpreter perf in releases)?

FWIW we AST-inline quite a few calls in TruffleRuby, including new as well: https://github.com/oracle/truffleruby/blob/master/src/main/java/org/truffleruby/core/inlined/CoreMethodAssumptions.java Those are detected at parse/translate time, if it's a literal call to one of these.

And there is also AlwaysInlinedMethodNode, for methods that are AST-inlined in the "call method node" itself. That is more dynamic and notably used for getters/setters/send and a couple of things that would otherwise need to get the caller frame, but by being AST-inlined have it right at hand. Methods in this category are always AST-inlined, which is nice as we don't need to handle them not being AST-inlined (e.g., we never need to walk the stack to get the caller frame for those).
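A hedged sketch of the parse-time variant described above; names are illustrative (IntLiteralNode, InlinedIntegerAddNode, CallTranslator), not TruffleRuby's actual classes. When the translator sees a literal call to a known core method such as Integer#+, it emits a node that performs the operation directly, instead of a generic call node that would dispatch through a full Truffle method call. The real implementation guards such nodes with Assumptions that the core method was not redefined.

```java
// Illustrative model only -- not TruffleRuby's actual classes.
interface RubyNode {
    Object execute();
}

final class IntLiteralNode implements RubyNode {
    private final long value;

    IntLiteralNode(long value) { this.value = value; }

    public Object execute() { return value; }
}

final class InlinedIntegerAddNode implements RubyNode {
    private final RubyNode left, right;

    InlinedIntegerAddNode(RubyNode left, RubyNode right) {
        this.left = left;
        this.right = right;
    }

    public Object execute() {
        // Direct addition; the real node would fall back to a generic call
        // if the operands are not Integers or Integer#+ was redefined.
        return (Long) left.execute() + (Long) right.execute();
    }
}

final class CallTranslator {
    // Chooses an inlined node for a known literal call at translate time.
    static RubyNode translateCall(String methodName, RubyNode receiver, RubyNode argument) {
        if (methodName.equals("+")) {
            return new InlinedIntegerAddNode(receiver, argument);
        }
        throw new UnsupportedOperationException("generic call node elided in this sketch");
    }
}
```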

eregon avatar Apr 15 '21 13:04 eregon

Right, I call that "AST inlining", not sure if there is a better term.

Yepp, AST inlining seems suitable. The linked PR actually does it for a bytecode interpreter as well, but yes, same idea.

are those results on Native (best, most representative) or JVM?

Yes ;)

This info is hidden in the executor names

  • native-interp-ast: a native image, with engine.Compilation=false for the AST interpreter
  • native-interp-bc: a native image, with engine.Compilation=false for the BC interpreter
  • -interp: on top of HotSpot, with the Truffle default runtime, AST interpreter
  • -graal: on top of HotSpot, with Graal as normally enabled, BC interpreter

The numbers I referred to were for the AST interpreter in a native image, with engine.Compilation=false

smarr avatar Apr 15 '21 13:04 smarr

Did you want me to run this again?

chrisseaton avatar Apr 19 '21 23:04 chrisseaton

@chrisseaton No, there was no change in this area since https://github.com/oracle/truffleruby/issues/2261#issuecomment-812049762

eregon avatar Apr 20 '21 10:04 eregon

This blog post details Multi-Tier and also shows the effect of the interpreter optimizations in this issue: https://medium.com/graalvm/multi-tier-compilation-in-graalvm-5fbc65f92402#a16c (under Interpreter impact if the anchor doesn't work)

eregon avatar Apr 20 '21 16:04 eregon

With Truffle Node inlining there is less need to reduce the number of nodes, and we did some of this, so I'm closing this issue.

eregon avatar Mar 13 '24 17:03 eregon