Fix #23224: Optimize simple tuple extraction

Open noti0na1 opened this issue 6 months ago • 1 comments

Fix #23224:

This PR optimizes simple tuple extraction by avoiding unnecessary tuple allocations and refines the typing of bind patterns for named tuples.

  • Update typedBind to use pt if the pattern is a named tuple.
  • Optimize makePatDef to reduce tuple creation when a pattern uses only simple variables or wildcards.

For example:

def f1: (Int, Int, Int) = (1, 2, 3)
def test1 =
  val (a, b, c) = f1
  a + b + c

Before this PR:

val $1$: (Int, Int, Int) =
  this.f1:(Int, Int, Int) @unchecked match 
    {
      case Tuple3.unapply[Int, Int, Int](a @ _, b @ _, c @ _) =>
        Tuple3.apply[Int, Int, Int](a, b, c)
    }
val a: Int = $1$._1
val b: Int = $1$._2
val c: Int = $1$._3
a + b + c

After this PR:

val $2$: (Int, Int, Int) =
  this.f1:(Int, Int, Int) @unchecked match 
    {
      case $1$ @ Tuple3.unapply[Int, Int, Int](_, _, _) =>
        $1$:(Int, Int, Int)
    }
val a: Int = $2$._1
val b: Int = $2$._2
val c: Int = $2$._3
a + b + c

The tree at genBCode is now:

val $2$: Tuple3 =  
  matchResult1[Tuple3]:
    {
      case val x1: Tuple3 = this.f1():Tuple3
      if x1 ne null then
        {
          case val $1$: Tuple3 = x1
          return[matchResult1] $1$:Tuple3
        }
        else ()
      throw new MatchError(x1)
    }
val a: Int = Int.unbox($2$._1())
val b: Int = Int.unbox($2$._2())
val c: Int = Int.unbox($2$._3())
a + b + c
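The difference between the two trees above can be approximated by hand at the source level. The following is only a sketch: the object and method names (TupleExtractionSketch, beforeStyle, afterStyle) are made up for illustration and are not part of the compiler or of this PR. It shows that the old shape rebuilds a tuple, while the new shape hands back the scrutinee itself.

```scala
// Hand-written approximations of the two desugarings; all names are illustrative.
object TupleExtractionSketch:
  def f1: (Int, Int, Int) = (1, 2, 3)

  // "Before" shape: match the tuple, bind each component, then rebuild a fresh tuple.
  def beforeStyle(t: (Int, Int, Int)): (Int, Int, Int) =
    t match
      case (a, b, c) => (a, b, c) // builds a second Tuple3

  // "After" shape: bind the whole scrutinee and return it unchanged.
  def afterStyle(t: (Int, Int, Int)): (Int, Int, Int) =
    t match
      case whole @ (_, _, _) => whole // no new tuple

  @main def runTupleExtractionSketch(): Unit =
    val t = f1
    println(beforeStyle(t) ne t) // reference inequality: a new tuple was built
    println(afterStyle(t) eq t)  // reference equality: the original tuple is reused
    val (a, b, c) = t
    println(a + b + c)
```

Reference equality (eq / ne) is used here only to make the extra object visible; the two shapes are otherwise equal under ==.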

I searched the compiler codebase with the regular expression (val\s*\(\s*[a-zA-Z_]\w*(\s*,\s*[a-zA-Z_]\w*)*\s*\)\s*=) and found 400+ places that are simple tuple extractions like this.
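To see what that regular expression does and does not catch, here is a small sketch that applies it to a few sample lines. The object name, the helper isSimpleTupleExtraction, and the sample lines are assumptions made for illustration; only the pattern itself comes from the comment above.

```scala
import scala.util.matching.Regex

object SimpleTupleValSearch:
  // The regular expression quoted above, unchanged.
  val simpleTupleVal: Regex =
    """(val\s*\(\s*[a-zA-Z_]\w*(\s*,\s*[a-zA-Z_]\w*)*\s*\)\s*=)""".r

  // True if the line contains a `val (x, y, ...) =` definition made of plain names only.
  def isSimpleTupleExtraction(line: String): Boolean =
    simpleTupleVal.findFirstIn(line).isDefined

  @main def runSimpleTupleValSearch(): Unit =
    val samples = List(
      "val (a, b, c) = f1",       // plain variables
      "val (x, _) = pair",        // a wildcard also fits the identifier part
      "val (a, (b, c)) = nested", // nested pattern: not matched
      "val (n: Int, s) = typed",  // typed pattern: not matched
      "val total = a + b"         // not a tuple pattern
    )
    for line <- samples do
      println(s"${isSimpleTupleExtraction(line)}  $line")
```

This is a textual heuristic, so it only approximates the set of patterns the optimization applies to.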

noti0na1 avatar Jun 16 '25 00:06 noti0na1

I split some of the changes into a separate PR, #23380, to diagnose the errors.

noti0na1 avatar Jun 16 '25 10:06 noti0na1

Added a bytecode test to ensure that no tuple is created in the generated code.

noti0na1 avatar Jun 26 '25 08:06 noti0na1

Absolutely, the JVM can often optimize such cases when it can prove that a tuple is pure and that the temporary object doesn't escape. As a compiler, we should still try our best to generate efficient, non-redundant code.

Consider extracting values from a tuple that contains both an Int and a String. Previously, this required unboxing the Int twice, boxing it again, and creating a throwaway tuple, all for no benefit. This PR doesn’t add extra complexity; instead, it removes unnecessary work introduced by an earlier design decision.

I chose to optimize tuple handling because tuples are widely used for returning multiple values, including in the compiler itself (see the regular-expression search results in the codebase). Ideally, we would extend this optimization to other data structures, but that is hard to do before typing.

noti0na1 avatar Jul 01 '25 11:07 noti0na1

> As a compiler, we should still try our best to generate efficient, non-redundant code.

Not really. Our job is to generate code that the next compiler in line will be able to optimize. If you're generating assembly, you want to generate code that the processor will be happy about (no unpredictable branches, for example). If you're generating JVM bytecode, you want to generate code that the JVM will be happy about. There is a whole chain of compilers to think about.

> This PR doesn’t add extra complexity; instead, it removes unnecessary work introduced by an earlier design decision.

It generates code that is simpler. But the compiler code is definitely more complex. The increase in lines of code in the compiler is clear. These are new code paths that behave in a special way in special situations.

If we are making the compiler more complex, but we are not measurably improving performance, it's a net loss.

sjrd avatar Jul 01 '25 13:07 sjrd

We could maybe benchmark on Scala Native? This might be easier to do. And if we get a net win there it would be a justification.

odersky avatar Jul 01 '25 13:07 odersky

As a chain of compilers, I think at each stage we should make a best effort to generate code that makes "sense". We should not generate a bunch of nonsense allocations and boxing and rely on a black box to "optimize" every mistake.

So this is a matter of principle for me, and this PR partially fixes the problem; it's not about how many lines are added to the codebase.

noti0na1 avatar Jul 01 '25 13:07 noti0na1

We could benchmark this with JMH to monitor the number of tuple allocations and the memory usage, if someone can help?

noti0na1 avatar Jul 01 '25 13:07 noti0na1

My main issue with JMH is that JVM optimization is holistic - it might be that we see no win on some code bases and big wins on others, and we would have no insight into why. So instead of relying on JMH we should probably look directly at what kind of assembly the JIT produces for this. That's why I think Scala Native would be much easier to evaluate.

  • Martin

On Tue, Jul 1, 2025 at 3:37 PM noti0na1 @.***> wrote:

noti0na1 left a comment (scala/scala3#23373) https://github.com/scala/scala3/pull/23373#issuecomment-3024063125

We may benchmark this using jmh, if someone can help?

odersky avatar Jul 01 '25 13:07 odersky

OK, I was able to write some recursive code using tuples to stress-test this, and my PR gives a measurable performance improvement.

// P(n) = P(n-2) + P(n-3)
def padovan(n: Int, v: (Int, Int, Int) = (1, 0, 0)): Int =
  val (v_n_3, v_n_2, v_n_1) = v
  if n == 0 then v_n_3
  else padovan(n - 1, (v_n_2, v_n_1, v_n_2 + v_n_3))


@main def Test =
  val n = 1000000000
  val i = padovan(n)
  println(s"padovan of $n is $i")

We can ignore the result because of overflow.

Without this PR: ~7.8s
With this PR: ~5.2s

The results are from many runs.
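For anyone who wants to repeat the comparison, a rough wall-clock harness could look like the sketch below. It is not how the numbers above were produced (the thread does not say), and it is not a substitute for JMH. The object name, the timeMs helper, the smaller n, and the five repetitions are all assumptions chosen to keep the sketch quick.

```scala
import scala.annotation.tailrec

object PadovanTiming:
  // Same recurrence as above: P(n) = P(n-2) + P(n-3), with a tuple as the loop state.
  @tailrec
  def padovan(n: Int, v: (Int, Int, Int) = (1, 0, 0)): Int =
    val (v_n_3, v_n_2, v_n_1) = v
    if n == 0 then v_n_3
    else padovan(n - 1, (v_n_2, v_n_1, v_n_2 + v_n_3))

  // Wall-clock time of one evaluation of `body`, in milliseconds (hypothetical helper).
  def timeMs[A](body: => A): (A, Long) =
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)

  @main def runPadovanTiming(): Unit =
    val n = 100000000 // smaller than the original run, to keep the sketch quick
    for run <- 1 to 5 do
      val (result, ms) = timeMs(padovan(n))
      println(s"run $run: padovan($n) = $result in $ms ms")
```

Compiling the same file once with and once without the PR, then comparing the later runs (after JIT warm-up), gives a first impression; the printed value overflows for large n and can be ignored, as noted above.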

noti0na1 avatar Jul 01 '25 14:07 noti0na1

@odersky @sjrd @noti0na1 With openjdk-23.0.1, compare the two functions below:

object JitTest:
  def f2(x0: Int, x1: Int, x2: Int, x3: Int): Int =
    val (y0, y1, y2, y3) = (x0 * 2, x1 * 3, x2 * 5, x3 * 7)

    y0 + y1 + y2 + y3

  def f4(x0: Int, x1: Int, x2: Int, x3: Int): Int =
    val y0 = x0 * 2
    val y1 = x1 * 3
    val y2 = x2 * 5
    val y3 = x3 * 7

    y0 + y1 + y2 + y3

HotSpot (C2) output (after millions of runs):

For 'f2':

[Instructions begin]
  0x000002261aedd580:   xchg   %ax,%ax
[Entry Point]
  # {method} {0x000002266e977ce0} 'f2' '(IIII)I' in 'app/JitTest$'
  # this:     rdx:rdx   = 'app/JitTest$'
  # parm0:    r8        = int
  # parm1:    r9        = int
  # parm2:    rdi       = int
  # parm3:    rsi       = int
  #           [sp+0x20]  (sp of caller)
  0x000002261aedd582:   mov    0x8(%rdx),%r10d
  0x000002261aedd586:   cmp    0x8(%rax),%r10d
  0x000002261aedd58a:   jne    0x000002261adee4e0           ;   {runtime_call ic_miss_stub}
[Verified Entry Point]
  0x000002261aedd590:   sub    $0x18,%rsp
  0x000002261aedd597:   mov    %rbp,0x10(%rsp)
  0x000002261aedd59c:   cmpl   $0x0,0x20(%r15)
  0x000002261aedd5a4:   jne    0x000002261aedd691           ;*synchronization entry
                                                            ; - app.JitTest$::f2@-1 (line 10)
  0x000002261aedd5aa:   shl    %r8d
  0x000002261aedd5ad:   lea    0x80(%r8),%r10d
  0x000002261aedd5b4:   movabs $0x601804ab0,%rcx            ;   {oop(a 'java/lang/Integer'[256] {0x0000000601804ab0})}
  0x000002261aedd5be:   cmp    $0x100,%r10d
  0x000002261aedd5c5:   jb     0x000002261aedd651           ;*invokestatic valueOf {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
                                                            ; - app.JitTest$::f2@6 (line 10)
  0x000002261aedd5cb:   lea    (%r9,%r9,2),%eax             ;*imul {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - app.JitTest$::f2@11 (line 10)
  0x000002261aedd5cf:   lea    0x80(%rax),%r11d
  0x000002261aedd5d6:   cmp    $0x100,%r11d
  0x000002261aedd5dd:   jb     0x000002261aedd666           ;*invokestatic valueOf {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
                                                            ; - app.JitTest$::f2@6 (line 10)
  0x000002261aedd5e3:   lea    (%rdi,%rdi,4),%r11d          ;*imul {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - app.JitTest$::f2@17 (line 10)
  0x000002261aedd5e7:   lea    0x80(%r11),%r10d
  0x000002261aedd5ee:   cmp    $0x100,%r10d
  0x000002261aedd5f5:   jb     0x000002261aedd63f
  0x000002261aedd5f7:   lea    0x0(,%rsi,8),%r10d
  0x000002261aedd5ff:   sub    %esi,%r10d
  0x000002261aedd602:   lea    0x80(%r10),%ebx
  0x000002261aedd609:   cmp    $0x100,%ebx
  0x000002261aedd60f:   jb     0x000002261aedd62d
  0x000002261aedd611:   add    %r8d,%eax
  0x000002261aedd614:   add    %r11d,%eax
  0x000002261aedd617:   add    %r10d,%eax
  0x000002261aedd61a:   add    $0x10,%rsp
  0x000002261aedd61e:   pop    %rbp
  0x000002261aedd61f:   cmp    0x448(%r15),%rsp             ;   {poll_return}
  0x000002261aedd626:   ja     0x000002261aedd67b
  0x000002261aedd62c:   ret    
  0x000002261aedd62d:   movslq %r10d,%r10                   ;*invokestatic valueOf {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
                                                            ; - app.JitTest$::f2@6 (line 10)
  0x000002261aedd630:   mov    0x210(%rcx,%r10,4),%r10d     ;*aaload {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.lang.Integer::valueOf@21 (line 1018)
                                                            ; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
                                                            ; - app.JitTest$::f2@26 (line 10)
  0x000002261aedd638:   mov    0xc(%r12,%r10,8),%r10d       ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.lang.Integer::intValue@1 (line 1092)
                                                            ; - scala.runtime.BoxesRunTime::unboxToInt@12 (line 99)
                                                            ; - app.JitTest$::f2@69 (line 10)
  0x000002261aedd63d:   jmp    0x000002261aedd611
  0x000002261aedd63f:   movslq %r11d,%r10                   ;*invokestatic valueOf {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
                                                            ; - app.JitTest$::f2@6 (line 10)
  0x000002261aedd642:   mov    0x210(%rcx,%r10,4),%r11d     ;*aaload {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.lang.Integer::valueOf@21 (line 1018)
                                                            ; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
                                                            ; - app.JitTest$::f2@18 (line 10)
  0x000002261aedd64a:   mov    0xc(%r12,%r11,8),%r11d       ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.lang.Integer::intValue@1 (line 1092)
                                                            ; - scala.runtime.BoxesRunTime::unboxToInt@12 (line 99)
                                                            ; - app.JitTest$::f2@59 (line 10)
  0x000002261aedd64f:   jmp    0x000002261aedd5f7
  0x000002261aedd651:   movslq %r8d,%r10                    ;*invokestatic valueOf {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
                                                            ; - app.JitTest$::f2@6 (line 10)
  0x000002261aedd654:   mov    0x210(%rcx,%r10,4),%r11d     ;*aaload {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.lang.Integer::valueOf@21 (line 1018)
                                                            ; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
                                                            ; - app.JitTest$::f2@6 (line 10)
  0x000002261aedd65c:   mov    0xc(%r12,%r11,8),%r8d        ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.lang.Integer::intValue@1 (line 1092)
                                                            ; - scala.runtime.BoxesRunTime::unboxToInt@12 (line 99)
                                                            ; - app.JitTest$::f2@39 (line 10)
  0x000002261aedd661:   jmp    0x000002261aedd5cb
  0x000002261aedd666:   movslq %eax,%r10                    ;*invokestatic valueOf {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
                                                            ; - app.JitTest$::f2@6 (line 10)
  0x000002261aedd669:   mov    0x210(%rcx,%r10,4),%r10d     ;*aaload {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.lang.Integer::valueOf@21 (line 1018)
                                                            ; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
                                                            ; - app.JitTest$::f2@12 (line 10)
  0x000002261aedd671:   mov    0xc(%r12,%r10,8),%eax        ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.lang.Integer::intValue@1 (line 1092)
                                                            ; - scala.runtime.BoxesRunTime::unboxToInt@12 (line 99)
                                                            ; - app.JitTest$::f2@49 (line 10)
  0x000002261aedd676:   jmp    0x000002261aedd5e3           ;*invokestatic valueOf {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
                                                            ; - app.JitTest$::f2@6 (line 10)
  0x000002261aedd67b:   movabs $0x2261aedd61f,%r10          ;   {internal_word}
  0x000002261aedd685:   mov    %r10,0x460(%r15)
  0x000002261aedd68c:   jmp    0x000002261adf53e0           ;   {runtime_call SafepointBlob}
  0x000002261aedd691:   call   Stub::nmethod_entry_barrier  ;   {runtime_call StubRoutines (final stubs)}
  0x000002261aedd696:   jmp    0x000002261aedd5aa
  0x000002261aedd69b:   hlt    
  0x000002261aedd69c:   hlt    
  0x000002261aedd69d:   hlt    
  0x000002261aedd69e:   hlt    
  0x000002261aedd69f:   hlt    
[Exception Handler]
  0x000002261aedd6a0:   jmp    0x000002261ae27160           ;   {no_reloc}
[Deopt Handler Code]
  0x000002261aedd6a5:   call   0x000002261aedd6aa
  0x000002261aedd6aa:   subq   $0x5,(%rsp)
  0x000002261aedd6af:   jmp    0x000002261adf4680           ;   {runtime_call DeoptimizationBlob}
  0x000002261aedd6b4:   hlt    
  0x000002261aedd6b5:   hlt    
  0x000002261aedd6b6:   hlt    
  0x000002261aedd6b7:   hlt    
--------------------------------------------------------------------------------
[/Disassembly]

And for f4:

[Instructions begin]
  0x000002261aef9a00:   xchg   %ax,%ax
[Entry Point]
  # {method} {0x000002266e977f40} 'f4' '(IIII)I' in 'app/JitTest$'
  # this:     rdx:rdx   = 'app/JitTest$'
  # parm0:    r8        = int
  # parm1:    r9        = int
  # parm2:    rdi       = int
  # parm3:    rsi       = int
  #           [sp+0x20]  (sp of caller)
  0x000002261aef9a02:   mov    0x8(%rdx),%r10d
  0x000002261aef9a06:   cmp    0x8(%rax),%r10d
  0x000002261aef9a0a:   jne    0x000002261adee4e0           ;   {runtime_call ic_miss_stub}
[Verified Entry Point]
  0x000002261aef9a10:   sub    $0x18,%rsp
  0x000002261aef9a17:   mov    %rbp,0x10(%rsp)
  0x000002261aef9a1c:   cmpl   $0x0,0x20(%r15)
  0x000002261aef9a24:   jne    0x000002261aef9a6e           ;*synchronization entry
                                                            ; - app.JitTest$::f4@-1 (line 23)
  0x000002261aef9a2a:   lea    (%r9,%r9,2),%r11d
  0x000002261aef9a2e:   lea    (%rdi,%rdi,4),%r10d
  0x000002261aef9a32:   lea    (%r11,%r8,2),%r8d
  0x000002261aef9a36:   add    %r8d,%r10d
  0x000002261aef9a39:   lea    0x0(,%rsi,8),%eax
  0x000002261aef9a40:   sub    %esi,%eax
  0x000002261aef9a42:   add    %r10d,%eax                   ;*iadd {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - app.JitTest$::f4@32 (line 28)
  0x000002261aef9a45:   add    $0x10,%rsp
  0x000002261aef9a49:   pop    %rbp
  0x000002261aef9a4a:   cmp    0x448(%r15),%rsp             ;   {poll_return}
  0x000002261aef9a51:   ja     0x000002261aef9a58
  0x000002261aef9a57:   ret    
  0x000002261aef9a58:   movabs $0x2261aef9a4a,%r10          ;   {internal_word}
  0x000002261aef9a62:   mov    %r10,0x460(%r15)
  0x000002261aef9a69:   jmp    0x000002261adf53e0           ;   {runtime_call SafepointBlob}
  0x000002261aef9a6e:   call   Stub::nmethod_entry_barrier  ;   {runtime_call StubRoutines (final stubs)}
  0x000002261aef9a73:   jmp    0x000002261aef9a2a
[Exception Handler]
  0x000002261aef9a78:   jmp    0x000002261ae27160           ;   {no_reloc}
[Deopt Handler Code]
  0x000002261aef9a7d:   call   0x000002261aef9a82
  0x000002261aef9a82:   subq   $0x5,(%rsp)
  0x000002261aef9a87:   jmp    0x000002261adf4680           ;   {runtime_call DeoptimizationBlob}
  0x000002261aef9a8c:   hlt    
  0x000002261aef9a8d:   hlt    
  0x000002261aef9a8e:   hlt    
  0x000002261aef9a8f:   hlt    
--------------------------------------------------------------------------------
[/Disassembly]

It's not even close. In f2 we allocate, box, and unbox; it's all over the place. You just can't rely on HotSpot to always clean up. My vote is to merge.
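The thread does not say how the listings above were captured. One common way, offered here as an assumption, is to run a warm-up driver with the HotSpot flags -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly, which need the hsdis disassembler library to be installed. A minimal driver sketch follows; the package app from the listing is omitted, and the loop count and the warmUpJitTest name are arbitrary choices.

```scala
object JitTest:
  // Tuple-pattern version, as in the comment above.
  def f2(x0: Int, x1: Int, x2: Int, x3: Int): Int =
    val (y0, y1, y2, y3) = (x0 * 2, x1 * 3, x2 * 5, x3 * 7)
    y0 + y1 + y2 + y3

  // Plain-vals version, as in the comment above.
  def f4(x0: Int, x1: Int, x2: Int, x3: Int): Int =
    val y0 = x0 * 2
    val y1 = x1 * 3
    val y2 = x2 * 5
    val y3 = x3 * 7
    y0 + y1 + y2 + y3

// Calls both functions often enough that the JIT is likely to compile them.
@main def warmUpJitTest(): Unit =
  var sum2 = 0
  var sum4 = 0
  var i = 0
  while i < 20000000 do
    sum2 += JitTest.f2(i, i + 1, i + 2, i + 3)
    sum4 += JitTest.f4(i, i + 1, i + 2, i + 3)
    i += 1
  // Both sums are kept and printed so that the calls are not dead code.
  println(sum2 == sum4)
```

Inlining into the loop can change which compiled bodies show up in the output, so the listing for a standalone f2 or f4 may need to be picked out of a longer log.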

nmichael44 avatar Jul 01 '25 15:07 nmichael44

Another case is val (a, b) = (b, a); I hope this kind of case can be optimized too.

He-Pin avatar Jul 02 '25 03:07 He-Pin

@noti0na1 Hope this gets backported to 3.3.x too.

He-Pin avatar Jul 10 '25 03:07 He-Pin

> @noti0na1 Hope this gets backported to 3.3.x too.

We will try, but it might be hard to do. I will look into this after the 20th. The new LTS is expected around the same time as 3.7.3.

tgodzik avatar Jul 10 '25 07:07 tgodzik

It looks like this breaks in LTS; it's probably an interaction with the changes to named tuples.

tgodzik avatar Jul 23 '25 12:07 tgodzik

I don't think this would be a candidate for LTS backporting.

odersky avatar Jul 23 '25 12:07 odersky