Fix #23224: Optimize simple tuple extraction
Fix #23224:
This PR optimizes simple tuple extraction by avoiding unnecessary tuple allocations and refines the typing of bind patterns for named tuples.
- Update `typedBind` to use `pt` if the pattern is a named tuple.
- Optimise `makePatDef` to reduce tuple creation when a pattern uses only simple variables or wildcards.
For example:
def f1: (Int, Int, Int) = (1, 2, 3)
def test1 =
  val (a, b, c) = f1
  a + b + c
Before this PR:
val $1$: (Int, Int, Int) =
  this.f1:(Int, Int, Int) @unchecked match
    {
      case Tuple3.unapply[Int, Int, Int](a @ _, b @ _, c @ _) =>
        Tuple3.apply[Int, Int, Int](a, b, c)
    }
val a: Int = $1$._1
val b: Int = $1$._2
val c: Int = $1$._3
a + b + c
After this PR:
val $2$: (Int, Int, Int) =
  this.f1:(Int, Int, Int) @unchecked match
    {
      case $1$ @ Tuple3.unapply[Int, Int, Int](_, _, _) =>
        $1$:(Int, Int, Int)
    }
val a: Int = $2$._1
val b: Int = $2$._2
val c: Int = $2$._3
a + b + c
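At the source level, the difference between the two trees above is roughly the difference between the two hand-written matches below. This is only an illustrative sketch: `DesugarSketch`, `rebuilt`, `aliased`, and `sum` are hypothetical names, and the compiler produces the trees shown above, not this code.

```scala
object DesugarSketch:
  def f1: (Int, Int, Int) = (1, 2, 3)

  // Shape of the old desugaring: bind every component, then build a new tuple
  // from the bound variables.
  def rebuilt(t: (Int, Int, Int)): (Int, Int, Int) =
    (t: @unchecked) match
      case (a, b, c) => (a, b, c)

  // Shape of the new desugaring: alias the scrutinee and return it unchanged,
  // so no second tuple is constructed.
  def aliased(t: (Int, Int, Int)): (Int, Int, Int) =
    (t: @unchecked) match
      case whole @ (_, _, _) => whole

  // The components are then read from the resulting tuple, as in the trees above.
  def sum(t: (Int, Int, Int)): Int = t._1 + t._2 + t._3
```

Both functions return an equal tuple, but `aliased` hands back the very same object it received, which is the property the new desugaring relies on.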
The tree at genBCode is now:
val $2$: Tuple3 =
  matchResult1[Tuple3]:
    {
      case val x1: Tuple3 = this.f1():Tuple3
      if x1 ne null then
        {
          case val $1$: Tuple3 = x1
          return[matchResult1] $1$:Tuple3
        }
      else ()
      throw new MatchError(x1)
    }
val a: Int = Int.unbox($2$._1())
val b: Int = Int.unbox($2$._2())
val c: Int = Int.unbox($2$._3())
a + b + c
I searched the compiler with the regular expression (val\s*\(\s*[a-zA-Z_]\w*(\s*,\s*[a-zA-Z_]\w*)*\s*\)\s*=) and found 400+ places that are simple tuple extractions like this.
Split some changes into a separate PR, #23380, to diagnose the errors.
Added a bytecode test to ensure there is no tuple creation in the generated code.
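To show what the search pattern above does and does not pick up, here is a small sketch that applies the same regular expression to a few sample lines. The object name, the helper `isSimpleTupleExtraction`, and the sample lines are made up for illustration.

```scala
object SimpleTuplePattern:
  // The regular expression quoted above, unchanged.
  val regex = """(val\s*\(\s*[a-zA-Z_]\w*(\s*,\s*[a-zA-Z_]\w*)*\s*\)\s*=)""".r

  // True if the line contains a `val (x, y, ...) =` definition whose pattern
  // consists only of plain identifiers or `_`.
  def isSimpleTupleExtraction(line: String): Boolean =
    regex.findFirstIn(line).isDefined
```

With this helper, `val (a, b, c) = f1` and `val (x, _) = pair` match, while nested patterns such as `val (a, (b, c)) = nested` and typed patterns such as `val (a: Int, b) = p` do not.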
Absolutely, the JVM can often optimize many cases when it proves that a Tuple is pure and its temporary object doesn't escape. As a compiler, we should still try our best to generate the most efficient and non-redundant code possible.
Consider extracting values from a tuple that contains both an Int and a String. Previously, this required unboxing the Int twice, boxing it again, and creating a throwaway tuple, all for no benefit. This PR doesn’t add extra complexity; instead, it removes unnecessary work introduced by an earlier design decision.
I chose to optimize tuple handling because tuples are widely used to return multiple values, including in the compiler itself (see the regex search results in the codebase). Ideally, we would extend this optimization to other data structures, but that is hard to do before typing.
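To make the mixed Int/String case concrete, here is a small sketch of the kind of definition being discussed, next to a hand-written version that reads the fields straight from the tuple. The names `MixedTuple`, `pair`, `viaPattern`, and `viaFields` are hypothetical; the sketch only shows that the two forms are observably equivalent, not what bytecode either one compiles to.

```scala
object MixedTuple:
  def pair: (Int, String) = (42, "answer")

  // Pattern-bound extraction: the form whose desugaring this PR changes.
  def viaPattern: String =
    val (n, s) = pair
    s"$s=$n"

  // Hand-written equivalent: keep the tuple and read its fields directly.
  def viaFields: String =
    val p = pair
    val n = p._1
    val s = p._2
    s"$s=$n"
```

Since both forms compute the same result, any extra tuple built between `pair` and the bindings of `n` and `s` is pure overhead.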
> As a compiler, we should still try our best to generate the most efficient and non-redundant code possible.
Not really. Our job is to generate code that the next compiler in line will be able to optimize. If you're generating assembly, you want to generate code that the processor will be happy about (no unpredictable branches, for example). If you're generating JVM bytecode, you want to generate code that the JVM will be happy about. There is a whole chain of compilers to think about.
> This PR doesn’t add extra complexity; instead, it removes unnecessary work introduced by an earlier design decision.
It generates code that is simpler. But the compiler code is definitely more complex. The increase in lines of code in the compiler is clear. These are new code paths that behave in a special way in special situations.
If we are making the compiler more complex, but we are not measurably improving performance, it's a net loss.
We could maybe benchmark on Scala Native? This might be easier to do. And if we get a net win there it would be a justification.
As a chain of compilers, I think that at each stage we should make a best effort to generate code that makes sense. We should not generate a bunch of nonsensical allocations and boxing and rely on a black box to "optimize" every mistake.
So this is a matter of principle for me, and this PR partially fixes the problem; it is not about how many lines are added to the codebase.
We could benchmark this using jmh to monitor the number of tuple allocations and the memory usage, if someone can help?
My main issue with jmh is that JVM optimization is holistic: it might be that we see no win on some code bases and big wins on others, and we have no insight into why. So instead of relying on jmh we should probably look directly at what kind of assembly the JIT produces for this. That's why I think Scala Native would be much easier to evaluate.
- Martin
On Tue, Jul 1, 2025 at 3:37 PM noti0na1 @.***> wrote:
noti0na1 left a comment (scala/scala3#23373) https://github.com/scala/scala3/pull/23373#issuecomment-3024063125
We may benchmark this using jmh, if someone can help?
--
Martin Odersky Professor, Programming Methods Group (LAMP) Faculty IC, EPFL Station 14, Lausanne, Switzerland
OK, I was able to write some recursive code with tuples to stress-test this, and my PR shows a measurable performance improvement.
// P(n) = P(n-2) + P(n-3)
def padovan(n: Int, v: (Int, Int, Int) = (1, 0, 0)): Int =
  val (v_n_3, v_n_2, v_n_1) = v
  if n == 0 then v_n_3
  else padovan(n - 1, (v_n_2, v_n_1, v_n_2 + v_n_3))
@main def Test =
  val n = 1000000000
  val i = padovan(n)
  println(s"padovan of $n is $i")
The result itself can be ignored because it overflows.
Without this PR: ~7.8s
With this PR: ~5.2s
The timings are from many runs.
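For anyone who wants to repeat the measurement, here is a minimal sketch of a wall-clock timing harness around the same function. `PadovanTiming`, `timeMillis`, and `run` are hypothetical helpers, not the setup used for the numbers above, and a plain `System.nanoTime` loop includes JIT warm-up in the early rounds.

```scala
import scala.annotation.tailrec

object PadovanTiming:
  // Same recurrence as above: P(n) = P(n-2) + P(n-3), seeded with (1, 0, 0).
  @tailrec
  def padovan(n: Int, v: (Int, Int, Int) = (1, 0, 0)): Int =
    val (v_n_3, v_n_2, v_n_1) = v
    if n == 0 then v_n_3
    else padovan(n - 1, (v_n_2, v_n_1, v_n_2 + v_n_3))

  // Hypothetical helper: result and wall-clock time of one evaluation, in ms.
  def timeMillis[A](body: => A): (A, Long) =
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)

  // Runs the computation `rounds` times and returns the time of each round.
  def run(n: Int, rounds: Int): Seq[Long] =
    (1 to rounds).map(_ => timeMillis(padovan(n))._2)
```

Calling `PadovanTiming.run(1000000000, 5)` once with a compiler built from this PR and once without it would give two series of timings to compare; discarding the first round or two is a common way to reduce warm-up noise.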
@odersky @sjrd @noti0na1 I compared the two functions below with openjdk-23.0.1:
object JitTest:
  def f2(x0: Int, x1: Int, x2: Int, x3: Int): Int =
    val (y0, y1, y2, y3) = (x0 * 2, x1 * 3, x2 * 5, x3 * 7)
    y0 + y1 + y2 + y3
  def f4(x0: Int, x1: Int, x2: Int, x3: Int): Int =
    val y0 = x0 * 2
    val y1 = x1 * 3
    val y2 = x2 * 5
    val y3 = x3 * 7
    y0 + y1 + y2 + y3
HotSpot (C2) output (after millions of runs):
For 'f2':
[Instructions begin]
0x000002261aedd580: xchg %ax,%ax
[Entry Point]
# {method} {0x000002266e977ce0} 'f2' '(IIII)I' in 'app/JitTest$'
# this: rdx:rdx = 'app/JitTest$'
# parm0: r8 = int
# parm1: r9 = int
# parm2: rdi = int
# parm3: rsi = int
# [sp+0x20] (sp of caller)
0x000002261aedd582: mov 0x8(%rdx),%r10d
0x000002261aedd586: cmp 0x8(%rax),%r10d
0x000002261aedd58a: jne 0x000002261adee4e0 ; {runtime_call ic_miss_stub}
[Verified Entry Point]
0x000002261aedd590: sub $0x18,%rsp
0x000002261aedd597: mov %rbp,0x10(%rsp)
0x000002261aedd59c: cmpl $0x0,0x20(%r15)
0x000002261aedd5a4: jne 0x000002261aedd691 ;*synchronization entry
; - app.JitTest$::f2@-1 (line 10)
0x000002261aedd5aa: shl %r8d
0x000002261aedd5ad: lea 0x80(%r8),%r10d
0x000002261aedd5b4: movabs $0x601804ab0,%rcx ; {oop(a 'java/lang/Integer'[256] {0x0000000601804ab0})}
0x000002261aedd5be: cmp $0x100,%r10d
0x000002261aedd5c5: jb 0x000002261aedd651 ;*invokestatic valueOf {reexecute=0 rethrow=0 return_oop=0}
; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
; - app.JitTest$::f2@6 (line 10)
0x000002261aedd5cb: lea (%r9,%r9,2),%eax ;*imul {reexecute=0 rethrow=0 return_oop=0}
; - app.JitTest$::f2@11 (line 10)
0x000002261aedd5cf: lea 0x80(%rax),%r11d
0x000002261aedd5d6: cmp $0x100,%r11d
0x000002261aedd5dd: jb 0x000002261aedd666 ;*invokestatic valueOf {reexecute=0 rethrow=0 return_oop=0}
; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
; - app.JitTest$::f2@6 (line 10)
0x000002261aedd5e3: lea (%rdi,%rdi,4),%r11d ;*imul {reexecute=0 rethrow=0 return_oop=0}
; - app.JitTest$::f2@17 (line 10)
0x000002261aedd5e7: lea 0x80(%r11),%r10d
0x000002261aedd5ee: cmp $0x100,%r10d
0x000002261aedd5f5: jb 0x000002261aedd63f
0x000002261aedd5f7: lea 0x0(,%rsi,8),%r10d
0x000002261aedd5ff: sub %esi,%r10d
0x000002261aedd602: lea 0x80(%r10),%ebx
0x000002261aedd609: cmp $0x100,%ebx
0x000002261aedd60f: jb 0x000002261aedd62d
0x000002261aedd611: add %r8d,%eax
0x000002261aedd614: add %r11d,%eax
0x000002261aedd617: add %r10d,%eax
0x000002261aedd61a: add $0x10,%rsp
0x000002261aedd61e: pop %rbp
0x000002261aedd61f: cmp 0x448(%r15),%rsp ; {poll_return}
0x000002261aedd626: ja 0x000002261aedd67b
0x000002261aedd62c: ret
0x000002261aedd62d: movslq %r10d,%r10 ;*invokestatic valueOf {reexecute=0 rethrow=0 return_oop=0}
; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
; - app.JitTest$::f2@6 (line 10)
0x000002261aedd630: mov 0x210(%rcx,%r10,4),%r10d ;*aaload {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.Integer::valueOf@21 (line 1018)
; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
; - app.JitTest$::f2@26 (line 10)
0x000002261aedd638: mov 0xc(%r12,%r10,8),%r10d ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.Integer::intValue@1 (line 1092)
; - scala.runtime.BoxesRunTime::unboxToInt@12 (line 99)
; - app.JitTest$::f2@69 (line 10)
0x000002261aedd63d: jmp 0x000002261aedd611
0x000002261aedd63f: movslq %r11d,%r10 ;*invokestatic valueOf {reexecute=0 rethrow=0 return_oop=0}
; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
; - app.JitTest$::f2@6 (line 10)
0x000002261aedd642: mov 0x210(%rcx,%r10,4),%r11d ;*aaload {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.Integer::valueOf@21 (line 1018)
; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
; - app.JitTest$::f2@18 (line 10)
0x000002261aedd64a: mov 0xc(%r12,%r11,8),%r11d ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.Integer::intValue@1 (line 1092)
; - scala.runtime.BoxesRunTime::unboxToInt@12 (line 99)
; - app.JitTest$::f2@59 (line 10)
0x000002261aedd64f: jmp 0x000002261aedd5f7
0x000002261aedd651: movslq %r8d,%r10 ;*invokestatic valueOf {reexecute=0 rethrow=0 return_oop=0}
; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
; - app.JitTest$::f2@6 (line 10)
0x000002261aedd654: mov 0x210(%rcx,%r10,4),%r11d ;*aaload {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.Integer::valueOf@21 (line 1018)
; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
; - app.JitTest$::f2@6 (line 10)
0x000002261aedd65c: mov 0xc(%r12,%r11,8),%r8d ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.Integer::intValue@1 (line 1092)
; - scala.runtime.BoxesRunTime::unboxToInt@12 (line 99)
; - app.JitTest$::f2@39 (line 10)
0x000002261aedd661: jmp 0x000002261aedd5cb
0x000002261aedd666: movslq %eax,%r10 ;*invokestatic valueOf {reexecute=0 rethrow=0 return_oop=0}
; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
; - app.JitTest$::f2@6 (line 10)
0x000002261aedd669: mov 0x210(%rcx,%r10,4),%r10d ;*aaload {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.Integer::valueOf@21 (line 1018)
; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
; - app.JitTest$::f2@12 (line 10)
0x000002261aedd671: mov 0xc(%r12,%r10,8),%eax ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
; - java.lang.Integer::intValue@1 (line 1092)
; - scala.runtime.BoxesRunTime::unboxToInt@12 (line 99)
; - app.JitTest$::f2@49 (line 10)
0x000002261aedd676: jmp 0x000002261aedd5e3 ;*invokestatic valueOf {reexecute=0 rethrow=0 return_oop=0}
; - scala.runtime.BoxesRunTime::boxToInteger@1 (line 63)
; - app.JitTest$::f2@6 (line 10)
0x000002261aedd67b: movabs $0x2261aedd61f,%r10 ; {internal_word}
0x000002261aedd685: mov %r10,0x460(%r15)
0x000002261aedd68c: jmp 0x000002261adf53e0 ; {runtime_call SafepointBlob}
0x000002261aedd691: call Stub::nmethod_entry_barrier ; {runtime_call StubRoutines (final stubs)}
0x000002261aedd696: jmp 0x000002261aedd5aa
0x000002261aedd69b: hlt
0x000002261aedd69c: hlt
0x000002261aedd69d: hlt
0x000002261aedd69e: hlt
0x000002261aedd69f: hlt
[Exception Handler]
0x000002261aedd6a0: jmp 0x000002261ae27160 ; {no_reloc}
[Deopt Handler Code]
0x000002261aedd6a5: call 0x000002261aedd6aa
0x000002261aedd6aa: subq $0x5,(%rsp)
0x000002261aedd6af: jmp 0x000002261adf4680 ; {runtime_call DeoptimizationBlob}
0x000002261aedd6b4: hlt
0x000002261aedd6b5: hlt
0x000002261aedd6b6: hlt
0x000002261aedd6b7: hlt
--------------------------------------------------------------------------------
[/Disassembly]
And for f4:
[Instructions begin]
0x000002261aef9a00: xchg %ax,%ax
[Entry Point]
# {method} {0x000002266e977f40} 'f4' '(IIII)I' in 'app/JitTest$'
# this: rdx:rdx = 'app/JitTest$'
# parm0: r8 = int
# parm1: r9 = int
# parm2: rdi = int
# parm3: rsi = int
# [sp+0x20] (sp of caller)
0x000002261aef9a02: mov 0x8(%rdx),%r10d
0x000002261aef9a06: cmp 0x8(%rax),%r10d
0x000002261aef9a0a: jne 0x000002261adee4e0 ; {runtime_call ic_miss_stub}
[Verified Entry Point]
0x000002261aef9a10: sub $0x18,%rsp
0x000002261aef9a17: mov %rbp,0x10(%rsp)
0x000002261aef9a1c: cmpl $0x0,0x20(%r15)
0x000002261aef9a24: jne 0x000002261aef9a6e ;*synchronization entry
; - app.JitTest$::f4@-1 (line 23)
0x000002261aef9a2a: lea (%r9,%r9,2),%r11d
0x000002261aef9a2e: lea (%rdi,%rdi,4),%r10d
0x000002261aef9a32: lea (%r11,%r8,2),%r8d
0x000002261aef9a36: add %r8d,%r10d
0x000002261aef9a39: lea 0x0(,%rsi,8),%eax
0x000002261aef9a40: sub %esi,%eax
0x000002261aef9a42: add %r10d,%eax ;*iadd {reexecute=0 rethrow=0 return_oop=0}
; - app.JitTest$::f4@32 (line 28)
0x000002261aef9a45: add $0x10,%rsp
0x000002261aef9a49: pop %rbp
0x000002261aef9a4a: cmp 0x448(%r15),%rsp ; {poll_return}
0x000002261aef9a51: ja 0x000002261aef9a58
0x000002261aef9a57: ret
0x000002261aef9a58: movabs $0x2261aef9a4a,%r10 ; {internal_word}
0x000002261aef9a62: mov %r10,0x460(%r15)
0x000002261aef9a69: jmp 0x000002261adf53e0 ; {runtime_call SafepointBlob}
0x000002261aef9a6e: call Stub::nmethod_entry_barrier ; {runtime_call StubRoutines (final stubs)}
0x000002261aef9a73: jmp 0x000002261aef9a2a
[Exception Handler]
0x000002261aef9a78: jmp 0x000002261ae27160 ; {no_reloc}
[Deopt Handler Code]
0x000002261aef9a7d: call 0x000002261aef9a82
0x000002261aef9a82: subq $0x5,(%rsp)
0x000002261aef9a87: jmp 0x000002261adf4680 ; {runtime_call DeoptimizationBlob}
0x000002261aef9a8c: hlt
0x000002261aef9a8d: hlt
0x000002261aef9a8e: hlt
0x000002261aef9a8f: hlt
--------------------------------------------------------------------------------
[/Disassembly]
It's not even close. In f2 we allocate, box, and unbox all over the place. You just can't rely on HotSpot to always clean up. My vote is to merge.
Another case is val (a, b) = (b, a); I hope this kind of case can be optimized too.
@noti0na1 I hope this gets backported to 3.3.x too.
> @noti0na1 I hope this gets backported to 3.3.x too.
We will try, but it might be hard to do. I will look into this after the 20th. The new LTS is expected around the same time as 3.7.3.
It looks like this breaks in LTS; it is probably an interaction with the changes to named tuples.
I don't think this would be a candidate for LTS backporting.