runtime
runtime copied to clipboard
JIT: Reorder stores to make them amenable to stp optimization
This generalizes the indir reordering optimization (that currently only triggers for loads) to kick in for GT_STOREIND nodes.
The main complication with doing this is the fact that the data node of the second indirection needs its own reordering with the previous indirection. The existing logic works by reordering all nodes between the first and second indirection that are unrelated to the second indirection's computation to happen after it. Once that is done we know that there are no uses of the first indirection's result between it and the second indirection, so after doing the necessary interference checks we can safely move the previous indirection to happen after the data node of the second indirection.
Example:
class Body { public double x, y, z, vx, vy, vz, mass; }
static void Advance(double dt, Body[] bodies)
{
foreach (Body b in bodies)
{
b.x += dt * b.vx;
b.y += dt * b.vy;
b.z += dt * b.vz;
}
}
Diff:
@@ -1,18 +1,17 @@
-G_M55007_IG04: ;; offset=0x001C
+G_M55007_IG04: ;; offset=0x0020
ldr x3, [x0, w1, UXTW #3]
ldp d16, d17, [x3, #0x08]
ldp d18, d19, [x3, #0x20]
fmul d18, d0, d18
fadd d16, d16, d18
- str d16, [x3, #0x08]
- fmul d16, d0, d19
- fadd d16, d17, d16
- str d16, [x3, #0x10]
+ fmul d18, d0, d19
+ fadd d17, d17, d18
+ stp d16, d17, [x3, #0x08]
ldr d16, [x3, #0x18]
ldr d17, [x3, #0x30]
fmul d17, d0, d17
fadd d16, d16, d17
str d16, [x3, #0x18]
add w1, w1, #1
cmp w2, w1
bgt G_M55007_IG04
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn
Azure Pipelines successfully started running 3 pipeline(s).
Nice! I see you're hitting similar size regressions that I did too, e.g.:
fmov s16, #1.0000
- stp s16, s16, [x0, #0x08]
+ fmov s17, #1.0000
+ stp s16, s17, [x0, #0x08]
presumably, this should be fixed with "matching constant" in LSRA, but I guess it's currently disabled for SIMD on arm64
Nice! I see you're hitting similar size regressions that I did too, e.g.:
fmov s16, #1.0000 - stp s16, s16, [x0, #0x08] + fmov s17, #1.0000 + stp s16, s17, [x0, #0x08]
presumably, this should be fixed with "matching constant" in LSRA, but I guess it's currently disabled for SIMD on arm64
There is a simple workaround (this PR already has it for integer constants). It just doesn't work for float constants because GenTree::Compare
does not handle GT_CNS_DBL
. I'll fix it in a follow up
/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress
Azure Pipelines successfully started running 2 pipeline(s).
The failures look known.
cc @dotnet/jit-contrib PTAL @EgorBo @kunalspathak
I wonder if there are any worries with stp
similar to the ldp
issue on Apple Silicon. If so I guess we can handle it as part of that.
FYI @a74nh @SwapnilGaikwad
do you mean "before" the data node of the second indirection? Can you share an example or a dump to visualize this?
No, the first indirection needs to happen after the data node of the second indirection so that we don't end up with something between the two indirections (which would break the str,str -> stp
emitter peephole).
Here is the relevant JITDUMP for the example shared above:
[000037] and [000025] are indirs off the same base with offsets +016 and +008
..and they are amenable to ldp/stp optimization
..and they are close. Trying to move the following range (where * are nodes part of the data flow):
N001 ( 1, 1) [000015] ----------- t15 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t15 ref
N003 ( 3, 4) [000073] -c--------- t73 = ▌ LEA(b+8) byref
┌──▌ t73 byref
N004 ( 4, 3) [000017] ---XG------ t17 = ▌ IND double <l:$441, c:$442>
* N001 ( 1, 1) [000027] ----------- t27 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t27 ref
* N003 ( 3, 4) [000079] -c--------- t79 = ▌ LEA(b+16) byref
┌──▌ t79 byref
* N004 ( 4, 3) [000029] n---GO----- t29 = ▌ IND double <l:$44f, c:$450>
N005 ( 1, 1) [000018] ----------- t18 = LCL_VAR double V00 arg0 u:1 $100
N006 ( 1, 1) [000019] ----------- t19 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t19 ref
N008 ( 3, 4) [000075] -c--------- t75 = ▌ LEA(b+32) byref
┌──▌ t75 byref
N009 ( 4, 3) [000021] n---GO----- t21 = ▌ IND double <l:$444, c:$445>
* N006 ( 1, 1) [000031] ----------- t31 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t31 ref
* N008 ( 3, 4) [000081] -c--------- t81 = ▌ LEA(b+40) byref
┌──▌ t81 byref
* N009 ( 4, 3) [000033] n---GO----- t33 = ▌ IND double <l:$452, c:$453>
┌──▌ t18 double
├──▌ t21 double
N010 ( 10, 8) [000022] ----GO----- t22 = ▌ MUL double <l:$449, c:$448>
┌──▌ t17 double
├──▌ t22 double
N011 ( 19, 15) [000023] ---XGO----- t23 = ▌ ADD double <l:$44d, c:$44c>
N012 ( 1, 1) [000014] ----------- t14 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t14 ref
N014 ( 3, 4) [000071] -c--------- t71 = ▌ LEA(b+8) byref
┌──▌ t71 byref
├──▌ t23 double
1. N015 ( 24, 19) [000025] nA-XGO----- ▌ STOREIND double <l:$206, c:$205>
[000109] ----------- IL_OFFSET void INLRT @ 0x01F[E-]
* N005 ( 1, 1) [000030] ----------- t30 = LCL_VAR double V00 arg0 u:1 $100
┌──▌ t30 double
├──▌ t33 double
* N010 ( 10, 8) [000034] ----GO----- t34 = ▌ MUL double <l:$457, c:$456>
┌──▌ t29 double
├──▌ t34 double
* N011 ( 19, 15) [000035] ----GO----- t35 = ▌ ADD double <l:$45b, c:$45a>
* N012 ( 1, 1) [000026] ----------- t26 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t26 ref
* N014 ( 3, 4) [000077] -c--------- t77 = ▌ LEA(b+16) byref
┌──▌ t77 byref
├──▌ t35 double
2. N015 ( 24, 19) [000037] nA--GO----- ▌ STOREIND double <l:$206, c:$205>
Interference checks passed: can move unrelated nodes past second indir.
Moving nodes that are not part of data flow of [000037]
Result:
N001 ( 1, 1) [000015] ----------- t15 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t15 ref
N003 ( 3, 4) [000073] -c--------- t73 = ▌ LEA(b+8) byref
┌──▌ t73 byref
N004 ( 4, 3) [000017] ---XG------ t17 = ▌ IND double <l:$441, c:$442>
* N001 ( 1, 1) [000027] ----------- t27 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t27 ref
* N003 ( 3, 4) [000079] -c--------- t79 = ▌ LEA(b+16) byref
┌──▌ t79 byref
* N004 ( 4, 3) [000029] n---GO----- t29 = ▌ IND double <l:$44f, c:$450>
N005 ( 1, 1) [000018] ----------- t18 = LCL_VAR double V00 arg0 u:1 $100
N006 ( 1, 1) [000019] ----------- t19 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t19 ref
N008 ( 3, 4) [000075] -c--------- t75 = ▌ LEA(b+32) byref
┌──▌ t75 byref
N009 ( 4, 3) [000021] n---GO----- t21 = ▌ IND double <l:$444, c:$445>
* N006 ( 1, 1) [000031] ----------- t31 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t31 ref
* N008 ( 3, 4) [000081] -c--------- t81 = ▌ LEA(b+40) byref
┌──▌ t81 byref
* N009 ( 4, 3) [000033] n---GO----- t33 = ▌ IND double <l:$452, c:$453>
┌──▌ t18 double
├──▌ t21 double
N010 ( 10, 8) [000022] ----GO----- t22 = ▌ MUL double <l:$449, c:$448>
┌──▌ t17 double
├──▌ t22 double
N011 ( 19, 15) [000023] ---XGO----- t23 = ▌ ADD double <l:$44d, c:$44c>
N012 ( 1, 1) [000014] ----------- t14 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t14 ref
N014 ( 3, 4) [000071] -c--------- t71 = ▌ LEA(b+8) byref
* N005 ( 1, 1) [000030] ----------- t30 = LCL_VAR double V00 arg0 u:1 $100
┌──▌ t30 double
├──▌ t33 double
* N010 ( 10, 8) [000034] ----GO----- t34 = ▌ MUL double <l:$457, c:$456>
┌──▌ t29 double
├──▌ t34 double
* N011 ( 19, 15) [000035] ----GO----- t35 = ▌ ADD double <l:$45b, c:$45a>
┌──▌ t71 byref
├──▌ t23 double
1. N015 ( 24, 19) [000025] nA-XGO----- ▌ STOREIND double <l:$206, c:$205>
* N012 ( 1, 1) [000026] ----------- t26 = LCL_VAR ref V04 loc2 u:2 <l:$202, c:$142>
┌──▌ t26 ref
* N014 ( 3, 4) [000077] -c--------- t77 = ▌ LEA(b+16) byref
┌──▌ t77 byref
├──▌ t35 double
2. N015 ( 24, 19) [000037] nA--GO----- ▌ STOREIND double <l:$206, c:$205>
[000109] ----------- IL_OFFSET void INLRT @ 0x01F[E-]
Here you can see that [000025]
was moved past [000035]
, the data node of the second indir.
@kunalspathak Any other feedback or are you ok with merging this as is?
Ah, sorry, I thought I already approved. The JitStress* failures are existing issues I assume?
Yep, they look known.