runtime JIT: Reorder stores to make them amenable to stp optimization

This generalizes the indir reordering optimization (that currently only triggers for loads) to kick in for GT_STOREIND nodes.

The main complication with doing this is the fact that the data node of the second indirection needs its own reordering with the previous indirection. The existing logic works by reordering all nodes between the first and second indirection that are unrelated to the second indirection's computation to happen after it. Once that is done we know that there are no uses of the first indirection's result between it and the second indirection, so after doing the necessary interference checks we can safely move the previous indirection to happen after the data node of the second indirection.

Example:

class Body { public double x, y, z, vx, vy, vz, mass; }

static void Advance(double dt, Body[] bodies)
{
    foreach (Body b in bodies)
    {
        b.x += dt * b.vx;
        b.y += dt * b.vy;
        b.z += dt * b.vz;
    }
}

Diff:

@@ -1,18 +1,17 @@
-G_M55007_IG04:  ;; offset=0x001C
+G_M55007_IG04:  ;; offset=0x0020
             ldr     x3, [x0, w1, UXTW #3]
             ldp     d16, d17, [x3, #0x08]
             ldp     d18, d19, [x3, #0x20]
             fmul    d18, d0, d18
             fadd    d16, d16, d18
-            str     d16, [x3, #0x08]
-            fmul    d16, d0, d19
-            fadd    d16, d17, d16
-            str     d16, [x3, #0x10]
+            fmul    d18, d0, d19
+            fadd    d17, d17, d18
+            stp     d16, d17, [x3, #0x08]
             ldr     d16, [x3, #0x18]
             ldr     d17, [x3, #0x30]
             fmul    d17, d0, d17
             fadd    d16, d16, d17
             str     d16, [x3, #0x18]
             add     w1, w1, #1
             cmp     w2, w1
             bgt     G_M55007_IG04

May 12 '24 14:05 jakobbotsch

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

May 12 '24 14:05 dotnet-policy-service[bot]

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress, Fuzzlyn

May 12 '24 17:05 jakobbotsch

Azure Pipelines successfully started running 3 pipeline(s).

May 12 '24 17:05 azure-pipelines[bot]

Nice! I see you're hitting similar size regressions that I did too, e.g.:

             fmov    s16, #1.0000
-            stp     s16, s16, [x0, #0x08]
+            fmov    s17, #1.0000
+            stp     s16, s17, [x0, #0x08]

presumably, this should be fixed with "matching constant" in LSRA, but I guess it's currently disabled for SIMD on arm64

May 12 '24 18:05 EgorBo

Nice! I see you're hitting similar size regressions that I did too, e.g.:
             fmov    s16, #1.0000
-            stp     s16, s16, [x0, #0x08]
+            fmov    s17, #1.0000
+            stp     s16, s17, [x0, #0x08]
presumably, this should be fixed with "matching constant" in LSRA, but I guess it's currently disabled for SIMD on arm64

There is a simple workaround (this PR already has it for integer constants). It just doesn't work for float constants because GenTree::Compare does not handle GT_CNS_DBL. I'll fix it in a follow up

May 12 '24 18:05 jakobbotsch

/azp run runtime-coreclr jitstress, runtime-coreclr libraries-jitstress

May 13 '24 10:05 jakobbotsch

Azure Pipelines successfully started running 2 pipeline(s).

May 13 '24 10:05 azure-pipelines[bot]

The failures look known.

cc @dotnet/jit-contrib PTAL @EgorBo @kunalspathak

Diffs.

I wonder if there are any worries with stp similar to the ldp issue on Apple Silicon. If so I guess we can handle it as part of that.

FYI @a74nh @SwapnilGaikwad

May 13 '24 17:05 jakobbotsch

do you mean "before" the data node of the second indirection? Can you share an example or a dump to visualize this?

No, the first indirection needs to happen after the data node of the second indirection so that we don't end up with something between the two indirections (which would break the str,str -> stp emitter peephole).

Here is the relevant JITDUMP for the example shared above:

[000037] and [000025] are indirs off the same base with offsets +016 and +008
  ..and they are amenable to ldp/stp optimization
  ..and they are close. Trying to move the following range (where * are nodes part of the data flow):

   N001 (  1,  1) [000015] -----------                   t15 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t15    ref    
   N003 (  3,  4) [000073] -c---------                   t73 = ▌  LEA(b+8)  byref 
                                                               ┌──▌  t73    byref  
   N004 (  4,  3) [000017] ---XG------                   t17 = ▌  IND       double <l:$441, c:$442>
*  N001 (  1,  1) [000027] -----------                   t27 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t27    ref    
*  N003 (  3,  4) [000079] -c---------                   t79 = ▌  LEA(b+16) byref 
                                                               ┌──▌  t79    byref  
*  N004 (  4,  3) [000029] n---GO-----                   t29 = ▌  IND       double <l:$44f, c:$450>
   N005 (  1,  1) [000018] -----------                   t18 =    LCL_VAR   double V00 arg0         u:1 $100
   N006 (  1,  1) [000019] -----------                   t19 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t19    ref    
   N008 (  3,  4) [000075] -c---------                   t75 = ▌  LEA(b+32) byref 
                                                               ┌──▌  t75    byref  
   N009 (  4,  3) [000021] n---GO-----                   t21 = ▌  IND       double <l:$444, c:$445>
*  N006 (  1,  1) [000031] -----------                   t31 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t31    ref    
*  N008 (  3,  4) [000081] -c---------                   t81 = ▌  LEA(b+40) byref 
                                                               ┌──▌  t81    byref  
*  N009 (  4,  3) [000033] n---GO-----                   t33 = ▌  IND       double <l:$452, c:$453>
                                                               ┌──▌  t18    double 
                                                               ├──▌  t21    double 
   N010 ( 10,  8) [000022] ----GO-----                   t22 = ▌  MUL       double <l:$449, c:$448>
                                                               ┌──▌  t17    double 
                                                               ├──▌  t22    double 
   N011 ( 19, 15) [000023] ---XGO-----                   t23 = ▌  ADD       double <l:$44d, c:$44c>
   N012 (  1,  1) [000014] -----------                   t14 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t14    ref    
   N014 (  3,  4) [000071] -c---------                   t71 = ▌  LEA(b+8)  byref 
                                                               ┌──▌  t71    byref  
                                                               ├──▌  t23    double 
1. N015 ( 24, 19) [000025] nA-XGO-----                         ▌  STOREIND  double <l:$206, c:$205>
                  [000109] -----------                            IL_OFFSET void   INLRT @ 0x01F[E-]
*  N005 (  1,  1) [000030] -----------                   t30 =    LCL_VAR   double V00 arg0         u:1 $100
                                                               ┌──▌  t30    double 
                                                               ├──▌  t33    double 
*  N010 ( 10,  8) [000034] ----GO-----                   t34 = ▌  MUL       double <l:$457, c:$456>
                                                               ┌──▌  t29    double 
                                                               ├──▌  t34    double 
*  N011 ( 19, 15) [000035] ----GO-----                   t35 = ▌  ADD       double <l:$45b, c:$45a>
*  N012 (  1,  1) [000026] -----------                   t26 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t26    ref    
*  N014 (  3,  4) [000077] -c---------                   t77 = ▌  LEA(b+16) byref 
                                                               ┌──▌  t77    byref  
                                                               ├──▌  t35    double 
2. N015 ( 24, 19) [000037] nA--GO-----                         ▌  STOREIND  double <l:$206, c:$205>

Interference checks passed: can move unrelated nodes past second indir.
Moving nodes that are not part of data flow of [000037]

Result:

   N001 (  1,  1) [000015] -----------                   t15 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t15    ref    
   N003 (  3,  4) [000073] -c---------                   t73 = ▌  LEA(b+8)  byref 
                                                               ┌──▌  t73    byref  
   N004 (  4,  3) [000017] ---XG------                   t17 = ▌  IND       double <l:$441, c:$442>
*  N001 (  1,  1) [000027] -----------                   t27 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t27    ref    
*  N003 (  3,  4) [000079] -c---------                   t79 = ▌  LEA(b+16) byref 
                                                               ┌──▌  t79    byref  
*  N004 (  4,  3) [000029] n---GO-----                   t29 = ▌  IND       double <l:$44f, c:$450>
   N005 (  1,  1) [000018] -----------                   t18 =    LCL_VAR   double V00 arg0         u:1 $100
   N006 (  1,  1) [000019] -----------                   t19 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t19    ref    
   N008 (  3,  4) [000075] -c---------                   t75 = ▌  LEA(b+32) byref 
                                                               ┌──▌  t75    byref  
   N009 (  4,  3) [000021] n---GO-----                   t21 = ▌  IND       double <l:$444, c:$445>
*  N006 (  1,  1) [000031] -----------                   t31 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t31    ref    
*  N008 (  3,  4) [000081] -c---------                   t81 = ▌  LEA(b+40) byref 
                                                               ┌──▌  t81    byref  
*  N009 (  4,  3) [000033] n---GO-----                   t33 = ▌  IND       double <l:$452, c:$453>
                                                               ┌──▌  t18    double 
                                                               ├──▌  t21    double 
   N010 ( 10,  8) [000022] ----GO-----                   t22 = ▌  MUL       double <l:$449, c:$448>
                                                               ┌──▌  t17    double 
                                                               ├──▌  t22    double 
   N011 ( 19, 15) [000023] ---XGO-----                   t23 = ▌  ADD       double <l:$44d, c:$44c>
   N012 (  1,  1) [000014] -----------                   t14 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t14    ref    
   N014 (  3,  4) [000071] -c---------                   t71 = ▌  LEA(b+8)  byref 
*  N005 (  1,  1) [000030] -----------                   t30 =    LCL_VAR   double V00 arg0         u:1 $100
                                                               ┌──▌  t30    double 
                                                               ├──▌  t33    double 
*  N010 ( 10,  8) [000034] ----GO-----                   t34 = ▌  MUL       double <l:$457, c:$456>
                                                               ┌──▌  t29    double 
                                                               ├──▌  t34    double 
*  N011 ( 19, 15) [000035] ----GO-----                   t35 = ▌  ADD       double <l:$45b, c:$45a>
                                                               ┌──▌  t71    byref  
                                                               ├──▌  t23    double 
1. N015 ( 24, 19) [000025] nA-XGO-----                         ▌  STOREIND  double <l:$206, c:$205>
*  N012 (  1,  1) [000026] -----------                   t26 =    LCL_VAR   ref    V04 loc2         u:2 <l:$202, c:$142>
                                                               ┌──▌  t26    ref    
*  N014 (  3,  4) [000077] -c---------                   t77 = ▌  LEA(b+16) byref 
                                                               ┌──▌  t77    byref  
                                                               ├──▌  t35    double 
2. N015 ( 24, 19) [000037] nA--GO-----                         ▌  STOREIND  double <l:$206, c:$205>
                  [000109] -----------                            IL_OFFSET void   INLRT @ 0x01F[E-]

Here you can see that [000025] was moved past [000035], the data node of the second indir.

May 13 '24 18:05 jakobbotsch

@kunalspathak Any other feedback or are you ok with merging this as is?

May 15 '24 20:05 jakobbotsch

Ah, sorry, I thought I already approved. The JitStress* failures are existing issues I assume?

Yep, they look known.

May 15 '24 21:05 jakobbotsch

runtime runtime copied to clipboard

JIT: Reorder stores to make them amenable to stp optimization

runtime
runtime copied to clipboard