cmd/compile: multiple performance regressions due to CL 270940, 448036, or 454036
The performance dashboard shows several regressions with the same culprit. Look for entries that say "change at fc81405".
Digging further, there are actually three commits between the previous good run and the bad run:
https://go.googlesource.com/go/+log/b419db6c15519a29ff3d7d2e56d8f115204f8c5d..fc814056aae191f61f46bef5be6e29ee3dc09b89
- https://go.dev/cl/270940 cmd/compile: improve scheduling pass
- https://go.dev/cl/448036 runtime: remove 104 byte stack guard
- https://go.dev/cl/454036 cmd/compile: rewrite empty makeslice to zerobase pointer
My gut says the scheduling pass CL is the most likely culprit (cc @randall77).
The full set of regressions I see:
- MulWorkspaceDense1000Hundredth-8 (sec/op): +22.69%
  - Source, module version v0.9.3
- Hash8K-8 (sec/op): +4.87%
  - Source, module version v0.0.0-20180603041446-333f0593be06
- Dgeev/Circulant100-8 (sec/op): +3.09%
  - Source, module version v0.9.3
- FoglemanPathTraceRenderGopherIter1 (sec/op): +2.09%
  - Source, ~tip
- MTFT-8 (sec/op): +1.13%
  - Source, module version v1.9.0
- FoglemanFauxGLRenderRotateBoat (sec/op): +1.12%
  - Source, ~tip
- Encoding4KBVerySparse-8 (sec/op): +4.75%
  - Source, module version v1.10.9
cc @golang/compiler
For reference, here are the raw results from the runs immediately before and after the change.
Before: https://perf.golang.org/search?q=upload%3A20230120.26+%7C+toolchain%3Abaseline+vs+toolchain%3Aexperiment
After: https://perf.golang.org/search?q=upload%3A20230120.29+%7C+toolchain%3Abaseline+vs+toolchain%3Aexperiment
The geomean change of time/op went from -1.24% before the change to -1.26% after, so overall things seem slightly better, though I'm not sure this difference is statistically significant.
I can't reproduce, at least for BenchmarkMulWorkspaceDense1000Hundredth.
All of 1.19.4, 1.20rc3, and tip are within just a percent or so of each other. In fact tip is the fastest.
Is this linux/amd64? Are there more detailed instructions to reproduce?
Ah, I think I need -tags noasm as part of the build. That gives me some performance delta.
Old loop:
13.25s 13.25s 508284: MULSD X1, X3
10.38s 10.41s 508288: ADDSD 0(R11)(R13*8), X3
28.20s 28.21s 50828e: MOVSD_XMM X3, 0(R11)(R13*8)
16.44s 16.44s 508294: INCQ R13
7.65s 7.65s 508297: CMPQ R13, BX
. 10ms 50829a: JLE 0x5082a6
5.27s 5.29s 50829c: MOVSD_XMM 0(DI)(R13*8), X3
12.29s 12.29s 5082a2: JA 0x508284
New loop:
7.41s 7.42s 502fe5: LEAQ 0x1(R13), R8
9.70s 9.70s 502fe9: MULSD X1, X3
16.60s 16.61s 502fed: ADDSD 0(R11)(R13*8), X3
27.75s 27.75s 502ff3: MOVSD_XMM X3, 0(R11)(R13*8)
12.59s 12.59s 502ff9: MOVQ R8, R13
6.34s 6.34s 502ffc: MOVQ 0x88(SP), R8
8.22s 8.22s 503004: CMPQ R13, BX
. . 503007: JLE 0x503013
9.95s 9.96s 503009: MOVSD_XMM 0(DI)(R13*8), X3
17.17s 17.17s 50300f: JA 0x502fe5
So #57976 is definitely in play, but there's also the MOVQ 0x88(SP), R8 that I don't understand. I will continue looking tomorrow.
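For context, here's a minimal sketch (not the actual gonum source; the kernel name and shape are my assumptions) of the kind of scale-and-accumulate loop that produces code shaped like the above when built with -tags noasm:

```go
// axpyKernel is a hypothetical stand-in for gonum's pure-Go inner loop
// used by BenchmarkMulWorkspaceDense1000Hundredth when the assembly
// kernels are disabled with -tags noasm.
func axpyKernel(alpha float64, x, y []float64) {
	for i, xv := range x {
		// Each iteration is roughly the MULSD/ADDSD/MOVSD triple above,
		// followed by the index increment and the loop/bounds checks.
		y[i] += alpha * xv
	}
}
```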
Change https://go.dev/cl/463751 mentions this issue: cmd/compile: schedule values with no in-block uses later
That test runs as part of bent: https://cs.opensource.google/go/x/benchmarks/+/master:cmd/bent/configs/benchmarks-50.toml;l=81. I actually don't see anywhere that it passes -tags noasm, but perhaps I missed something? (cc @dr2chase)
I just submitted a fix, at least for BenchmarkMulWorkspaceDense1000Hundredth. I haven't checked the other regressions but I think they are probably related. We'll see if they recover on the perf dashboard.
Looks like these are fixed:
MulWorkspaceDense1000Hundredth
Dgeev/Circulant100
These are not:
Hash8K
FoglemanPathTraceRenderGopherIter1
MTFT
FoglemanFauxGLRenderRotateBoat
The last 2 are noisy enough that it is hard to be sure, though.
Encoding4KBVerySparse looks even worse after the fix :(
For reference, the run on your fix CL (commit 6224db9) was the very last run using 1.19 as baseline. The very next run (commit ab0f045) used 1.20 as baseline.
So the best way to eyeball impact is to look only at results on the 6224db9 run.
To be clear, I still (mostly) agree with your assessment @randall77.
The only point I disagree on is Encoding4KBVerySparse. At 6224db9 it looks unchanged (maybe slightly worse). The reason it jumps to +7% on the next commit is that, prior to the regression, this benchmark was -10% vs 1.19. 1.20 locked in those gains, so with a baseline of 1.20 this benchmark now shows as regressed vs 1.20.
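Rough arithmetic, assuming the two percentages compose directly: if pre-regression tip and 1.20 were both about 0.90x the 1.19 time, then a +7% reading against the 1.20 baseline puts current tip at roughly 1.07 × 0.90 ≈ 0.96x the 1.19 time, i.e. still a bit faster than 1.19 but noticeably slower than 1.20.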
Encoding4KBVerySparse is a very micro benchmark, so it might just be noisy for silly reasons.
Random, slightly related thought: Emery Berger and a student had a project that randomized things like load order to produce a more normal distribution of benchmark results, which lets you compare actual means. Since, when we're doing build benchmarking, we compile and link more than once, we could add a random seed to the linker for symbol (package?) ordering, build multiple binaries, and get some of that sweet sweet randomness.
Looking a bit more at Hash8K. It is slower at tip than 1.20, by about 6%.
It is all due to (*blake2s/Digest).compress.
It's a pretty big function: it compiles to 2008 instructions in 1.20 and 2014 instructions in tip. It's all one giant, manually unrolled basic block.
It wants to put a lot more stuff in registers than is available on amd64. So keeping live values to a minimum and generating proper spill/restore code is important.
It's not really clear yet what about this function is affected by the different schedule. I'll see if any tweaks could make it a bit better.
Several theories discarded. Current hypothesis: the different schedule more often causes the register allocator to pick an output register different from both input registers. There are lots of adds in this code, and add is a two-register operation on amd64, so the nonmatching registers force a MOV+OP pair or a LEAL instead of an in-place update.
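A tiny made-up example of that constraint (nothing from blake2s, just an illustration): on amd64 an add overwrites one of its inputs, so if both inputs stay live and the result lands in a third register, the compiler has to copy first or use LEA.

```go
// sum is a hypothetical example: x and y both remain live after the add,
// so the result needs a register distinct from both inputs. Instead of an
// in-place
//     ADDL BX, AX
// the compiler has to emit something like
//     MOVL AX, CX
//     ADDL BX, CX
// or the three-operand form
//     LEAL (AX)(BX*1), CX
func sum(x, y uint32) (uint32, uint32, uint32) {
	return x, y, x + y
}
```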
#58534 should reduce the targets left in this issue to MTFT and Hash8K.