cmd/compile: multiple performance regressions due to CL 270940, 448036, or 454036
The performance dashboard shows several regressions with the same culprit. Look for entries that say "change at fc81405".
Digging further, there are actually three commits between the previous good run and the bad run:
https://go.googlesource.com/go/+log/b419db6c15519a29ff3d7d2e56d8f115204f8c5d..fc814056aae191f61f46bef5be6e29ee3dc09b89
- https://go.dev/cl/270940 cmd/compile: improve scheduling pass
- https://go.dev/cl/448036 runtime: remove 104 byte stack guard
- https://go.dev/cl/454036 cmd/compile: rewrite empty makeslice to zerobase pointer
My gut says the scheduling pass CL is the most likely culprit (cc @randall77).
The full set of regressions I see:
- MulWorkspaceDense1000Hundredth-8 (sec/op): +22.69%
  - Source, module version v0.9.3
- Hash8K-8 (sec/op): +4.87%
  - Source, module version v0.0.0-20180603041446-333f0593be06
- Dgeev/Circulant100-8 (sec/op): +3.09%
  - Source, module version v0.9.3
- FoglemanPathTraceRenderGopherIter1 (sec/op): +2.09%
  - Source, ~tip
- MTFT-8 (sec/op): +1.13%
  - Source, module version v1.9.0
- FoglemanFauxGLRenderRotateBoat (sec/op): +1.12%
  - Source, ~tip
- Encoding4KBVerySparse-8 (sec/op): +4.75%
  - Source, module version v1.10.9
cc @golang/compiler
For reference, here are the raw results from the runs immediately before and after the change.
Before: https://perf.golang.org/search?q=upload%3A20230120.26+%7C+toolchain%3Abaseline+vs+toolchain%3Aexperiment
After: https://perf.golang.org/search?q=upload%3A20230120.29+%7C+toolchain%3Abaseline+vs+toolchain%3Aexperiment
The geomean change of time/op went from -1.24% before the change to -1.26% after, so overall things seem slightly better, though I'm not sure this difference is statistically significant.
I can't reproduce, at least for BenchmarkMulWorkspaceDense1000Hundredth.
All of 1.19.4, 1.20rc3, and tip are within just a percent or so of each other. In fact tip is the fastest.
Is this linux/amd64? Are there more detailed instructions to reproduce?
Ah, I think I need -tags noasm as part of the build. That gives me some performance delta.
Old loop:
13.25s 13.25s 508284: MULSD X1, X3
10.38s 10.41s 508288: ADDSD 0(R11)(R13*8), X3
28.20s 28.21s 50828e: MOVSD_XMM X3, 0(R11)(R13*8)
16.44s 16.44s 508294: INCQ R13
7.65s 7.65s 508297: CMPQ R13, BX
. 10ms 50829a: JLE 0x5082a6
5.27s 5.29s 50829c: MOVSD_XMM 0(DI)(R13*8), X3
12.29s 12.29s 5082a2: JA 0x508284
New loop:
7.41s 7.42s 502fe5: LEAQ 0x1(R13), R8
9.70s 9.70s 502fe9: MULSD X1, X3
16.60s 16.61s 502fed: ADDSD 0(R11)(R13*8), X3
27.75s 27.75s 502ff3: MOVSD_XMM X3, 0(R11)(R13*8)
12.59s 12.59s 502ff9: MOVQ R8, R13
6.34s 6.34s 502ffc: MOVQ 0x88(SP), R8
8.22s 8.22s 503004: CMPQ R13, BX
. . 503007: JLE 0x503013
9.95s 9.96s 503009: MOVSD_XMM 0(DI)(R13*8), X3
17.17s 17.17s 50300f: JA 0x502fe5
So #57976 is definitely in play, but there's also the MOVQ 0x88(SP), R8 that I don't understand. I will continue looking tomorrow.
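For context, here's a minimal sketch (not the actual gonum source; the kernel name and shape are my assumptions) of the kind of scale-and-accumulate loop that produces code shaped like the above when built with -tags noasm:

```go
// axpyKernel is a hypothetical stand-in for gonum's pure-Go inner loop
// used by BenchmarkMulWorkspaceDense1000Hundredth when the assembly
// kernels are disabled with -tags noasm.
func axpyKernel(alpha float64, x, y []float64) {
	for i, xv := range x {
		// Each iteration is roughly the MULSD/ADDSD/MOVSD triple above,
		// followed by the index increment and the loop/bounds checks.
		y[i] += alpha * xv
	}
}
```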
Change https://go.dev/cl/463751 mentions this issue: cmd/compile: schedule values with no in-block uses later
That test runs as part of bent: https://cs.opensource.google/go/x/benchmarks/+/master:cmd/bent/configs/benchmarks-50.toml;l=81. I actually don't see anywhere that it passes -tags noasm, but perhaps I missed something? (cc @dr2chase)
I just submitted a fix, at least for BenchmarkMulWorkspaceDense1000Hundredth. I haven't checked the other regressions but I think they are probably related. We'll see if they recover on the perf dashboard.
Looks like these are fixed:
MulWorkspaceDense1000Hundredth
Dgeev/Circulant100
These are not:
Hash8K
FoglemanPathTraceRenderGopherIter1
MTFT
FoglemanFauxGLRenderRotateBoat
The last 2 are noisy enough that it is hard to be sure, though.
Encoding4KBVerySparse looks even worse after the fix :(
For reference, the run on your fix CL (commit 6224db9) was the very last run using 1.19 as baseline. The very next run (commit ab0f045) used 1.20 as baseline.
So the best way to eyeball impact is to look only at results on the 6224db9 run.
To be clear, I still (mostly) agree with your assessment @randall77.
The only point I disagree on is Encoding4KBVerySparse. At 6224db9 it looks unchanged (maybe slightly worse). The reason it jumps to +7% on the next commit is that, prior to the regression, this benchmark was -10% vs 1.19. 1.20 locked in those gains, so with a baseline of 1.20 this benchmark now shows as regressed vs 1.20.
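Rough arithmetic, assuming the two percentages compose directly: if pre-regression tip and 1.20 were both about 0.90x the 1.19 time, then a +7% reading against the 1.20 baseline puts current tip at roughly 1.07 × 0.90 ≈ 0.96x the 1.19 time, i.e. still a bit faster than 1.19 but noticeably slower than 1.20.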
Encoding4KBVerySparse is a very micro benchmark, so it might just be noisy for silly reasons.
Random, slightly related thought: Emery Berger and a student had a project that randomized things like load order to produce a more normal distribution of benchmark results, which lets you compare actual means. Since, when we're doing build benchmarking, we compile and link more than once, we could add a random seed to the linker for symbol (package?) ordering, build multiple binaries, and get some of that sweet sweet randomness.
Looking a bit more at Hash8K. It is slower at tip than 1.20, by about 6%.
It is all due to (*blake2s/Digest).compress.
It's a pretty big function: it compiles to 2008 instructions in 1.20 and 2014 instructions in tip. It's all one giant, manually unrolled basic block.
It wants to put a lot more stuff in registers than is available on amd64. So keeping live values to a minimum and generating proper spill/restore code is important.
It's not really clear yet what about this function is affected by the different schedule. I'll see if any tweaks could make it a bit better.
Several theories discarded. Current hypothesis: the different schedule more often causes the register allocator to pick an output register different from both input registers. There are lots of adds in this code, and add is a two-register operation on amd64, so the nonmatching registers force a MOV+OP pair or a LEAL instead of an in-place update.
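A tiny made-up example of that constraint (nothing from blake2s, just an illustration): on amd64 an add overwrites one of its inputs, so if both inputs stay live and the result lands in a third register, the compiler has to copy first or use LEA.

```go
// sum is a hypothetical example: x and y both remain live after the add,
// so the result needs a register distinct from both inputs. Instead of an
// in-place
//     ADDL BX, AX
// the compiler has to emit something like
//     MOVL AX, CX
//     ADDL BX, CX
// or the three-operand form
//     LEAL (AX)(BX*1), CX
func sum(x, y uint32) (uint32, uint32, uint32) {
	return x, y, x + y
}
```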
#58534 should reduce the targets left in this issue to MTFT and Hash8K.