cmd/compile: amd64 carry flag spilling uses SBBQ + NEGQ instead of SETCS
Go version
go version go1.23.0 linux/amd64
Output of go env in your module/workspace:
GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/bremac/.cache/go-build'
GOENV='/home/bremac/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/bremac/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/bremac/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/lib/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/lib/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.23.0'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/home/bremac/.config/go/telemetry'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build3831806201=/tmp/go-build -gno-record-gcc-switches'
What did you do?
This code is a simplified form of a longer unrolled loop, with the non-carry-related logic removed:
func example(carryIn uint, x, y, result []uint) uint {
	// Check lengths up-front to simplify the code generated for the loop
	if len(x) != len(y) || len(x) != len(result) {
		panic("length mismatch")
	}
	for i := 0; i < len(x); i++ {
		result[i], carryIn = bits.Add(x[i], y[i], carryIn)
	}
	return carryIn
}
https://go.dev/play/p/gGVkiLN6qbV https://go.godbolt.org/z/W313f1EYG
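For readers who want to run the snippet locally, here is a minimal self-contained sketch of the function above together with a small sanity check. The `main` function, the `allOnes` constant, and the chosen inputs are illustrative assumptions and are not part of the original report.

```go
package main

import (
	"fmt"
	"math/bits"
)

// example adds x and y word by word, threading the carry through bits.Add.
func example(carryIn uint, x, y, result []uint) uint {
	// Check lengths up-front to simplify the code generated for the loop
	if len(x) != len(y) || len(x) != len(result) {
		panic("length mismatch")
	}
	for i := 0; i < len(x); i++ {
		result[i], carryIn = bits.Add(x[i], y[i], carryIn)
	}
	return carryIn
}

func main() {
	// Hypothetical inputs: adding 1 to an all-ones two-word value
	// overflows both words and leaves a carry out of 1.
	const allOnes = ^uint(0)
	x := []uint{allOnes, allOnes}
	y := []uint{1, 0}
	result := make([]uint, len(x))
	carryOut := example(0, x, y, result)
	fmt.Println(result, carryOut) // prints "[0 0] 1"
}
```

The generated loop can then be inspected with `go build -gcflags=-S` or with the Compiler Explorer link above.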
What did you see happen?
On amd64, the compiled loop has a throughput of one iteration every four cycles:
main_example_pc48:
        MOVQ    (BX)(DI*8), R8
        MOVQ    (SI)(DI*8), R9
        LEAQ    1(DI), R10
        NEGL    AX
        ADCQ    R8, R9
        MOVQ    R9, (DX)(DI*8)
        SBBQ    AX, AX
        NEGQ    AX
        MOVQ    R10, DI
main_example_pc78:
        CMPQ    CX, DI
        JGT     main_example_pc48
The bottleneck is the loop-carried NEGL -> ADCQ -> SBBQ -> NEGQ dependency chain: each iteration's NEGL has to wait for the previous iteration's NEGQ.
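To make the role of each instruction in that chain concrete, the following is a rough Go model of how the carry moves between the AX register and the carry flag. It is only a sketch: the function names are hypothetical, the carry flag is modeled as a `bool`, and the `SETCS` model assumes the rest of the destination register is already zero, since `SETCS` writes a single byte.

```go
package main

import "fmt"

// negSetsCarry models the flag effect of NEGL AX, which restores the carry
// flag from the spilled value: CF is set exactly when AX is nonzero.
func negSetsCarry(ax uint32) bool {
	return ax != 0
}

// spillSBBNEG models SBBQ AX, AX followed by NEGQ AX.
func spillSBBNEG(cf bool) uint64 {
	var borrow uint64
	if cf {
		borrow = 1
	}
	ax := uint64(0) - borrow // SBBQ AX, AX: AX - AX - CF, so 0 or all ones
	return -ax               // NEGQ AX: maps all ones to 1 and 0 to 0
}

// spillSETCS models SETCS AX, assuming the upper bits of AX are zero.
func spillSETCS(cf bool) uint64 {
	if cf {
		return 1
	}
	return 0
}

func main() {
	for _, cf := range []bool{false, true} {
		fmt.Println(cf, spillSBBNEG(cf), spillSETCS(cf))
	}
	// prints:
	// false 0 0
	// true 1 1
}
```

Both spill sequences produce the same 0 or 1 value; the difference is that one takes two dependent instructions and the other takes one.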
What did you expect to see?
The compiler should emit SETCS in place of the SBBQ / NEGQ pair, e.g.:
main_example_pc48:
        MOVQ    (BX)(DI*8), R8
        MOVQ    (SI)(DI*8), R9
        LEAQ    1(DI), R10
        NEGL    AX
        ADCQ    R8, R9
        MOVQ    R9, (DX)(DI*8)
        SETCS   AX
        MOVQ    R10, DI
main_example_pc78:
        CMPQ    CX, DI
        JGT     main_example_pc48
This shortens the loop-carried dependency chain to three instructions: NEGL -> ADCQ -> SETCS.
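As a back-of-the-envelope check of the expected gain, the sketch below sums assumed latencies along each loop-carried chain. The 1-cycle latency per instruction is an assumption chosen to match the observed four cycles per iteration for the four-instruction chain; actual latencies depend on the microarchitecture. The `chainCycles` helper is hypothetical.

```go
package main

import "fmt"

// chainCycles sums the per-instruction latencies along a loop-carried
// dependency chain, which gives a lower bound on cycles per iteration.
func chainCycles(latencies ...int) int {
	total := 0
	for _, l := range latencies {
		total += l
	}
	return total
}

func main() {
	// Assumption: every instruction on the chain has a latency of 1 cycle.
	current := chainCycles(1, 1, 1, 1) // NEGL, ADCQ, SBBQ, NEGQ
	proposed := chainCycles(1, 1, 1)   // NEGL, ADCQ, SETCS
	fmt.Printf("current: %d cycles/iteration, proposed: %d cycles/iteration\n", current, proposed)
	// prints "current: 4 cycles/iteration, proposed: 3 cycles/iteration"
}
```

Under that assumption the loop would go from four cycles per iteration to three, provided nothing else in the loop body becomes the bottleneck.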