cmd/compile: amd64 carry flag spilling uses SBBQ + NEGQ instead of SETCS

Open bremac opened this issue 1 year ago • 7 comments

Go version

go version go1.23.0 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/bremac/.cache/go-build'
GOENV='/home/bremac/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/bremac/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/bremac/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/lib/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/lib/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.23.0'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/home/bremac/.config/go/telemetry'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build3831806201=/tmp/go-build -gno-record-gcc-switches'

What did you do?

This code is a simplified form of a longer unrolled loop, with the non-carry-related logic removed:

func example(carryIn uint, x, y, result []uint) uint {
	// Check lengths up-front to simplify the code generated for the loop
	if len(x) != len(y) || len(x) != len(result) {
		panic("length mismatch")
	}
	for i := 0; i < len(x); i++ {
		result[i], carryIn = bits.Add(x[i], y[i], carryIn)
	}
	return carryIn
}

Playground: https://go.dev/play/p/gGVkiLN6qbV
Compiler Explorer: https://go.godbolt.org/z/W313f1EYG
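For anyone who wants to sanity-check the carry behaviour without opening the links, here is a minimal driver sketch. The `main` function and the input values are my own illustration (they are not taken from the playground link); `example` is the function from above, unchanged apart from the added import.

```go
package main

import (
	"fmt"
	"math/bits"
)

// example adds x and y limb by limb, threading the carry through bits.Add.
func example(carryIn uint, x, y, result []uint) uint {
	// Check lengths up-front to simplify the code generated for the loop
	if len(x) != len(y) || len(x) != len(result) {
		panic("length mismatch")
	}
	for i := 0; i < len(x); i++ {
		result[i], carryIn = bits.Add(x[i], y[i], carryIn)
	}
	return carryIn
}

func main() {
	// Hypothetical inputs: the first limb overflows, and the carry
	// ripples through the second limb into the third.
	x := []uint{^uint(0), ^uint(0), 0}
	y := []uint{1, 0, 0}
	result := make([]uint, len(x))

	carryOut := example(0, x, y, result)
	fmt.Println(result, carryOut) // prints "[0 0 1] 0"
}
```

Every iteration consumes the carry produced by the previous one, which is why the way the carry is materialized between iterations matters for throughput.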

What did you see happen?

On amd64, the compiled loop has a throughput of one iteration every four cycles:

main_example_pc48:
        MOVQ    (BX)(DI*8), R8
        MOVQ    (SI)(DI*8), R9
        LEAQ    1(DI), R10
        NEGL    AX
        ADCQ    R8, R9
        MOVQ    R9, (DX)(DI*8)
        SBBQ    AX, AX
        NEGQ    AX
        MOVQ    R10, DI
main_example_pc78:
        CMPQ    CX, DI
        JGT     main_example_pc48

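The bottleneck is the loop-carried NEGL -> ADCQ -> SBBQ -> NEGQ dependency chain: each iteration's NEGL has to wait for the AX value produced by the previous iteration's NEGQ.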
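To make the chain easier to follow, here is a rough Go model of what the four flag-related instructions do in one iteration. This is only a sketch of my reading of the instruction semantics; `step` and its variable names are hypothetical and do not correspond to anything in the compiler.

```go
package main

import (
	"fmt"
	"math/bits"
)

// step models one loop iteration. ax holds the carry (0 or 1) spilled to a
// register by the previous iteration; the returned ax is this iteration's.
func step(x, y, ax uint64) (sum, axOut uint64) {
	// NEGL AX: sets CF when the 32-bit operand is nonzero, which
	// reloads the spilled carry into the carry flag.
	var cf uint64
	if uint32(ax) != 0 {
		cf = 1
	}

	// ADCQ R8, R9: add with carry-in, producing a new carry flag.
	sum, cf = bits.Add64(x, y, cf)

	// SBBQ AX, AX: AX = AX - AX - CF, i.e. 0 or all ones.
	ax = 0 - cf

	// NEGQ AX: turns all ones back into 1, so AX is 0 or 1 again.
	ax = -ax

	return sum, ax
}

func main() {
	sum, carry := step(^uint64(0), 1, 0)
	fmt.Println(sum, carry) // prints "0 1"

	sum, carry = step(1, 2, carry)
	fmt.Println(sum, carry) // prints "4 0"
}
```

In this model the last two steps together just compute `ax = cf`, which is the part the SBBQ / NEGQ pair spends two dependent instructions on.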

What did you expect to see?

The SBBQ / NEGQ pair should use SETCS instead, e.g.

main_example_pc48:
        MOVQ    (BX)(DI*8), R8
        MOVQ    (SI)(DI*8), R9
        LEAQ    1(DI), R10
        NEGL    AX
        ADCQ    R8, R9
        MOVQ    R9, (DX)(DI*8)
        SETCS   AX
        MOVQ    R10, DI
main_example_pc78:
        CMPQ    CX, DI
        JGT     main_example_pc48

This shortens the dependency chain to three instructions (NEGL -> ADCQ -> SETCS).

bremac, Aug 20 '24 05:08