
runtime: memory corruption crashes

Open tedli opened this issue 4 months ago • 5 comments

Go version

go version go1.21.11 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/root/.cache/go-build'
GOENV='/root/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/usr/local/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.21.11'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build747085668=/tmp/go-build -gno-record-gcc-switches'

What did you do?

A bare metal Kubernetes cluster had been running for over half a year. I joined some new nodes into the cluster and found that the docker daemon on one newly joined node hangs. Restarting the docker daemon gets it running again, but after a few hours it hangs once more. I built the same docker daemon version with the same Go toolchain as the official binary release (Go 1.21), with debugging turned on, and replaced dockerd with the debug build, then repeated the restart/hang cycle many times. The stack dumps taken at hang time vary, and occasionally the process panics (in the runtime, not the application). Trying the latest Go release (1.23.1) brought no luck, nor did the latest docker release. With GODEBUG=gctrace=1 set, the GC also stops logging when the daemon hangs.

Searching for the panic messages and stack traces turned up some related issues, but almost all of them are closed due to age.

https://github.com/golang/go/issues/15658 provides reproduction code, and that code crashes reliably on this node (Go 1.21, cgo off). With GOMAXPROCS=1 set, the reproduction code no longer crashes. I then built the reproduction code with Go 1.9.3, which a comment in that issue says includes a fix; the 1.9.3-built binary never crashes, even without GOMAXPROCS=1.

https://github.com/golang/go/issues/20427

https://github.com/golang/go/blob/b521ebb55a9b26c8824b219376c7f91f7cda6ec2/src/runtime/sys_linux_amd64.s#L222-L229

https://github.com/torvalds/linux/commit/889b3c1245de48ed0cacf7aebb25c489d3e4a3e9#diff-c1a25be6ec9efccf08bb1dd54dd545b0ce4a12f6fc1aba602a78bff5a016a8a4L141

Linux removed the CONFIG_OPTIMIZE_INLINING option in 5.4. Following the commit above, I rebuilt the kernel with the inline macro hardcoded to always_inline (the CONFIG_OPTIMIZE_INLINING=n behavior) and booted the always_inline kernel, but no luck. The reproduction code did seem to live longer, though: without always_inline it crashes within 10 seconds, while with always_inline it can survive up to a minute. The 1.9.3-built binary still never crashes.
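The runtime assembly and kernel commit linked above both concern the vDSO clock_gettime fast path on linux/amd64, which every time.Now call goes through. As a hedged, self-contained sketch (my own code, not from any of the linked issues), hammering that path concurrently is one way to isolate it; `hammerClock` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// hammerClock calls time.Now, which on linux/amd64 goes through the
// vDSO clock_gettime fast path, from several goroutines at once and
// returns the latest timestamp observed. A machine with a corrupted
// vDSO call path would be expected to fault or misbehave here.
func hammerClock(goroutines, calls int) time.Time {
	var wg sync.WaitGroup
	var mu sync.Mutex
	var last time.Time
	for g := 0; g < goroutines; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var t time.Time
			for i := 0; i < calls; i++ {
				t = time.Now()
			}
			mu.Lock()
			if t.After(last) {
				last = t
			}
			mu.Unlock()
		}()
	}
	wg.Wait()
	return last
}

func main() {
	fmt.Println("last timestamp:", hammerClock(8, 100000))
}
```

If this loop survives while the goroutine-churn reproducer crashes, the clock path is probably not the trigger; if it also faults, the vDSO angle becomes much more plausible.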

The misbehaving node has the same specs as the others and was set up with the same Ansible script. A full memtest86+ run completed with all tests passed. The other nodes work as expected, without any changes to any binary.

One caveat: these nodes sit in a poorly maintained data center, and thermal issues and dust have caused trouble on other nodes before. But this does not look like a hardware issue, since only Go programs break; I can still SSH in to operate the machine, and all other system components work as expected.

What did you see happen?

The reproduction code from https://github.com/golang/go/issues/15658 crashes reliably on my machine.

What did you expect to see?

The reproduction code should no longer crash, since the underlying bug was fixed as of Go 1.9.3.

tedli · Oct 12 '24 07:10