go
go copied to clipboard
cmd/link: rare corruption of ELF binaries
What version of Go are you using (go version)?
$ go version go version go1.18 linux/amd64
Does this issue reproduce with the latest release?
Yes, Go 1.18 is the latest (stable) release.
What operating system and processor architecture are you using (go env)?
go env Output
$ go env GO111MODULE="" GOARCH="amd64" GOBIN="" GOCACHE="/home/michael/.cache/go-build" GOENV="/home/michael/.config/go/env" GOEXE="" GOEXPERIMENT="" GOFLAGS="" GOHOSTARCH="amd64" GOHOSTOS="linux" GOINSECURE="" GOMODCACHE="/home/michael/go/pkg/mod" GONOPROXY="" GONOSUMDB="" GOOS="linux" GOPATH="/home/michael/go" GOPRIVATE="" GOPROXY="https://proxy.golang.org,direct" GOROOT="/home/michael/sdk/go1.18" GOSUMDB="sum.golang.org" GOTMPDIR="" GOTOOLDIR="/home/michael/sdk/go1.18/pkg/tool/linux_amd64" GOVCS="" GOVERSION="go1.18" GCCGO="gccgo" GOAMD64="v1" AR="ar" CC="gcc" CXX="g++" CGO_ENABLED="1" GOMOD="/home/michael/2022-06-25-squashfs-debug/go.mod" GOWORK="" CGO_CFLAGS="-g -O2" CGO_CPPFLAGS="" CGO_CXXFLAGS="-g -O2" CGO_FFLAGS="-g -O2" CGO_LDFLAGS="-g -O2" PKG_CONFIG="pkg-config" GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1868362460=/tmp/go-build -gno-record-gcc-switches"
What did you do?
I am building and deploying Go software from a cron job every day.
Recently, I noticed that sometimes, some of my executable binary files do not start up because they are corrupt!
The first time I noticed the issue, the init binary of one of my https://gokrazy.org/ installations was affected, resulting in an installation that wouldn’t boot at all.
The other time, it wasn’t the init binary, but a program of mine called regelwerk which is involved in motion sensor/light control in my home, so I noticed that because the lights weren’t working as they should.
It’s possible this happened more times and I just didn’t notice it.
Yesterday, I found someone on twitter who is also running into this issue, but with an entirely different program (not related to gokrazy at all): https://twitter.com/alvs_versteck/status/1546601648532983808
What did you expect to see?
The Go compiler/linker should produce ELF binaries that contain a valid ELF header.
What did you see instead?
The first 4096 bytes of the ELF binary are zeroed out, as well as another block of 4096 bytes at offset 256K.
You can find the files at https://t.zekjur.net/_2022-06-25-init/
In the other occurrence, it was 4096 bytes at the start of the ELF binary, then 4096 bytes at offset 0x9000.
Unfortunately I have no idea how to reproduce this issue.
Are the corrupted binaries deterministic?
If not, I wonder if #51611 is related.
Are the corrupted binaries deterministic?
WDYM by that, if they are always corrupted in the same way? If yes, I think so but currently only have one sample lying around where this happened so would need to wait for it to happen again to be able to tell for sure.
Are the corrupted binaries deterministic?
I can’t reproduce the issue unfortunately. When I did another build right after the corrupted build, everything worked.
But what all the 3 different corrupted ELF binaries I have seen (two of my own, one from alvs) have had in common so far is that they had a block of all-zero bytes at their first 4096 bytes.
CC @golang/compiler
Is the corruption always on the ELF header? Is it on the same machine or on multiple machines? What the file system are you running on? Thanks.
Is the corruption always on the ELF header?
I think so, I remember I got a couple of cannot execute binary file: Exec format error errors and rebuilding helped. I only have one lying around right now though and it happens intermittently, so I can't say for sure until it happens again.
Is it on the same machine or on multiple machines?
Multiple, mine and @stapelberg also sees it on his. Mine is a Lenovo t470s:
- Intel Core i7-7500U Prozessor (4 MB Cache, up to 3,5 GHz)
- 35,56 cm (14,0") WQHD (2.560 x 1.440) IPS-Display, Non-Touch
- 24 GB DDR4-2133 – 8 GB Onboard + 16 GB (1 SoDIMM)
- Intel HD-graphics 620
- [Crucial P5 Plus 1TB PCIe 4.0 3D NAND NVMe M.2 SSD, up to 6600MB/s - CT1000P5PSSD8](https://www.amazon.ca/gp/product/B098WL46RS)
What the file system are you running on?
xfs with Linux 5.17.9-200.fc35.x86_64
Is the corruption always on the ELF header?
I don’t know if all instances of this corruption will always affect the ELF header.
What I can say is that whenever I have noticed corruption, the ELF header was corrupted.
Note that ELF header failures are rather loud in comparison (binary can’t be started), so perhaps this is selection bias — other corruption could exist, I just don’t know about it because it doesn’t fail as badly perhaps (just speculating here).
Is it on the same machine or on multiple machines?
Multiple machines: mine (see https://michael.stapelberg.ch/posts/2022-01-15-high-end-linux-pc/ for specs) and @alvaroaleman’s machine (can you share your specs, too, please?)
What the file system are you running on? Thanks.
I’m using ext4 on this Linux 5.17.7 machine, but note that both the source code (generated) and the binary output location are on /tmp, which is a tmpfs mount in my case. The Go build cache is on ~ on ext4, but given that we’re talking about the final ELF binary, I would think that the Go linker should always create that, not copy it from cache.
The build command I’m using is: go build -mod=mod -o /tmp/gokr-packer456452338/init -tags=gokrazy,netgo,osusergo /tmp/gokr-packer456452338/init.go
Thanks. Frequent corruption on the ELF header is probably enough information. The ELF header is written by the linker only. The compiler and (most of) the Go command are unrelated. The one place that the go command touches the binary after linking is stamping the build ID. So it is either the linker or that build ID stamping.
Is your program a pure-Go binary or it uses cgo (i.e. whether it is internal linking or external linking)?
Is your program a pure-Go binary or it uses cgo (i.e. whether it is internal linking or external linking)?
Mine is pure go (i.E. CGO_ENABLED=0)
I am building with CGO_ENABLED=0, too.
4096 = block size of ext4, file system bug?
4096 = block size of ext4, file system bug?
Again: note that both the source code (generated) and the binary output location are on /tmp, which is a tmpfs mount in my case.
So I don’t think a file system bug is likely here.
Thanks. Pure-Go means that the file is written by the Go linker, not the C linker.
If this is somewhat reproducible, could you try if this patch makes any difference? (You'll need to rebuild the linker, go install cmd/link)
diff --git a/src/cmd/link/internal/ld/outbuf_mmap.go b/src/cmd/link/internal/ld/outbuf_mmap.go
index 40a3222788..c3cc157579 100644
--- a/src/cmd/link/internal/ld/outbuf_mmap.go
+++ b/src/cmd/link/internal/ld/outbuf_mmap.go
@@ -9,6 +9,7 @@ package ld
import (
"syscall"
+ "unsafe"
)
// Mmap maps the output file with the given size. It unmaps the old mapping
@@ -55,6 +56,7 @@ func (out *OutBuf) munmap() {
if out.buf == nil {
return
}
+ syscall.Syscall(syscall.SYS_MSYNC, uintptr(unsafe.Pointer(&out.buf[0])), uintptr(len(out.buf)), syscall.MS_SYNC) // XXX
syscall.Munmap(out.buf)
out.buf = nil
}
Also, do you know if this is a new bug, or it occurs with old versions of Go as well? Thanks.
@cherrymui I haven't seen this before using go 1.18, so I guess it is new. I did some more experimenting and found out:
- I can somewhat reliably reproduce this by go building in a loop, in five attempts it always took less than 200 builds
- It is not reproducible in a docker container, so seems specific to my env
- It is reproducible with an empty go cache
- It is not reproducible with a hello world go program, so seems to either be specific to my project or maybe just the size of the codebase
Regarding your patch, I suppose I need to replace the link binary in the GOTOOLDIR? I did that and the issue still happens. I then looked more at the link binary and it has the same sha256sum regardless of applying your patch or not...?
$ g diff
diff --git a/src/cmd/link/internal/ld/outbuf_mmap.go b/src/cmd/link/internal/ld/outbuf_mmap.go
index 40a3222788..c3cc157579 100644
--- a/src/cmd/link/internal/ld/outbuf_mmap.go
+++ b/src/cmd/link/internal/ld/outbuf_mmap.go
@@ -9,6 +9,7 @@ package ld
import (
"syscall"
+ "unsafe"
)
// Mmap maps the output file with the given size. It unmaps the old mapping
@@ -55,6 +56,7 @@ func (out *OutBuf) munmap() {
if out.buf == nil {
return
}
+ syscall.Syscall(syscall.SYS_MSYNC, uintptr(unsafe.Pointer(&out.buf[0])), uintptr(len(out.buf)), syscall.MS_SYNC) // XXX
syscall.Munmap(out.buf)
out.buf = nil
}
[2022-07-13-23:17]:[app.ci@ci][master][1]~/.../goroot/src
$ go build -o /tmp/link cmd/link/
[2022-07-13-23:17]:[app.ci@ci][master][1]~/.../goroot/src
$ sha256sum /tmp/link
e7e671ba4d83e6f3a820ba3f0269ef21e26f2c477938499d389f58a696ec7e19 /tmp/link
[2022-07-13-23:17]:[app.ci@ci][master][1]~/.../goroot/src
$ g reset --hard
HEAD is now at 558785a0a9 cmd/compile: remove -installsuffix flag
[2022-07-13-23:17]:[app.ci@ci][master][1]~/.../goroot/src
$ go build -o /tmp/link cmd/link/
[2022-07-13-23:17]:[app.ci@ci][master][1]~/.../goroot/src
$ sha256sum /tmp/link
e7e671ba4d83e6f3a820ba3f0269ef21e26f2c477938499d389f58a696ec7e19 /tmp/link
This is weird. The patch does make a difference on my machine
$ git status
On branch master
Your branch is up to date with 'origin/master'.
nothing to commit, working tree clean
$ git log -1 --oneline
558785a0a9 (HEAD -> master, origin/master, origin/HEAD) cmd/compile: remove -installsuffix flag
$ go build -o /tmp/link cmd/link
$ sha256sum /tmp/link
173de0c0efed5ea3c57ef497effe803f20c60603889de4c64cb1ac726b7ed45f /tmp/link
$ git apply </tmp/p
$ git diff
diff --git a/src/cmd/link/internal/ld/outbuf_mmap.go b/src/cmd/link/internal/ld/outbuf_mmap.go
index 40a3222788..c3cc157579 100644
--- a/src/cmd/link/internal/ld/outbuf_mmap.go
+++ b/src/cmd/link/internal/ld/outbuf_mmap.go
@@ -9,6 +9,7 @@ package ld
import (
"syscall"
+ "unsafe"
)
// Mmap maps the output file with the given size. It unmaps the old mapping
@@ -55,6 +56,7 @@ func (out *OutBuf) munmap() {
if out.buf == nil {
return
}
+ syscall.Syscall(syscall.SYS_MSYNC, uintptr(unsafe.Pointer(&out.buf[0])), uintptr(len(out.buf)), syscall.MS_SYNC) // XXX
syscall.Munmap(out.buf)
out.buf = nil
}
$ go build -o /tmp/link cmd/link
$ sha256sum /tmp/link
3237cd32022b5b6cee6a9286cdec0d10e7e24455d9ee610553e8caecd8754c81 /tmp/link
$ git reset --hard
HEAD is now at 558785a0a9 cmd/compile: remove -installsuffix flag
$ go build -o /tmp/link cmd/link
$ sha256sum /tmp/link
173de0c0efed5ea3c57ef497effe803f20c60603889de4c64cb1ac726b7ed45f /tmp/link
Maybe I messed up tabs and spaces when I first pasted the patch, so it didn't directly apply. Try this.
diff --git a/src/cmd/link/internal/ld/outbuf_mmap.go b/src/cmd/link/internal/ld/outbuf_mmap.go
index 40a3222788..c3cc157579 100644
--- a/src/cmd/link/internal/ld/outbuf_mmap.go
+++ b/src/cmd/link/internal/ld/outbuf_mmap.go
@@ -9,6 +9,7 @@ package ld
import (
"syscall"
+ "unsafe"
)
// Mmap maps the output file with the given size. It unmaps the old mapping
@@ -55,6 +56,7 @@ func (out *OutBuf) munmap() {
if out.buf == nil {
return
}
+ syscall.Syscall(syscall.SYS_MSYNC, uintptr(unsafe.Pointer(&out.buf[0])), uintptr(len(out.buf)), syscall.MS_SYNC) // XXX
syscall.Munmap(out.buf)
out.buf = nil
}
@alvaroaleman I think the problem you ran into when trying to apply the patch is:
go build -o /tmp/link cmd/link/
Go will not use the local directory, but your installation’s cmd/link source instead. You need to use:
go build -o /tmp/link ./cmd/link/
You can verify the code was updated by using go tool objdump ./pkg/tool/linux_amd64/link. Search the output for outbuf_mmap.go and look for the munmap code. You should see disassembly like:
outbuf_mmap.go:55 0x5ae5d2 4889442450 MOVQ AX, 0x50(SP)
outbuf_mmap.go:59 0x5ae5d7 4889542438 MOVQ DX, 0x38(SP)
outbuf_mmap.go:59 0x5ae5dc 488b4818 MOVQ 0x18(AX), CX
outbuf_mmap.go:59 0x5ae5e0 48c704241a000000 MOVQ $0x1a, 0(SP)
outbuf_mmap.go:59 0x5ae5e8 4889542408 MOVQ DX, 0x8(SP)
outbuf_mmap.go:59 0x5ae5ed 48894c2410 MOVQ CX, 0x10(SP)
outbuf_mmap.go:59 0x5ae5f2 48c744241804000000 MOVQ $0x4, 0x18(SP)
outbuf_mmap.go:59 0x5ae5fb 0f1f440000 NOPL 0(AX)(AX*1)
outbuf_mmap.go:59 0x5ae600 e8bbf0eeff CALL syscall.Syscall.abi0(SB)
outbuf_mmap.go:59 0x5ae605 450f57ff XORPS X15, X15
outbuf_mmap.go:59 0x5ae609 644c8b3425f8ffffff MOVQ FS:0xfffffff8, R14
outbuf_mmap.go:60 0x5ae612 488b442450 MOVQ 0x50(SP), AX
outbuf_mmap.go:60 0x5ae617 488b5810 MOVQ 0x10(AX), BX
outbuf_mmap.go:60 0x5ae61b 488b4818 MOVQ 0x18(AX), CX
outbuf_mmap.go:60 0x5ae61f 488b7820 MOVQ 0x20(AX), DI
syscall_linux.go:1142 0x5ae623 488b1576be2600 MOVQ syscall.mapper(SB), DX
syscall_linux.go:1142 0x5ae62a 4889d0 MOVQ DX, AX
syscall_linux.go:1142 0x5ae62d e88eceeeff CALL syscall.(*mmapper).Munmap(SB)
If you can’t find munmap, look for Close — maybe the code got inlined.
If you only see the syscall.(*mmapper).Munmap call, the patch was not applied correctly. If you see the syscall.Syscall.abi0 call, the patch was applied correctly.
I have applied the patch on my system and will report back if I still encounter the corruption. However, as I can’t trigger it even when building hundreds of times in a loop, it would be better if @alvaroaleman could report back, as it sounds like it happens more frequently for you?
Thanks
- We also encountered the same problem ( #53116) , On some linux/loong64 machines, an "exec format error" error occasionally occurs when executing the make.bash or all.bash. Check these executable files with errors, it is found that the data of 16K(only one pagesize) at the beginning of the file is missing. The result of hexdump is as follows:
1 0000000 0000 0000 0000 0000 0000 0000 0000 0000
2 *
3 0004000 0020 4c00 3d24 1c01 d084 02fc 2064 29c0
4 0004010 9004 02c0 4064 29c0 4800 576a 0000 0014
5 0004020 62d3 28c0 8e73 0012 1260 5c00 0025 0015
This problem is caused by the occasional loss of 16K data at the beginning of the file after the link process calls the syscall.Fallocate function for the second time. However, not all machines can reproduce this error (the file system formats I tested include xfs, ext4, tmfs ).
- my kernel version is 5.17.0+, pagesize is 16K。
- In the case of ensuring enough disk space, I removed the second fallocate call, the problem no longer occurs.
- I tried doing MS_SYNC before munmap, but the problem persists
- I tested the kernel with version 4.19.190, but this problem did not occur
Thanks for adding to this report, @abner-chenc!
I also just encountered this corruption again today, with @cherrymui’s patch applied — so the extra MSYNC does not help unfortunately.
I tried using posix_fallocate (the implementation refers to glib's posix_fallocate.c) replaces syscall.fallcate, and this error does not occur on the same machine , but this will cost performance.
@stapelberg, Here is my implementation of posix_fallocate: https://github.com/abner-chenc/go-abner/commit/c43a4a75a031ac9550eb02697b50032f1dced9ae
@stapelberg thanks for the hint, using go build -o /tmp/link ./cmd/link just resulted in main module (std) does not contain package std/cmd/link and building from the cmd folder as go build -o /tmp/link ./link still ended up using my installations cmd/link. I ended up applying the patch there, but as you already noted, it doesn't fix the issue.
Seems like @abner-chenc patch is indeed fixing this, I couldn't reproduce the issue anymore in 500 builds when I never needed more than 200 for it to occur prior to their patch.
Are all the Exec format error case that the first page is missing? Or (sometimes) just the ELF header?
Does the fallocate syscall not work on your file system? But it also doesn't return an error? The fallocate isn't really necessary unless you're running low on disk space. Is that the case on your system? What happens if you just comment that fallocate call out?
@abner-chenc your patch also reorders Truncate and fallocate. What happens if you don't reorder them, or just reorder them without doing the rest?
In the case of ensuring enough disk space, I removed the second fallocate call, the problem no longer occurs.
Could you explain what you did exactly? What is "the second fallocate call"?
@alvaroaleman it looks like @abner-chenc 's patch only affects Loong64. Just to make sure, did you change it to apply to your architecture as well?
Thanks.
@abner-chenc are you also running on tmpfs?
@alvaroaleman @stapelberg does it reproduce on a non-tmpfs file system?
Thanks.
@cherrymui I just applied https://github.com/abner-chenc/go-abner/commit/c43a4a75a031ac9550eb02697b50032f1dced9ae.patch without any changes, then built the link binary. While testing it, I output the binary to /tmp so a tmpfs, but when I orginally encountered the error I had targeted an xfs fs. I can check later if it also fixes the issue when targeting the xfs fs.
That patch should not make any difference if you're not running on Loong64. Would be good to try on xfs. Thanks.
@cherrymui well, but it does. I can reproduce this issue with a link binary without that patch (most of the time in less than 50 tries, all the time so far in less than 200) and can not reproduce it with one that has the patch in 500 tries. I just verified this again, this time both when outputting to tmpfs and xfs which doesn't change the result.
The patch also moves the Truncate call around and that happens on all archs for which outbufs_mmap.go is compiled - Maybe that is why?
Thanks. Could be. Could you try just moving the Truncate call without the rest of the change? Thanks.
@cherrymui yes, reducing the patch to only the changes in outbufs_mmap.go as shown below is enough to make the issue go away. Thanks for taking the time to look into this btw!
--- a/src/cmd/link/internal/ld/outbuf_mmap.go
+++ b/src/cmd/link/internal/ld/outbuf_mmap.go
@@ -20,6 +20,10 @@ func (out *OutBuf) Mmap(filesize uint64) (err error) {
out.munmap()
}
+ err = out.f.Truncate(int64(filesize))
+ if err != nil {
+ Exitf("resize output file failed: %v", err)
+ }
for {
if err = out.fallocate(filesize); err != syscall.EINTR {
break
@@ -33,10 +37,6 @@ func (out *OutBuf) Mmap(filesize uint64) (err error) {
return err
}
}
- err = out.f.Truncate(int64(filesize))
- if err != nil {
- Exitf("resize output file failed: %v", err)
- }
out.buf, err = syscall.Mmap(int(out.f.Fd()), 0, int(filesize), syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED|syscall.MAP_FILE)
if err != nil {
return err