go icon indicating copy to clipboard operation
go copied to clipboard

cmd/link: rare corruption of ELF binaries

Open stapelberg opened this issue 3 years ago • 41 comments

What version of Go are you using (go version)?

$ go version
go version go1.18 linux/amd64

Does this issue reproduce with the latest release?

Yes, Go 1.18 is the latest (stable) release.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/michael/.cache/go-build"
GOENV="/home/michael/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/michael/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/michael/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/home/michael/sdk/go1.18"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/home/michael/sdk/go1.18/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.18"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/michael/2022-06-25-squashfs-debug/go.mod"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1868362460=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I am building and deploying Go software from a cron job every day.

Recently, I noticed that sometimes, some of my executable binary files do not start up because they are corrupt!

The first time I noticed the issue, the init binary of one of my https://gokrazy.org/ installations was affected, resulting in an installation that wouldn’t boot at all.

The other time, it wasn’t the init binary, but a program of mine called regelwerk which is involved in motion sensor/light control in my home, so I noticed that because the lights weren’t working as they should.

It’s possible this happened more times and I just didn’t notice it.

Yesterday, I found someone on twitter who is also running into this issue, but with an entirely different program (not related to gokrazy at all): https://twitter.com/alvs_versteck/status/1546601648532983808

What did you expect to see?

The Go compiler/linker should produce ELF binaries that contain a valid ELF header.

What did you see instead?

The first 4096 bytes of the ELF binary are zeroed out, as well as another block of 4096 bytes at offset 256K.

You can find the files at https://t.zekjur.net/_2022-06-25-init/

In the other occurrence, it was 4096 bytes at the start of the ELF binary, then 4096 bytes at offset 0x9000.

Unfortunately I have no idea how to reproduce this issue.

stapelberg avatar Jul 12 '22 06:07 stapelberg

Are the corrupted binaries deterministic?

If not, I wonder if #51611 is related.

bcmills avatar Jul 12 '22 16:07 bcmills

Are the corrupted binaries deterministic?

WDYM by that, if they are always corrupted in the same way? If yes, I think so but currently only have one sample lying around where this happened so would need to wait for it to happen again to be able to tell for sure.

alvaroaleman avatar Jul 12 '22 16:07 alvaroaleman

Are the corrupted binaries deterministic?

I can’t reproduce the issue unfortunately. When I did another build right after the corrupted build, everything worked.

But what all the 3 different corrupted ELF binaries I have seen (two of my own, one from alvs) have had in common so far is that they had a block of all-zero bytes at their first 4096 bytes.

stapelberg avatar Jul 12 '22 16:07 stapelberg

CC @golang/compiler

mknyszek avatar Jul 12 '22 17:07 mknyszek

Is the corruption always on the ELF header? Is it on the same machine or on multiple machines? What the file system are you running on? Thanks.

cherrymui avatar Jul 12 '22 17:07 cherrymui

Is the corruption always on the ELF header?

I think so, I remember I got a couple of cannot execute binary file: Exec format error errors and rebuilding helped. I only have one lying around right now though and it happens intermittently, so I can't say for sure until it happens again.

Is it on the same machine or on multiple machines?

Multiple, mine and @stapelberg also sees it on his. Mine is a Lenovo t470s:

- Intel Core i7-7500U Prozessor (4 MB Cache, up to 3,5 GHz)
- 35,56 cm (14,0") WQHD (2.560 x 1.440) IPS-Display, Non-Touch
- 24 GB DDR4-2133 – 8 GB Onboard + 16 GB (1 SoDIMM)
- Intel HD-graphics 620
- [Crucial P5 Plus 1TB PCIe 4.0 3D NAND NVMe M.2 SSD, up to 6600MB/s - CT1000P5PSSD8](https://www.amazon.ca/gp/product/B098WL46RS)

What the file system are you running on?

xfs with Linux 5.17.9-200.fc35.x86_64

alvaroaleman avatar Jul 12 '22 18:07 alvaroaleman

Is the corruption always on the ELF header?

I don’t know if all instances of this corruption will always affect the ELF header.

What I can say is that whenever I have noticed corruption, the ELF header was corrupted.

Note that ELF header failures are rather loud in comparison (binary can’t be started), so perhaps this is selection bias — other corruption could exist, I just don’t know about it because it doesn’t fail as badly perhaps (just speculating here).

Is it on the same machine or on multiple machines?

Multiple machines: mine (see https://michael.stapelberg.ch/posts/2022-01-15-high-end-linux-pc/ for specs) and @alvaroaleman’s machine (can you share your specs, too, please?)

What the file system are you running on? Thanks.

I’m using ext4 on this Linux 5.17.7 machine, but note that both the source code (generated) and the binary output location are on /tmp, which is a tmpfs mount in my case. The Go build cache is on ~ on ext4, but given that we’re talking about the final ELF binary, I would think that the Go linker should always create that, not copy it from cache.

The build command I’m using is: go build -mod=mod -o /tmp/gokr-packer456452338/init -tags=gokrazy,netgo,osusergo /tmp/gokr-packer456452338/init.go

stapelberg avatar Jul 12 '22 18:07 stapelberg

Thanks. Frequent corruption on the ELF header is probably enough information. The ELF header is written by the linker only. The compiler and (most of) the Go command are unrelated. The one place that the go command touches the binary after linking is stamping the build ID. So it is either the linker or that build ID stamping.

Is your program a pure-Go binary or it uses cgo (i.e. whether it is internal linking or external linking)?

cherrymui avatar Jul 12 '22 18:07 cherrymui

Is your program a pure-Go binary or it uses cgo (i.e. whether it is internal linking or external linking)?

Mine is pure go (i.E. CGO_ENABLED=0)

alvaroaleman avatar Jul 12 '22 18:07 alvaroaleman

I am building with CGO_ENABLED=0, too.

stapelberg avatar Jul 12 '22 19:07 stapelberg

4096 = block size of ext4, file system bug?

mengzhuo avatar Jul 13 '22 03:07 mengzhuo

4096 = block size of ext4, file system bug?

Again: note that both the source code (generated) and the binary output location are on /tmp, which is a tmpfs mount in my case.

So I don’t think a file system bug is likely here.

stapelberg avatar Jul 13 '22 06:07 stapelberg

Thanks. Pure-Go means that the file is written by the Go linker, not the C linker.

If this is somewhat reproducible, could you try if this patch makes any difference? (You'll need to rebuild the linker, go install cmd/link)

diff --git a/src/cmd/link/internal/ld/outbuf_mmap.go b/src/cmd/link/internal/ld/outbuf_mmap.go
index 40a3222788..c3cc157579 100644
--- a/src/cmd/link/internal/ld/outbuf_mmap.go
+++ b/src/cmd/link/internal/ld/outbuf_mmap.go
@@ -9,6 +9,7 @@ package ld
 
 import (
        "syscall"
+       "unsafe"
 )
 
 // Mmap maps the output file with the given size. It unmaps the old mapping
@@ -55,6 +56,7 @@ func (out *OutBuf) munmap() {
        if out.buf == nil {
                return
        }
+       syscall.Syscall(syscall.SYS_MSYNC, uintptr(unsafe.Pointer(&out.buf[0])), uintptr(len(out.buf)), syscall.MS_SYNC) // XXX
        syscall.Munmap(out.buf)
        out.buf = nil
 }

cherrymui avatar Jul 13 '22 16:07 cherrymui

Also, do you know if this is a new bug, or it occurs with old versions of Go as well? Thanks.

cherrymui avatar Jul 13 '22 16:07 cherrymui

@cherrymui I haven't seen this before using go 1.18, so I guess it is new. I did some more experimenting and found out:

  • I can somewhat reliably reproduce this by go building in a loop, in five attempts it always took less than 200 builds
  • It is not reproducible in a docker container, so seems specific to my env
  • It is reproducible with an empty go cache
  • It is not reproducible with a hello world go program, so seems to either be specific to my project or maybe just the size of the codebase

Regarding your patch, I suppose I need to replace the link binary in the GOTOOLDIR? I did that and the issue still happens. I then looked more at the link binary and it has the same sha256sum regardless of applying your patch or not...?

$ g diff
diff --git a/src/cmd/link/internal/ld/outbuf_mmap.go b/src/cmd/link/internal/ld/outbuf_mmap.go
index 40a3222788..c3cc157579 100644
--- a/src/cmd/link/internal/ld/outbuf_mmap.go
+++ b/src/cmd/link/internal/ld/outbuf_mmap.go
@@ -9,6 +9,7 @@ package ld
 
 import (
        "syscall"
+       "unsafe"
 )
 
 // Mmap maps the output file with the given size. It unmaps the old mapping
@@ -55,6 +56,7 @@ func (out *OutBuf) munmap() {
        if out.buf == nil {
                return
        }
+       syscall.Syscall(syscall.SYS_MSYNC, uintptr(unsafe.Pointer(&out.buf[0])), uintptr(len(out.buf)), syscall.MS_SYNC) // XXX
        syscall.Munmap(out.buf)
        out.buf = nil
 }
[2022-07-13-23:17]:[app.ci@ci][master][1]~/.../goroot/src
$ go build -o /tmp/link cmd/link/
[2022-07-13-23:17]:[app.ci@ci][master][1]~/.../goroot/src
$ sha256sum /tmp/link
e7e671ba4d83e6f3a820ba3f0269ef21e26f2c477938499d389f58a696ec7e19  /tmp/link
[2022-07-13-23:17]:[app.ci@ci][master][1]~/.../goroot/src
$ g reset --hard
HEAD is now at 558785a0a9 cmd/compile: remove -installsuffix flag
[2022-07-13-23:17]:[app.ci@ci][master][1]~/.../goroot/src
$ go build -o /tmp/link cmd/link/
[2022-07-13-23:17]:[app.ci@ci][master][1]~/.../goroot/src
$ sha256sum /tmp/link
e7e671ba4d83e6f3a820ba3f0269ef21e26f2c477938499d389f58a696ec7e19  /tmp/link

alvaroaleman avatar Jul 14 '22 03:07 alvaroaleman

This is weird. The patch does make a difference on my machine

$ git status
On branch master
Your branch is up to date with 'origin/master'.

nothing to commit, working tree clean
$ git log -1 --oneline
558785a0a9 (HEAD -> master, origin/master, origin/HEAD) cmd/compile: remove -installsuffix flag
$ go build -o /tmp/link cmd/link
$ sha256sum /tmp/link 
173de0c0efed5ea3c57ef497effe803f20c60603889de4c64cb1ac726b7ed45f  /tmp/link
$ git apply </tmp/p
$ git diff
diff --git a/src/cmd/link/internal/ld/outbuf_mmap.go b/src/cmd/link/internal/ld/outbuf_mmap.go
index 40a3222788..c3cc157579 100644
--- a/src/cmd/link/internal/ld/outbuf_mmap.go
+++ b/src/cmd/link/internal/ld/outbuf_mmap.go
@@ -9,6 +9,7 @@ package ld
 
 import (
        "syscall"
+       "unsafe"
 )
 
 // Mmap maps the output file with the given size. It unmaps the old mapping
@@ -55,6 +56,7 @@ func (out *OutBuf) munmap() {
        if out.buf == nil {
                return
        }
+       syscall.Syscall(syscall.SYS_MSYNC, uintptr(unsafe.Pointer(&out.buf[0])), uintptr(len(out.buf)), syscall.MS_SYNC) // XXX
        syscall.Munmap(out.buf)
        out.buf = nil
 }
$ go build -o /tmp/link cmd/link
$ sha256sum /tmp/link 
3237cd32022b5b6cee6a9286cdec0d10e7e24455d9ee610553e8caecd8754c81  /tmp/link
$ git reset --hard
HEAD is now at 558785a0a9 cmd/compile: remove -installsuffix flag
$ go build -o /tmp/link cmd/link
$ sha256sum /tmp/link 
173de0c0efed5ea3c57ef497effe803f20c60603889de4c64cb1ac726b7ed45f  /tmp/link

cherrymui avatar Jul 14 '22 15:07 cherrymui

Maybe I messed up tabs and spaces when I first pasted the patch, so it didn't directly apply. Try this.

diff --git a/src/cmd/link/internal/ld/outbuf_mmap.go b/src/cmd/link/internal/ld/outbuf_mmap.go
index 40a3222788..c3cc157579 100644
--- a/src/cmd/link/internal/ld/outbuf_mmap.go
+++ b/src/cmd/link/internal/ld/outbuf_mmap.go
@@ -9,6 +9,7 @@ package ld
 
 import (
 	"syscall"
+	"unsafe"
 )
 
 // Mmap maps the output file with the given size. It unmaps the old mapping
@@ -55,6 +56,7 @@ func (out *OutBuf) munmap() {
 	if out.buf == nil {
 		return
 	}
+	syscall.Syscall(syscall.SYS_MSYNC, uintptr(unsafe.Pointer(&out.buf[0])), uintptr(len(out.buf)), syscall.MS_SYNC) // XXX
 	syscall.Munmap(out.buf)
 	out.buf = nil
 }

cherrymui avatar Jul 14 '22 15:07 cherrymui

@alvaroaleman I think the problem you ran into when trying to apply the patch is:

go build -o /tmp/link cmd/link/

Go will not use the local directory, but your installation’s cmd/link source instead. You need to use:

go build -o /tmp/link ./cmd/link/

You can verify the code was updated by using go tool objdump ./pkg/tool/linux_amd64/link. Search the output for outbuf_mmap.go and look for the munmap code. You should see disassembly like:

  outbuf_mmap.go:55     0x5ae5d2                4889442450              MOVQ AX, 0x50(SP)                               
  outbuf_mmap.go:59     0x5ae5d7                4889542438              MOVQ DX, 0x38(SP)                               
  outbuf_mmap.go:59     0x5ae5dc                488b4818                MOVQ 0x18(AX), CX                               
  outbuf_mmap.go:59     0x5ae5e0                48c704241a000000        MOVQ $0x1a, 0(SP)                               
  outbuf_mmap.go:59     0x5ae5e8                4889542408              MOVQ DX, 0x8(SP)                                
  outbuf_mmap.go:59     0x5ae5ed                48894c2410              MOVQ CX, 0x10(SP)                               
  outbuf_mmap.go:59     0x5ae5f2                48c744241804000000      MOVQ $0x4, 0x18(SP)                             
  outbuf_mmap.go:59     0x5ae5fb                0f1f440000              NOPL 0(AX)(AX*1)                                
  outbuf_mmap.go:59     0x5ae600                e8bbf0eeff              CALL syscall.Syscall.abi0(SB)                   
  outbuf_mmap.go:59     0x5ae605                450f57ff                XORPS X15, X15                                  
  outbuf_mmap.go:59     0x5ae609                644c8b3425f8ffffff      MOVQ FS:0xfffffff8, R14                         
  outbuf_mmap.go:60     0x5ae612                488b442450              MOVQ 0x50(SP), AX                               
  outbuf_mmap.go:60     0x5ae617                488b5810                MOVQ 0x10(AX), BX                               
  outbuf_mmap.go:60     0x5ae61b                488b4818                MOVQ 0x18(AX), CX                               
  outbuf_mmap.go:60     0x5ae61f                488b7820                MOVQ 0x20(AX), DI                               
  syscall_linux.go:1142 0x5ae623                488b1576be2600          MOVQ syscall.mapper(SB), DX                     
  syscall_linux.go:1142 0x5ae62a                4889d0                  MOVQ DX, AX                                     
  syscall_linux.go:1142 0x5ae62d                e88eceeeff              CALL syscall.(*mmapper).Munmap(SB)              

If you can’t find munmap, look for Close — maybe the code got inlined.

If you only see the syscall.(*mmapper).Munmap call, the patch was not applied correctly. If you see the syscall.Syscall.abi0 call, the patch was applied correctly.

I have applied the patch on my system and will report back if I still encounter the corruption. However, as I can’t trigger it even when building hundreds of times in a loop, it would be better if @alvaroaleman could report back, as it sounds like it happens more frequently for you?

Thanks

stapelberg avatar Jul 18 '22 08:07 stapelberg

  • We also encountered the same problem ( #53116) , On some linux/loong64 machines, an "exec format error" error occasionally occurs when executing the make.bash or all.bash. Check these executable files with errors, it is found that the data of 16K(only one pagesize) at the beginning of the file is missing. The result of hexdump is as follows:
   1  0000000 0000 0000 0000 0000 0000 0000 0000 0000
   2  *
   3  0004000 0020 4c00 3d24 1c01 d084 02fc 2064 29c0
   4  0004010 9004 02c0 4064 29c0 4800 576a 0000 0014
   5  0004020 62d3 28c0 8e73 0012 1260 5c00 0025 0015

This problem is caused by the occasional loss of 16K data at the beginning of the file after the link process calls the syscall.Fallocate function for the second time. However, not all machines can reproduce this error (the file system formats I tested include xfs, ext4, tmfs ).

  • my kernel version is 5.17.0+, pagesize is 16K。
  • In the case of ensuring enough disk space, I removed the second fallocate call, the problem no longer occurs.
  • I tried doing MS_SYNC before munmap, but the problem persists
  • I tested the kernel with version 4.19.190, but this problem did not occur

abner-chenc avatar Jul 20 '22 03:07 abner-chenc

Thanks for adding to this report, @abner-chenc!

I also just encountered this corruption again today, with @cherrymui’s patch applied — so the extra MSYNC does not help unfortunately.

stapelberg avatar Jul 20 '22 06:07 stapelberg

I tried using posix_fallocate (the implementation refers to glib's posix_fallocate.c) replaces syscall.fallcate, and this error does not occur on the same machine , but this will cost performance.

@stapelberg, Here is my implementation of posix_fallocate: https://github.com/abner-chenc/go-abner/commit/c43a4a75a031ac9550eb02697b50032f1dced9ae

abner-chenc avatar Jul 20 '22 07:07 abner-chenc

@stapelberg thanks for the hint, using go build -o /tmp/link ./cmd/link just resulted in main module (std) does not contain package std/cmd/link and building from the cmd folder as go build -o /tmp/link ./link still ended up using my installations cmd/link. I ended up applying the patch there, but as you already noted, it doesn't fix the issue.

alvaroaleman avatar Jul 20 '22 15:07 alvaroaleman

Seems like @abner-chenc patch is indeed fixing this, I couldn't reproduce the issue anymore in 500 builds when I never needed more than 200 for it to occur prior to their patch.

alvaroaleman avatar Jul 20 '22 15:07 alvaroaleman

Are all the Exec format error case that the first page is missing? Or (sometimes) just the ELF header?

Does the fallocate syscall not work on your file system? But it also doesn't return an error? The fallocate isn't really necessary unless you're running low on disk space. Is that the case on your system? What happens if you just comment that fallocate call out?

@abner-chenc your patch also reorders Truncate and fallocate. What happens if you don't reorder them, or just reorder them without doing the rest?

In the case of ensuring enough disk space, I removed the second fallocate call, the problem no longer occurs.

Could you explain what you did exactly? What is "the second fallocate call"?

@alvaroaleman it looks like @abner-chenc 's patch only affects Loong64. Just to make sure, did you change it to apply to your architecture as well?

Thanks.

cherrymui avatar Jul 20 '22 16:07 cherrymui

@abner-chenc are you also running on tmpfs?

@alvaroaleman @stapelberg does it reproduce on a non-tmpfs file system?

Thanks.

cherrymui avatar Jul 20 '22 16:07 cherrymui

@cherrymui I just applied https://github.com/abner-chenc/go-abner/commit/c43a4a75a031ac9550eb02697b50032f1dced9ae.patch without any changes, then built the link binary. While testing it, I output the binary to /tmp so a tmpfs, but when I orginally encountered the error I had targeted an xfs fs. I can check later if it also fixes the issue when targeting the xfs fs.

alvaroaleman avatar Jul 20 '22 19:07 alvaroaleman

That patch should not make any difference if you're not running on Loong64. Would be good to try on xfs. Thanks.

cherrymui avatar Jul 20 '22 20:07 cherrymui

@cherrymui well, but it does. I can reproduce this issue with a link binary without that patch (most of the time in less than 50 tries, all the time so far in less than 200) and can not reproduce it with one that has the patch in 500 tries. I just verified this again, this time both when outputting to tmpfs and xfs which doesn't change the result.

The patch also moves the Truncate call around and that happens on all archs for which outbufs_mmap.go is compiled - Maybe that is why?

alvaroaleman avatar Jul 20 '22 22:07 alvaroaleman

Thanks. Could be. Could you try just moving the Truncate call without the rest of the change? Thanks.

cherrymui avatar Jul 20 '22 22:07 cherrymui

@cherrymui yes, reducing the patch to only the changes in outbufs_mmap.go as shown below is enough to make the issue go away. Thanks for taking the time to look into this btw!

--- a/src/cmd/link/internal/ld/outbuf_mmap.go
+++ b/src/cmd/link/internal/ld/outbuf_mmap.go
@@ -20,6 +20,10 @@ func (out *OutBuf) Mmap(filesize uint64) (err error) {
                out.munmap()
        }
 
+       err = out.f.Truncate(int64(filesize))
+       if err != nil {
+               Exitf("resize output file failed: %v", err)
+       }
        for {
                if err = out.fallocate(filesize); err != syscall.EINTR {
                        break
@@ -33,10 +37,6 @@ func (out *OutBuf) Mmap(filesize uint64) (err error) {
                        return err
                }
        }
-       err = out.f.Truncate(int64(filesize))
-       if err != nil {
-               Exitf("resize output file failed: %v", err)
-       }
        out.buf, err = syscall.Mmap(int(out.f.Fd()), 0, int(filesize), syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED|syscall.MAP_FILE)
        if err != nil {
                return err

alvaroaleman avatar Jul 20 '22 23:07 alvaroaleman