
runtime: unexpected return pc crash on linux-amd64-alpine builder

Open rsc opened this issue 2 years ago • 2 comments

The revived linux-amd64-alpine builder has flaked twice in its short new lifetime with 'unexpected return pc' crashes during the cgo tests.

Here is a repro case using a gomote (note that if you ssh in, you have to set up your environment manually, and in particular you have to put /workdir/go/bin at the front of PATH and have to set GOROOT_BOOTSTRAP=/workdir/go1.4). Not sure why the environment is so messed up on Alpine. gomote run does not have these problems, only gomote ssh.

VM=$(gomote create linux-amd64-alpine)
gomote push $VM
gomote run $VM go/src/make.bash
gomote put -mode 0777 $VM - try.sh <<'EOF'
#!/bin/bash
cd /workdir/go/misc/cgo/test
for i in $(seq 100); do 
    date
    if ! /workdir/go/bin/go test >log 2>&1; then
        cat log
    fi
done
EOF
gomote run $VM try.sh

You may need to repeat the try.sh a few times depending on how flaky the machine is feeling but most runs get at least one failure.

Here are some failures from that script:

runtime: g 3: unexpected return pc for runtime.gcenable.func1 called from 0x0
stack: frame={sp:0xc0000557c8, fp:0xc0000557e0} stack=[0xc000055000,0xc000055800)
0x000000c0000556c8:  0x000000c000055750  0x000000000040d21d <runtime.chansend+0x000000000000055d> 
0x000000c0000556d8:  0x0000000000581220  0x000000c00007e060 
0x000000c0000556e8:  0x00000000005e9f78  0x0000000000000000 
0x000000c0000556f8:  0x0000000000000000  0x0000000000000000 
0x000000c000055708:  0x0000000000000000  0x0000000000000000 
0x000000c000055718:  0x000000c00007e058  0x0000000000000000 
0x000000c000055728:  0x0000000000000000  0x0000000000000000 
0x000000c000055738:  0x0000000000000000  0x0000000000000000 
0x000000c000055748:  0x0000000000000000  0x000000c000055780 
0x000000c000055758:  0x000000000040cc9d <runtime.chansend1+0x000000000000001d>  0x000000c00007e000 
0x000000c000055768:  0x0000000000440bb6 <runtime.gopark+0x00000000000000d6>  0x0000000000000001 
0x000000c000055778:  0x0000000000000000  0x000000c0000557b8 
0x000000c000055788:  0x000000000042ba2e <runtime.bgsweep+0x000000000000008e>  0x0000000000000000 
0x000000c000055798:  0x0000000000000000  0x0000000000000000 
0x000000c0000557a8:  0x0000000000000000  0x000000c00007e000 
0x000000c0000557b8:  0x000000c0000557d0  0x0000000000420706 <runtime.gcenable.func1+0x0000000000000026> 
0x000000c0000557c8: <0x00007f8a890934b6  0x00007f8a61816b64 
0x000000c0000557d8: !0x0000000000000000 >0x0000000000000000 
0x000000c0000557e8:  0x0000000000000000  0x00007f8a890d3600 
0x000000c0000557f8:  0x00007f8a89092acf 
fatal error: unknown caller pc
runtime: g 19: unexpected return pc for runtime.gcenable.func2 called from 0x0
stack: frame={sp:0xc000050fc8, fp:0xc000050fe0} stack=[0xc000050800,0xc000051000)
0x000000c000050ec8:  0x000000000000000e  0x000000c0000061a0 
0x000000c000050ed8:  0x000000c000050f60  0x000000000040d265 <runtime.chansend+0x00000000000005a5> 
0x000000c000050ee8:  0x0000000000000050  0x000000c00009c000 
0x000000c000050ef8:  0x0000000000000000  0x0000010000000000 
0x000000c000050f08:  0x0000000000000003  0x0000000000000030 
0x000000c000050f18:  0x0000000000000000  0x0000000000000050 
0x000000c000050f28:  0x000000c000096058  0x000000c00007e000 
0x000000c000050f38:  0x0000000000000000  0x0000000000000000 
0x000000c000050f48:  0x0000000000440bb6 <runtime.gopark+0x00000000000000d6>  0x000000000040d320 <runtime.chansend.func1+0x0000000000000000> 
0x000000c000050f58:  0x000000c000096000  0x000000c000050f90 
0x000000c000050f68:  0x0000000000429ad3 <runtime.(*scavengerState).park+0x0000000000000053>  0x000000c000096000 
0x000000c000050f78:  0x00000000005e9f78  0x0000000000000001 
0x000000c000050f88:  0x0000000000000000  0x000000c000050fb8 
0x000000c000050f98:  0x000000000042a0a5 <runtime.bgscavenge+0x0000000000000045>  0x00000000006f9960 
0x000000c000050fa8:  0x0000000000000000  0x000000c000096000 
0x000000c000050fb8:  0x000000c000050fd0  0x00000000004206a6 <runtime.gcenable.func2+0x0000000000000026> 
0x000000c000050fc8: <0x00007f47256144b6  0x00007f46fdea3b64 
0x000000c000050fd8: !0x0000000000000000 >0x0000000000000000 
0x000000c000050fe8:  0x0000000000000000  0x00007f4725654600 
0x000000c000050ff8:  0x00007f4725613acf 
fatal error: unknown caller pc

This one did not happen during garbage collection:

runtime: g 20: unexpected return pc for testing.tRunner called from 0x7feeabb0dacf
stack: frame={sp:0xc000051770, fp:0xc0000517c0} stack=[0xc000051000,0xc000051800)
0x000000c000051670:  0x000000012a05f200  0x000000c0000880a0 
0x000000c000051680:  0x000000c000094180  0x000000c0000516f8 
0x000000c000051690:  0x000000c000102b80  0x000000c000102b60 
0x000000c0000516a0:  0x0000000000000000  0x00000000005890c0 
0x000000c0000516b0:  0x00000000006d7d50  0x0000000000000000 
0x000000c0000516c0:  0x0000000000000000  0x0000000000000000 
0x000000c0000516d0:  0x0000000000000000  0x000000c000051730 
0x000000c0000516e0:  0x0000000000454a36 <runtime.sigpanic+0x00000000000002f6>  0x00000000005890c0 
0x000000c0000516f0:  0x00000000006d7d50  0x000000c000051748 
0x000000c000051700:  0x0000000000561ceb <misc/cgo/test.testSetgid+0x00000000000000ab>  0x000000c0001121e0 
0x000000c000051710:  0x000000c000102b60  0x0000000000000001 
0x000000c000051720:  0x00000000006ea660  0x00000000005eb418 
0x000000c000051730:  0x000000c000051760  0x0000000000478bfe <sync.(*RWMutex).Lock+0x000000000000001e> 
0x000000c000051740:  0x0000000000000000  0x000000c000051760 
0x000000c000051750:  0x0000000000526bd9 <misc/cgo/test.TestSetgid+0x0000000000000019>  0x000000c0001029c0 
0x000000c000051760:  0x000000c0000517b0  0x00000000004d6d15 <testing.tRunner+0x0000000000000115> 
0x000000c000051770: <0x0000000000000000  0x0300000000000000 
0x000000c000051780:  0x00000000004d6d80 <testing.tRunner.func2+0x0000000000000000>  0x00007feeabb0e4b6 
0x000000c000051790:  0x00007feeabb4ed8c  0x0000000000000000 
0x000000c0000517a0:  0x0000000000000000  0x0000000000000000 
0x000000c0000517b0:  0x00007feeabb4e600 !0x00007feeabb0dacf 
0x000000c0000517c0: >0x0000000000000000  0x00000000ffffffff 
0x000000c0000517d0:  0x0000000000000000  0x00000000004710a1 <runtime.goexit+0x0000000000000001> 
0x000000c0000517e0:  0x0000000000000000  0x0000000000000000 
0x000000c0000517f0:  0x0000000000000000  0x00007feeabb0e5d2 
fatal error: unknown caller pc

runtime stack:
runtime.throw({0x5ae5a1?, 0x6ea660?})
	/workdir/go/src/runtime/panic.go:1047 +0x5d fp=0x7fee843e3648 sp=0x7fee843e3618 pc=0x43de7d
runtime.gentraceback(0x100000000467aba?, 0xc000100000?, 0xc000102b60?, 0x7fee843e3a18?, 0x0, 0x0, 0x7fffffff, 0x7fee843e3a08, 0x0?, 0x0)
	/workdir/go/src/runtime/traceback.go:258 +0x1cf7 fp=0x7fee843e39b8 sp=0x7fee843e3648 pc=0x4658b7
runtime.addOneOpenDeferFrame.func1()
	/workdir/go/src/runtime/panic.go:645 +0x6b fp=0x7fee843e3a30 sp=0x7fee843e39b8 pc=0x43d00b
runtime.systemstack()
	/workdir/go/src/runtime/asm_amd64.s:492 +0x49 fp=0x7fee843e3a38 sp=0x7fee843e3a30 pc=0x46eee9

goroutine 20 [running]:
runtime.systemstack_switch()
	/workdir/go/src/runtime/asm_amd64.s:459 fp=0xc0000515e8 sp=0xc0000515e0 pc=0x46ee80
runtime.addOneOpenDeferFrame(0xc0000221e0?, 0xc000094180?, 0xc000112180?)
	/workdir/go/src/runtime/panic.go:644 +0x69 fp=0xc000051628 sp=0xc0000515e8 pc=0x43cf49
panic({0x5890c0, 0x6d7d50})
	/workdir/go/src/runtime/panic.go:844 +0x112 fp=0xc0000516e8 sp=0xc000051628 pc=0x43d792
runtime.panicmem(...)
	/workdir/go/src/runtime/panic.go:260
runtime.sigpanic()
	/workdir/go/src/runtime/signal_unix.go:837 +0x2f6 fp=0xc000051740 sp=0xc0000516e8 pc=0x454a36
sync.(*RWMutex).Lock(0x0?)
	/workdir/go/src/sync/rwmutex.go:147 +0x1e fp=0xc000051770 sp=0xc000051740 pc=0x478bfe

Here are the two build dashboard failures:

https://build.golang.org/log/658036e08c7a1d218c33808fdd1d6612b40502d8

runtime: g 2: unexpected return pc for runtime.forcegchelper called from 0x0
stack: frame={sp:0xc000056fb0, fp:0xc000056fe0} stack=[0xc000056800,0xc000057000)
0x000000c000056eb0:  0x0000000000000000  0x0000000000000000 
0x000000c000056ec0:  0x0000000000000000  0x0000000000000000 
0x000000c000056ed0:  0x0000000000000000  0x0000000000000000 
0x000000c000056ee0:  0x0000000000000000  0x0000000000000000 
0x000000c000056ef0:  0x0000000000000000  0x0000000000000000 
0x000000c000056f00:  0x0000000000000000  0x0000000000000000 
0x000000c000056f10:  0x0000000000000000  0x0000000000000000 
0x000000c000056f20:  0x0000000000000000  0x0000000000000000 
0x000000c000056f30:  0x0000000000000000  0x0000000000000000 
0x000000c000056f40:  0x0000000000000000  0x0000000000000000 
0x000000c000056f50:  0x0000000000000000  0x0000000000000000 
0x000000c000056f60:  0x0000000000000000  0x0000000000000000 
0x000000c000056f70:  0x0000000000000000  0x0000000000000000 
0x000000c000056f80:  0x0000000000000000  0x00005637530dbdb6 <runtime.gopark+0x00000000000000d6> 
0x000000c000056f90:  0x0000000000000000  0x0000000000000000 
0x000000c000056fa0:  0x000000c000056fd0  0x00005637530dbc4d <runtime.forcegchelper+0x00000000000000ad> 
0x000000c000056fb0: <0x0000000000000000  0x0000000000000000 
0x000000c000056fc0:  0x0000000000000000  0x00007efee325e4b6 
0x000000c000056fd0:  0x00007efebba04b64 !0x0000000000000000 
0x000000c000056fe0: >0x0000000000000000  0x0000000000000000 
0x000000c000056ff0:  0x00007efee329e600  0x00007efee325dacf 
fatal error: unknown caller pc

and

https://build.golang.org/log/94cf14d78b116487dc76a921baf6ba76480a4c7a

runtime: g 5: unexpected return pc for runtime.sigpanic called from 0x7f52c162dd8c
stack: frame={sp:0xc000058700, fp:0xc000058758} stack=[0xc000058000,0xc000058800)
0x000000c000058600:  0x0000564cf403107b <runtime.write+0x000000000000003b>  0x0000000000000002 
0x000000c000058610:  0x000000c000058648  0x0000564cf40109ce <runtime.recordForPanic+0x000000000000004e> 
0x000000c000058620:  0x0000564cf403107b <runtime.write+0x000000000000003b>  0x0000000000000002 
0x000000c000058630:  0x0000564cf4144017  0x0000000000000001 
0x000000c000058640:  0x0000000000000001  0x000000c000058680 
0x000000c000058650:  0x0000564cf4010cd2 <runtime.gwrite+0x00000000000000f2>  0x0000564cf4144017 
0x000000c000058660:  0x0000000000000001  0x0000000000000001 
0x000000c000058670:  0x000000c0000586e2  0x000000000000000e 
0x000000c000058680:  0x0000564cf4040210 <runtime.systemstack+0x0000000000000030>  0x0000564cf400f3cc <runtime.fatalthrow+0x000000000000006c> 
0x000000c000058690:  0x000000c0000586a0  0x000000c000007ba0 
0x000000c0000586a0:  0x0000564cf400f400 <runtime.fatalthrow.func1+0x0000000000000000>  0x000000c000007ba0 
0x000000c0000586b0:  0x0000564cf400f07f <runtime.throw+0x000000000000005f>  0x000000c0000586d0 
0x000000c0000586c0:  0x000000c0000586f0  0x0000564cf400f07f <runtime.throw+0x000000000000005f> 
0x000000c0000586d0:  0x000000c0000586d8  0x0000564cf400f0a0 <runtime.throw.func1+0x0000000000000000> 
0x000000c0000586e0:  0x0000564cf414445e  0x0000000000000005 
0x000000c0000586f0:  0x000000c000058748  0x0000564cf4025ca5 <runtime.sigpanic+0x00000000000002c5> 
0x000000c000058700: <0x0000564cf414445e  0x000000c0000161e0 
0x000000c000058710:  0x000000c000058728  0x0000000000000001 
0x000000c000058720:  0x00007f52c162dd8c  0x000000c000007ba0 
0x000000c000058730:  0x0000564cf41800e0  0x0000564cf40a7e14 <testing.tRunner+0x0000000000000034> 
0x000000c000058740:  0x0000000000000000  0x00007f52c15ed4b6 
0x000000c000058750: !0x00007f52c162dd8c >0x0000000000000000 
0x000000c000058760:  0x0000000000000000  0x0000000000000000 
0x000000c000058770:  0x00007f52c162d600  0x00007f52c15ecacf 
0x000000c000058780:  0x0000000000000000  0x00000000ffffffff 
0x000000c000058790:  0x0000564cf40a7fa0 <testing.tRunner.func1+0x0000000000000000>  0x000000c000007a00 
0x000000c0000587a0:  0x000000c000058780  0x000000c000058790 
0x000000c0000587b0:  0x000000c0000587d0  0x00007f52c15ed5d2 
0x000000c0000587c0:  0x00007f52c15f0080  0x00007f52c162d600 
0x000000c0000587d0:  0x00000000ffffffff  0x00007f52c15efbbb 
0x000000c0000587e0:  0x0000000000000000  0x00007f52c15efb6d 
0x000000c0000587f0:  0x00007f52c162d604  0x0000000000000000 

Perhaps this is Alpine-specific, or perhaps it is musl-related. The Alpine image may have an old Linux kernel; maybe we should update it.

There are a few other open 'unexpected return pc' issues. Maybe they are all stale:

  • #47003 is Go 1.16 on Ubuntu.
  • #35005 is Go 1.13 on Alpine 3.10 (but disappears on Debian and on Alpine 3.9.4).
  • #40401 is Go 1.14.6 on Windows.
  • #40469 is Go 1.13.14 on Windows.
  • #51707 is Go 1.16.2 on an unspecified system.
  • #43496 is Go 1.15.6 on Debian (Docker golang image).

#35005 is the most interesting one but the repro case is a very large program running under Docker.

rsc avatar Aug 05 '22 20:08 rsc

Change https://go.dev/cl/422097 mentions this issue: env/linux-x86-alpine: update to Alpine 3.16

gopherbot avatar Aug 08 '22 21:08 gopherbot

@rsc Assigning to you right now while you're updating the image, but feel free to unassign once you're done.

Updates from internal discussion:

  • Seems to reproduce after the update, too.
  • Maybe related to stack smashing protections?

CC @cherrymui @mdempsky

mknyszek avatar Aug 10 '22 19:08 mknyszek

I updated the image already, just need to submit the CL.

rsc avatar Aug 10 '22 21:08 rsc

Failures since the 10th, when https://go.dev/cl/422097 was submitted (I'm almost certain the coordinator has been redeployed since then):

2022-08-16T20:39:44-e49e876/linux-amd64-alpine
2022-08-12T16:38:52-f001df5/linux-amd64-alpine
2022-08-12T01:51:51-449691b/linux-amd64-alpine

(Edit: rereading the CL, I see it wasn't intended to fix this)

prattmic avatar Aug 16 '22 21:08 prattmic

This is reproducible with go test in misc/cgo/test. Specifically, on a linux-amd64-alpine gomote, I had success with:

$ for i in $(seq 1 100); do gomote run -dir ./go/misc/cgo/test $INSTANCE ./go/bin/go test -count 1; done

prattmic avatar Aug 16 '22 21:08 prattmic

Elsewhere, @cherrymui mentioned that this looks like it could be corruption from a stack overflow. I agree. In the partial trace below, everything below 0xc000122770 looks like Go text or heap addresses, but above 0xc000122770 we see lots of system-looking addresses (0x00007fd...). And this is right at the top of the stack, where the stack below may be overflowing.

goroutine 37 [syscall (scan)]:
runtime: g 37: unexpected return pc for runtime.notetsleepg called from 0x0
stack: frame={sp:0xc000122768, fp:0xc0001227a0} stack=[0xc000122000,0xc000122800)
0x000000c000122000:  0x0000000000000000  0x0000000000000000 
... mostly zeroes ...
0x000000c0001225f0:  0x0000000000000000  0x0000000000000000 
0x000000c000122600:  0x0000000000000000  0x0000000000000000 
0x000000c000122610:  0x0000000000000000  0x0000000000000000 
0x000000c000122620:  0x0000000000000000  0x0000000000000000 
0x000000c000122630:  0x0000000000000000  0x0000000000000000 
0x000000c000122640:  0x000000c046505845  0x000000c000122688 
0x000000c000122650:  0x0000000000487d77 <time.NewTimer+0x00000000000000b7>  0x000000012a05f200 
0x000000c000122660:  0x0000000000000001  0x000005ff66f38c41 
0x000000c000122670:  0x000000012a05f200  0x000000c000118050 
0x000000c000122680:  0x000000c000126060  0x000000c0001226f8 
0x000000c000122690:  0x0000000000564979 <misc/cgo/test.runTestSetgid+0x0000000000000079>  0x0000000000606b28 
0x000000c0001226a0:  0x0000000000000000  0x000000000061acf3 
0x000000c0001226b0:  0x0000000000000022  0x000000000000051b 
0x000000c0001226c0:  0x00000000004d9940 <testing.tRunner+0x0000000000000000>  0x00000000006af138 
0x000000c0001226d0:  0x000000000043bdd6 <runtime.futexsleep+0x0000000000000036>  0x0000000000773660 
0x000000c0001226e0:  0x0000000000000080  0x0000000000000000 
0x000000c0001226f0:  0x0000000000000000  0x0000000000000000 
0x000000c000122700:  0x00000000005bf4f0  0x0000000300000002 
0x000000c000122710:  0x000000c00010a820  0x000000c000122758 
0x000000c000122720:  0x0000000000415045 <runtime.notetsleep_internal+0x0000000000000185>  0x000000c000122760 
0x000000c000122730:  0x00000000004d97a5 <testing.callerName+0x0000000000000045>  0x00000000004d9974 <testing.tRunner+0x0000000000000034> 
0x000000c000122740:  0x0000000000000000  0xffffffffffffffff 
0x000000c000122750:  0x000000c00010a820  0x000000c000122790 
0x000000c000122760:  0x0000000000415165 <runtime.notetsleepg+0x0000000000000045> <0x0000000000773660 
0x000000c000122770:  0x0000000000000000  0x00007fd7fa3c46fa 
0x000000c000122780:  0x00007fd7fa408b84  0x0000000000000000 
0x000000c000122790:  0x0000000000000000 !0x0000000000000000 
0x000000c0001227a0: >0xffffffffffffffff  0x00007fd7fa3c3cf5 
0x000000c0001227b0:  0x0000000000000000  0x0000000000000001 
0x000000c0001227c0:  0x000000c00010a340  0x00000000005bcb50 
0x000000c0001227d0:  0x00007fd7fa408400  0x00007fd7fa3c4820 
0x000000c0001227e0:  0x00007fd7fa3c74a3  0x00007fd7fa408400 
0x000000c0001227f0:  0x0000000000000000  0x00007fd7fa3c6fca 
runtime.notetsleepg(0xffffffffffffffff?, 0x7fd7fa3c3cf5?)
        /workdir/go/src/runtime/lock_futex.go:236 +0x34 fp=0xc0001227a0 sp=0xc000122768 pc=0x415154
created by os/signal.Notify.func1.1
        /workdir/go/src/os/signal/signal.go:151 +0x2a
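
The "Go address vs. system-looking address" reading of the dump above can be sketched as a small classifier. This is only an illustration: the address ranges are assumptions that happen to fit this particular linux/amd64 dump (text near 0x400000, Go heap and stacks at 0xc0..., shared libraries and thread stacks at 0x7f...), and classify is a hypothetical helper, not something the runtime provides.

```go
package main

import "fmt"

// Stack bounds copied from the "stack=[lo,hi)" line of the trace above.
const (
	stackLo = 0xc000122000
	stackHi = 0xc000122800
)

// classify guesses what a word found on the goroutine stack refers to.
// The ranges are heuristics assumed for this one dump, not rules that
// the Go runtime applies.
func classify(w uint64) string {
	switch {
	case w == 0:
		return "zero"
	case w >= stackLo && w < stackHi:
		return "this goroutine stack"
	case w>>32 == 0xc0:
		return "Go heap or another Go stack"
	case w >= 0x400000 && w < 0x800000:
		return "Go text or data"
	case w>>40 == 0x7f:
		return "mmap region (libc, thread stacks)"
	default:
		return "other"
	}
}

func main() {
	// A few words taken from the dump, from 0xc000122760 upward.
	words := []uint64{
		0x415165, 0x773660, 0x0, 0x7fd7fa3c46fa,
		0x7fd7fa408b84, 0xc000122790, 0x7fd7fa3c3cf5,
	}
	for _, w := range words {
		fmt.Printf("0x%016x  %s\n", w, classify(w))
	}
}
```

Run over the words at the top of the stack, the 0x7f... values stand out as the ones that a Go frame would not normally have written there.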

prattmic avatar Aug 17 '22 17:08 prattmic

FWIW, I've been unable to reproduce this with _StackLimit increased by 10x, which seems consistent with a stack overflow somewhere.

prattmic avatar Aug 17 '22 20:08 prattmic

I take that back. It took about an hour (instead of the usual ~5 minutes), but I did get a repro with 10 * _StackLimit.

prattmic avatar Aug 17 '22 21:08 prattmic

I've been making slow progress on this. The most notable finding is that this reproduces when running only TestSetgid and TestSetgidStress, while it does not reproduce when running only various other tests I've tried. (I haven't tried each test individually, as there are dozens and the repro time is a bit high.) So this may be related to setgid, or just signals in general.

prattmic avatar Aug 19 '22 19:08 prattmic

It looks like the problem is that signal 34 (SIGRT_2) used by musl for setgid is not getting SA_ONSTACK set.

If I'm interpreting strace correctly, it looks like this signal is still SIG_DFL when Go queries (it would set SA_ONSTACK if a handler was already installed):

1184993 rt_sigaction(SIGRT_2, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0

Only later does musl install the signal handler:

1184993 rt_sigaction(SIGRT_2, {sa_handler=0x7f29efb24078, sa_mask=~[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f29efb15c9f},  <unfinished ...>
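
For anyone scanning a longer strace log for the same pattern, here is a minimal sketch that checks whether the sa_flags field on an rt_sigaction line includes SA_ONSTACK. The helper name hasOnStack and the regular expression are illustrative assumptions; it only understands the symbolic flag names in the form strace prints above.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// saFlagsRE captures the sa_flags field of a struct sigaction as printed
// by strace, e.g. "sa_flags=SA_RESTORER|SA_RESTART" or "sa_flags=0".
var saFlagsRE = regexp.MustCompile(`sa_flags=([A-Z_|0-9x]+)`)

// hasOnStack reports whether the first sa_flags field on the line lists
// SA_ONSTACK. ok is false when the line has no sa_flags field at all.
func hasOnStack(line string) (onStack, ok bool) {
	m := saFlagsRE.FindStringSubmatch(line)
	if m == nil {
		return false, false
	}
	for _, f := range strings.Split(m[1], "|") {
		if f == "SA_ONSTACK" {
			return true, true
		}
	}
	return false, true
}

func main() {
	// The two rt_sigaction lines quoted in this comment.
	lines := []string{
		`rt_sigaction(SIGRT_2, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0`,
		`rt_sigaction(SIGRT_2, {sa_handler=0x7f29efb24078, sa_mask=~[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f29efb15c9f},  <unfinished ...>`,
	}
	for _, l := range lines {
		onStack, ok := hasOnStack(l)
		fmt.Printf("parsed=%v SA_ONSTACK=%v\n", ok, onStack)
	}
}
```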

prattmic avatar Aug 19 '22 20:08 prattmic

musl does not install the SIGRT_2 handler until the first call to __synccall (actually, it reinstalls it on every call), which is why we don't know to add SA_ONSTACK at startup.

__synccall is used by setgroups, setrlimit, and setxid.

prattmic avatar Aug 19 '22 20:08 prattmic

reinstalls it on every call

(!)

That means even if we set SA_ONSTACK for their handler, they will reinstall and overwrite it?

cherrymui avatar Aug 19 '22 20:08 cherrymui

https://git.musl-libc.org/cgit/musl/tree/src/thread/synccall.c#n102

Does it mean that they remove the handler at exit of the call? Hm....

cherrymui avatar Aug 19 '22 20:08 cherrymui

Correct, they don't even try to match the existing flags or forward to an existing handler, so we can't install a dummy SA_ONSTACK handler.

Does it mean that they remove the handler at exit of the call?

Yes, that is what I see:

1184993 rt_sigaction(SIGRT_2, {sa_handler=0x7f29efb24078, sa_mask=~[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f29efb15c9f},  <unfinished ...>
1184993 tkill(1184998, SIGRT_2)         = 0
1184998 --- SIGRT_2 {si_signo=SIGRT_2, si_code=SI_TKILL, si_pid=1184993, si_uid=0} ---
1184993 tkill(1184996, SIGRT_2 <unfinished ...>
1184996 --- SIGRT_2 {si_signo=SIGRT_2, si_code=SI_TKILL, si_pid=1184993, si_uid=0} ---
1184993 tkill(1184995, SIGRT_2 <unfinished ...>
1184995 --- SIGRT_2 {si_signo=SIGRT_2, si_code=SI_TKILL, si_pid=1184993, si_uid=0} ---
1184993 tkill(1184997, SIGRT_2 <unfinished ...>
1184997 --- SIGRT_2 {si_signo=SIGRT_2, si_code=SI_TKILL, si_pid=1184993, si_uid=0} ---
1184993 tkill(1184994, SIGRT_2 <unfinished ...>
1184994 --- SIGRT_2 {si_signo=SIGRT_2, si_code=SI_TKILL, si_pid=1184993, si_uid=0} ---
1184993 rt_sigaction(SIGRT_2, {sa_handler=SIG_IGN, sa_mask=~[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f29efb15c9f},  <unfinished ...>

prattmic avatar Aug 19 '22 20:08 prattmic

To summarize:

  • Go uses small goroutine stacks, so there is no guarantee that there is enough space on the stack for signal context and frame at all times.
  • To handle this, Go creates a separate signal stack for each thread installed with sigaltstack. All signal handlers must set SA_ONSTACK to use the signal stack and avoid smashing the goroutine stack.
  • To try to cooperate with libc, at startup Go inspects all signal handlers (even ones it doesn't care to handle), and adds SA_ONSTACK if it is not already set.
  • musl uses signal 34 for the various setxid calls, but does not install the handler at startup. Instead, it is temporarily installed on each call to the setxid functions (in __synccall).
  • As a result, Go never has a chance to add SA_ONSTACK.

I don't see how we can work around this in Go given that we can't adjust the signal handler flags, nor does __synccall respect flags from an existing signal handler. We would have to make goroutine stacks much larger, which would be a significant increase in stack allocations.

There are several changes on the musl side that could address this:

  • musl could install the signal 34 handler once at startup so that Go can adjust the flag.
  • Or, __synccall could query for an existing signal handler, and if it has SA_ONSTACK then keep that flag for their handler. In this case, Go would install a dummy signal 34 handler at startup just to expose SA_ONSTACK.
  • Or, even simpler, according to man 2 sigaction's SA_ONSTACK description: "If an alternate stack is not available, the default stack will be used." If this is accurate (I haven't verified), then __synccall could set SA_ONSTACK unconditionally, which would normally make no difference, but would use Go's sigaltstack when linked with Go.
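
A minimal sketch of how the second and third options would play out, assuming the man page statement quoted above is accurate (it was not verified in this thread). Both functions are hypothetical models of the described behavior, not kernel or musl code.

```go
package main

import "fmt"

const (
	saOnStack = 0x08000000 // SA_ONSTACK on Linux x86-64
	saRestart = 0x10000000 // SA_RESTART on Linux x86-64
)

// deliveryStack models the quoted sigaction(2) rule: the alternate stack
// is used only when SA_ONSTACK is set and the thread has registered one
// with sigaltstack; otherwise the handler runs on the current stack.
func deliveryStack(flags uint32, hasAltStack bool) string {
	if flags&saOnStack != 0 && hasAltStack {
		return "alternate signal stack"
	}
	return "current thread stack"
}

// synccallFlags models the second option: keep the handler's own flags,
// but inherit SA_ONSTACK from whatever handler was installed before.
func synccallFlags(own, existing uint32) uint32 {
	return own | (existing & saOnStack)
}

func main() {
	// Third option: SA_ONSTACK set unconditionally.
	fmt.Println("plain C program:", deliveryStack(saRestart|saOnStack, false))
	fmt.Println("linked with Go: ", deliveryStack(saRestart|saOnStack, true))

	// Second option: Go pre-installs a dummy handler that has SA_ONSTACK.
	inherited := synccallFlags(saRestart, saOnStack)
	fmt.Println("inherited flag: ", deliveryStack(inherited, true))

	// Behavior seen in the strace above: no SA_ONSTACK at all.
	fmt.Println("as observed:    ", deliveryStack(saRestart, true))
}
```

In a Go program the "current thread stack" case is the one that matters, since the thread may be running on a small goroutine stack when the signal arrives.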

prattmic avatar Aug 19 '22 20:08 prattmic

Ah, it turns out this is a duplicate of #39857, which has been discussed at some length but not resolved.

prattmic avatar Aug 19 '22 21:08 prattmic