runc
runc copied to clipboard
runc doesn't work with go1.22
I'm opening this to act as a tracking issue for the go1.22 issue.
I suspect the only bullet-proof solution is going to be adding another re-exec after the C code finishes to re-exec the Go side of runc init (which will make runc even slower...). It really is quite unfortunate that Go's stdlib doesn't provide a lot of knobs we need to remove nsenter, and I suspect we will never be able to full switch away from nsenter...
It seems that cgo may be broken with clone(2) in go1.22.0? golang/go#65625
So, to summarize the investigation done there -- it's a glibc bug, in fact, two bugs:
pthread_self()returns wrong info after we do what we do in libct/nsenterpthread_getattr_np(pthread_self(), &attr)(which Go 1.22 calls internally) does a NULL pointer dereference, so the app gets SIGABRT.
These two bugs are apparently specific to glibc used by Ubuntu 20.04 (libc6 2.31-0ubuntu9.14) and maybe also Debian 10 (libc6 2.28-10+deb10u2), as I was able to reproduce on both. With Debian 10, it even prints error from free: free(): invalid pointer, maybe due to some extra Debian-specific patches, but still gets SIGABRT.
For some reason I was unable to repro on older Fedora (F32, glibc-2.31-2.fc32, F33, glibc-2.32-10.fc33) and Debian 11 (libc6 2.31-7).
The bad news is, every version of glibc has the bug 1 above, and https://go-review.googlesource.com/c/go/+/563379 may make it so go 1.22.x will fail runc init on every version of glibc.
Meaning, we need a workaround for that. Perhaps changing runc libct/nsenter logic in some radical way, so that pthread_self works.
Originally posted by @kolyshkin in https://github.com/opencontainers/runc/issues/4193#issuecomment-1947700151
Copying my comment from https://github.com/opencontainers/runc/pull/4234#issuecomment-2029874782, which explains my current thinking on the issue.
I think the issue is more likely related to the thread-local storage used by pthreads. From golang/go#65625 it seems that the old TID is stored in the TLS and glibc then tries to do
sched_getaffinityon that TID -- this just so happens to work in runc's case because the parent process is still alive and so pthreads gets the wrong process's information (oops). I'm a little surprised this is the case --gettid()works correctly afterfork()and while all of thepthread_*stuff is theoretically unsafe to call afterfork()it's very weird that they would allow thepthread_*stuff to get outdated if they already put the work into makinggettid()work (not to mention IIRCgettid()is cached in thread-local storage). Also the debugging work in golang/go#65625 indicated that it was an issue withCLONE_PARENTin particular, which makes less sense -- if the problem is outdated TLS data, I would expect the issue to occur regardless of what the parent PID is.Ultimately, our usage of
cloneis technically not "safe" from a UB perspective because we aren't meant to call any async-unsafe code in a child after a fork. I suspect this is the core issue.
Opened this: https://go-review.googlesource.com/c/go/+/587919, fingers crossed.
Made a proper cherry-pick to Go 1.22 (https://github.com/golang/go/issues/67650). If approved, this will be included into Go 1.22.4.
Great news, so runc 1.2.0 is ready for prime time once Go 1.22.4 is released?
Go 1.23 includes a fix (https://go.dev/cl/587919) so it can be used. This fix is also backported to Go 1.22.4. If top-level Makefile sees Go 1.22.x where x >= 4, it adds -DGO1_22_WORKS to GO_CPPFLAGS.
Fixed in:
- #4292