runc icon indicating copy to clipboard operation
runc copied to clipboard

runc doesn't work with go1.22

Open cyphar opened this issue 1 year ago • 1 comments
trafficstars

I'm opening this to act as a tracking issue for the go1.22 issue.

I suspect the only bullet-proof solution is going to be adding another re-exec after the C code finishes to re-exec the Go side of runc init (which will make runc even slower...). It really is quite unfortunate that Go's stdlib doesn't provide a lot of knobs we need to remove nsenter, and I suspect we will never be able to full switch away from nsenter...


It seems that cgo may be broken with clone(2) in go1.22.0? golang/go#65625

So, to summarize the investigation done there -- it's a glibc bug, in fact, two bugs:

  1. pthread_self() returns wrong info after we do what we do in libct/nsenter
  2. pthread_getattr_np(pthread_self(), &attr) (which Go 1.22 calls internally) does a NULL pointer dereference, so the app gets SIGABRT.

These two bugs are apparently specific to glibc used by Ubuntu 20.04 (libc6 2.31-0ubuntu9.14) and maybe also Debian 10 (libc6 2.28-10+deb10u2), as I was able to reproduce on both. With Debian 10, it even prints error from free: free(): invalid pointer, maybe due to some extra Debian-specific patches, but still gets SIGABRT.

For some reason I was unable to repro on older Fedora (F32, glibc-2.31-2.fc32, F33, glibc-2.32-10.fc33) and Debian 11 (libc6 2.31-7).

The bad news is, every version of glibc has the bug 1 above, and https://go-review.googlesource.com/c/go/+/563379 may make it so go 1.22.x will fail runc init on every version of glibc.

Meaning, we need a workaround for that. Perhaps changing runc libct/nsenter logic in some radical way, so that pthread_self works.

Originally posted by @kolyshkin in https://github.com/opencontainers/runc/issues/4193#issuecomment-1947700151

cyphar avatar Mar 29 '24 01:03 cyphar

Copying my comment from https://github.com/opencontainers/runc/pull/4234#issuecomment-2029874782, which explains my current thinking on the issue.

I think the issue is more likely related to the thread-local storage used by pthreads. From golang/go#65625 it seems that the old TID is stored in the TLS and glibc then tries to do sched_getaffinity on that TID -- this just so happens to work in runc's case because the parent process is still alive and so pthreads gets the wrong process's information (oops). I'm a little surprised this is the case -- gettid() works correctly after fork() and while all of the pthread_* stuff is theoretically unsafe to call after fork() it's very weird that they would allow the pthread_* stuff to get outdated if they already put the work into making gettid() work (not to mention IIRC gettid() is cached in thread-local storage). Also the debugging work in golang/go#65625 indicated that it was an issue with CLONE_PARENT in particular, which makes less sense -- if the problem is outdated TLS data, I would expect the issue to occur regardless of what the parent PID is.

Ultimately, our usage of clone is technically not "safe" from a UB perspective because we aren't meant to call any async-unsafe code in a child after a fork. I suspect this is the core issue.

cyphar avatar Apr 04 '24 01:04 cyphar

Opened this: https://go-review.googlesource.com/c/go/+/587919, fingers crossed.

kolyshkin avatar May 23 '24 21:05 kolyshkin

Made a proper cherry-pick to Go 1.22 (https://github.com/golang/go/issues/67650). If approved, this will be included into Go 1.22.4.

kolyshkin avatar May 25 '24 20:05 kolyshkin

Great news, so runc 1.2.0 is ready for prime time once Go 1.22.4 is released?

MaxXor avatar Jun 05 '24 07:06 MaxXor

Go 1.23 includes a fix (https://go.dev/cl/587919) so it can be used. This fix is also backported to Go 1.22.4. If top-level Makefile sees Go 1.22.x where x >= 4, it adds -DGO1_22_WORKS to GO_CPPFLAGS.

kolyshkin avatar Jun 07 '24 16:06 kolyshkin

Fixed in:

  • #4292

AkihiroSuda avatar Jun 08 '24 09:06 AkihiroSuda