Fix the error of runc doesn't work with go1.22
As the description in #4233, there is a bug in glibc, pthread_self()
will return wrong info after we do clone(CLONE_PARENT) in libct/nsenter,
it will cause runc can't work in go 1.22.*. So we use fork(2) to replace
clone(2) in libct/nsenter, but there is a double-fork in nsenter, so we
need to use PR_SET_CHILD_SUBREAPER to let runc can reap grand child
process in libct/nsenter.
Fix #4233
go 1.22.0 error msg:
DEBU[0000]libcontainer/dmz/cloned_binary_linux.go:202 libcontainer/dmz.IsCloned() F_GET_SEALS on /proc/self/exe failed: invalid argument
DEBU[0000]libcontainer/dmz/cloned_binary_linux.go:177 libcontainer/dmz.CloneBinary() cloning runc-dmz binary (8736 bytes)
DEBU[0000]libcontainer/container_linux.go:537 libcontainer.(*Container).newParentProcess() runc-dmz: using runc-dmz
DEBU[0000] nsexec[42599]: => nsexec container setup
DEBU[0000] nsexec-0[42599]: ~> nsexec stage-0
DEBU[0000] nsexec-0[42599]: spawn stage-1
DEBU[0000] nsexec-0[42599]: -> stage-1 synchronisation loop
DEBU[0000] nsexec-1[42600]: ~> nsexec stage-1
DEBU[0000] nsexec-1[42600]: unshare remaining namespaces
DEBU[0000] nsexec-1[42600]: spawn stage-2
DEBU[0000] nsexec-1[42600]: request stage-0 to forward stage-2 pid (42601)
DEBU[0000] nsexec-0[42599]: stage-1 requested pid to be forwarded
DEBU[0000] nsexec-0[42599]: forward stage-1 (42600) and stage-2 (42601) pids to runc
DEBU[0000] nsexec-2[1]: ~> nsexec stage-2
DEBU[0000] nsexec-1[42600]: signal completion to stage-0
DEBU[0000] nsexec-1[42600]: <~ nsexec stage-1
DEBU[0000]libcontainer/process_linux.go:457 libcontainer.(*initProcess).goCreateMountSources.func1() mount source thread: successfully running in container mntns
DEBU[0000] nsexec-0[42599]: stage-1 complete
DEBU[0000] nsexec-0[42599]: <- stage-1 synchronisation loop
DEBU[0000] nsexec-0[42599]: -> stage-2 synchronisation loop
DEBU[0000] nsexec-0[42599]: signalling stage-2 to run
DEBU[0000] nsexec-2[1]: signal completion to stage-0
DEBU[0000] nsexec-2[1]: <= nsexec container setup
DEBU[0000] nsexec-2[1]: booting up go runtime ...
DEBU[0000] nsexec-0[42599]: stage-2 complete
DEBU[0000] nsexec-0[42599]: <- stage-2 synchronisation loop
DEBU[0000] nsexec-0[42599]: <~ nsexec stage-0
DEBU[0000]libcontainer/sync.go:127 libcontainer.doReadSync() reading sync
DEBU[0000] sync pipe closed
DEBU[0000] mount source thread: closing thread: context canceled
ERRO[0000] runc run failed: unable to start container process: error during container init: procReady not received
As go has released v1.22.0, so there is no 1.20.x in
https://go.dev/dl/?mode=jsonanymore.
This can be fixed by adding &include=all (i.e. use https://go.dev/dl/?mode=json&include=all). I'll open a PR.
Interestingly, both runc 1.1.12 and runc from git HEAD built with go1.22.0 work fine on my machine (all tests are passing).
As go has released v1.22.0, so there is no 1.20.x in
https://go.dev/dl/?mode=jsonanymore.This can be fixed by adding
&include=all(i.e. use https://go.dev/dl/?mode=json&include=all). I'll open a PR.
Sorry, I was wrong about it. We should just switch to 1.21 in .cirrus.yml which is what you do.
We also need to fix this for Go 1.22
# (in test file tests/integration/spec.bats, line 37)
# `GO111MODULE=auto go get github.com/xeipuuv/gojsonschema' failed
# runc spec (status=0):
#
# Cloning into 'runtime-spec'...
# HEAD is now at 4fec88f merge #1219 into main
# go: go.mod file not found in current directory or any parent directory.
# 'go get' is no longer supported outside a module.
# To build and install a command, use 'go install' with a version,
# like 'go install example.com/cmd@latest'
# For more information, see https://golang.org/doc/go-get-install-deprecation
# or run 'go help get' or 'go help install'.
I don't remember why I haven't switched to go install, guess it's not as easy as it seems.
Interestingly, both runc 1.1.12 and runc from git HEAD built with go1.22.0 work fine on my machine (all tests are passing).
It seems that cgo may be broken with clone(2) in go1.22.0? https://github.com/golang/go/issues/65625 PTAL
Interestingly, both runc 1.1.12 and runc from git HEAD built with go1.22.0 work fine on my machine (all tests are passing).
It seems that cgo may be broken with clone(2) in go1.22.0? golang/go#65625 PTAL
Again, I can't repro locally.
[kir@kir-tp1 cgoclone2]$ go version
go version go1.21.6 linux/amd64
[kir@kir-tp1 cgoclone2]$ go run main.go
STAGE_PARENT
STAGE_CHILD
STAGE_INIT
This from nsexec
From main!
[kir@kir-tp1 cgoclone2]$ go1.22.0 version
go version go1.22.0 linux/amd64
[kir@kir-tp1 cgoclone2]$ go1.22.0 run main.go
STAGE_PARENT
STAGE_CHILD
STAGE_INIT
This from nsexec
From main!
Maybe it's your kernel version @lifubang? Can you show uname -a?
@lifubang also if you can repro that (alas I can not), you can git bisect golang between 1.21.0 and 1.22.0.
Note in CI it happens with Ubuntu 20.04 but not Ubuntu 22.04. Will try to repro in a VM.
On Ubuntu 20.04, when running the binary compiled with go 1.22, I am seeing a SIGSEGV:
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xf8} ---
Can't yet figure out what's going on there; will continue tomorrow.
@lifubang I did a bisect, here are the results: https://github.com/golang/go/issues/65625#issuecomment-1935500218
Will continue tomorrow.
It seems that cgo may be broken with clone(2) in go1.22.0? golang/go#65625
So, to summarize the investigation done there -- it's a glibc bug, in fact, two bugs:
pthread_self()returns wrong info after we do what we do in libct/nsenterpthread_getattr_np(pthread_self(), &attr)(which Go 1.22 calls internally) does a NULL pointer dereference, so the app gets SIGABRT.
These two bugs are apparently specific to glibc used by Ubuntu 20.04 (libc6 2.31-0ubuntu9.14) and maybe also Debian 10 (libc6 2.28-10+deb10u2), as I was able to reproduce on both. With Debian 10, it even prints error from free: free(): invalid pointer, maybe due to some extra Debian-specific patches, but still gets SIGABRT.
For some reason I was unable to repro on older Fedora (F32, glibc-2.31-2.fc32, F33, glibc-2.32-10.fc33) and Debian 11 (libc6 2.31-7).
The bad news is, every version of glibc has the bug 1 above, and https://go-review.googlesource.com/c/go/+/563379 may make it so go 1.22.x will fail runc init on every version of glibc.
Meaning, we need a workaround for that. Perhaps changing runc libct/nsenter logic in some radical way, so that pthread_self works.
Meaning, we need a workaround for that. Perhaps changing runc libct/nsenter logic in some radical way, so that pthread_self works.
👍
Rebasing this to re-run with Go 1.22.1
Sorry @lifubang I've high-jacked your PR, needed to run it with Go 1.22.1 and added missing changes to go.sum to fix failing CI (https://github.com/opencontainers/runc/actions/runs/8460901105/job/23179867537)
OK, Go 1.22.1 makes no difference. I guess we have to disable Go 1.22 for now.
I think we should open an new stack for the child process in clone(2). Please refer to https://man7.org/linux/man-pages/man2/clone.2.html
The stack argument specifies the location of the stack used by
the child process. Since the child and calling process may share
memory, it is not possible for the child process to execute in
the same stack as the calling process. The calling process must
therefore set up memory space for the child stack and pass a
pointer to this space to clone(). Stacks grow downward on all
processors that run Linux (except the HP PA processors), so stack
usually points to the topmost address of the memory space set up
for the child stack. Note that clone() does not provide a means
whereby the caller can inform the kernel of the size of the stack
area.
Could you give some suggestions for the failure CI? Thanks.
The special reason of failure of rootless(with idmap) is that we forked a child process and execve /usr/bin/newuidmapin the parent process.
So, we have resolved the issue of clone(CLONE_PARENT), but if add a fork, it still fail.
Everything is green now.
Another fork and PR_SET_CHILD_SUBREAPER solution, welcome suggestions.
Use fork(2) to replace clone(CLONE_PARENT) will cause runc more slower, I do a test in my machine, to start 100 containers, the total used time becomes 4.135s from 3.987s, which is increased by 148 ms, so it will increase 1.48 ms to start one container.
I don't see how switching to
forkwill fix the actual issue. The problem is thatpthread_self()will return stale data after aclone(2)-- can you point to the code in glibc which ensures that the data is valid?
It seems that there is no error for fork.
#define _GNU_SOURCE
#include <endian.h>
#include <errno.h>
#include <fcntl.h>
#include <grp.h>
#include <limits.h>
#include <pthread.h>
#include <sched.h>
#include <setjmp.h>
#include <signal.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <linux/limits.h>
#include <linux/netlink.h>
#include <linux/types.h>
#include <sched.h>
#define STAGE_SETUP -1
#define STAGE_PARENT 0
#define STAGE_CHILD 1
#define STAGE_INIT 2
int current_stage = STAGE_SETUP;
struct clone_t {
char stack[4096] __attribute__((aligned(16)));
char stack_ptr[0];
jmp_buf *env;
int jmpval;
};
static int child_func(void *arg) __attribute__((noinline));
static int child_func(void *arg)
{
struct clone_t *ca = (struct clone_t *)arg;
longjmp(*ca->env, ca->jmpval);
}
static int clone_parent(jmp_buf *env, int jmpval) __attribute__((noinline));
static int clone_parent(jmp_buf *env, int jmpval)
{
struct clone_t ca = {
.env = env,
.jmpval = jmpval,
};
pid_t pid = fork();
if (pid == 0)
return child_func(&ca);
return pid;
}
static void nsexec() {
jmp_buf env;
current_stage = setjmp(env);
switch (current_stage) {
case STAGE_PARENT: {
printf("STAGE_PARENT\n");
clone_parent(&env, STAGE_CHILD);
exit(0);
}
break;
case STAGE_CHILD: {
printf("STAGE_CHILD\n");
clone_parent(&env, STAGE_INIT);
exit(0);
}
break;
case STAGE_INIT: {
printf("STAGE_INIT\n");
}
break;
}
printf("This from nsexec\n");
return;
}
void __attribute__((constructor)) init(void) {
nsexec();
pthread_attr_t attr;
int ret = pthread_getattr_np(pthread_self(), &attr);
if (ret != 0) {
printf("pthread_getattr_np: %s\n", strerror(ret));
/* Try to destroy attr anyway. Bad idea, because getattr fails, but this is what Go does. */
pthread_attr_destroy(&attr);
abort();
}
}
int main(void) {
printf("Hello from main!\n");
}
Does this patch fix the C version of the buggy program?
Yes, please see the code above.
It seems that there is no error for
fork.https://public-inbox.org/libc-alpha/[email protected]/
For those reading at home, this is https://github.com/bminor/glibc/commit/c579f48edba88380635ab98cb612030e3ed8691e. So, they removed the PID cache entirely and removed their explicit TID storage for clone.
After looking into it a bit more, the reason fork works is that glibc's fork implementation actually uses clone with CLONE_CHILD_SETTID so that the kernel will update the TLS value of the tid. See arch_fork and __tls_init_tp. (pthread-controlled threads are a little more complicated because they use CLONE_SETTLS.)
So, the issue is not with CLONE_PARENT, nor with clone(2) itself. The issue is that glibc only bothers to fill the tid field of pthread_self() if you are using fork(2). It seems strange they kept the TID cache -- if they had removed that too then clone(2) would still work...
There is a way for us to keep using clone(2) but it's quite nasty -- prctl(PR_GET_TID_ADDRESS) lets us get access to &THREAD_SELF->tid, which would let us keep using clone(2) with CLONE_CHILD_GETTID. Unfortunately that requires CONFIG_CHECKPOINT_RESTORE and is also quite ugly (and possibly fragile -- if glibc doesn't need CLONE_CHILD_SETTID anymore, this will break)...
I guess fork(2) is nicer than PR_GET_TID_ADDRESS. However it should be noted that glibc doesn't have to allow any pthread_* code to work after a fork(2) or clone(2). It's possible for a future glibc version to completely break some other part of runc init, because we are not following the rules in signal-safety(7). Just because fork(2) happens to work doesn't mean we solved the problem completely. The actual fix (as discussed in #4233) is to use execve to make sure we don't do anything to interfere with Go.
It seems that there is no error for fork.
For reference, I think the following test is better for checking the root cause of the issue (is the cached TID wrong?).
typedef struct pthread {
/* Based on output from gdb -- this might change on different machines. */
char __padding[720];
pid_t tid;
/* ... snip ... */
} pthread;
void __attribute__((constructor)) init(void)
{
nsexec();
pthread_t self = pthread_self();
pthread *THREAD_SELF = (pthread *)self;
printf("cached tid: %d ; actual tid: %d\n", THREAD_SELF->tid, gettid());
pthread_attr_t attr;
int ret = pthread_getattr_np(self, &attr);
if (ret != 0) {
printf("pthread_getattr_np: %s\n", strerror(ret));
/* Try to destroy attr anyway. Bad idea, because getattr fails, but this is what Go does. */
pthread_attr_destroy(&attr);
abort();
}
}
It seems like doing it with
fork(2)is a better idea for the moment.
👍
I still think we need to switch to
execveto make sure this is safe though...
But maybe this will cause the issues like runc-dmz if we use execve in stage-2?
But maybe this will cause the issues like runc-dmz if we use execve in stage-2?
We still have full capabilities at the beginning of stage-2 (both with and without user namespaces) and haven't applied any LSM labels or anything like that. I wouldn't expect there to be any issues.
I still think we need to switch to
execveto make sure this is safe though...
I'm wondering, if doing execve of /proc/self/exec init-go or something will also serve as a replacement for runc-dmz (and, will it hit the same issues as runc-dmz, e.g. wrt selinux)?
Do you think moving the c code in stage-0 and stage-2 to golang code could fix this issue or not? I don't know whether the setjmp and longjmp could work in go or not, and it can using together with C or not?
Does it worth to try?
and it can using together with C or not?
I think maybe no, because after clone(2), it has already in go routine, it can’t longjmp to C.
So, it seems that there is no other way to fix this issue?
This patch also works, while still allowing us to use CLONE_PARENT. Yes, I'm sure we agree it's not lovely, but IMHO using fork() is depending on glibc internals just as much as this is. If glibc stops using CLONE_CHILD_CLEARTID then fork() will also stop working. The only downside of this approach is that it only works with CONFIG_CHECKPOINT_RESTORE=y but I suspect most people running with containers have that enabled.
diff --git a/libcontainer/nsenter/nsexec.c b/libcontainer/nsenter/nsexec.c
index c771ac7e1165..319899bd9b71 100644
--- a/libcontainer/nsenter/nsexec.c
+++ b/libcontainer/nsenter/nsexec.c
@@ -15,6 +15,7 @@
#include <stdbool.h>
#include <string.h>
#include <unistd.h>
+#include <pthread.h> /* _only_ used for pthread_self() in debug log */
#include <sys/ioctl.h>
#include <sys/prctl.h>
@@ -319,7 +320,41 @@ static int clone_parent(jmp_buf *env, int jmpval)
.jmpval = jmpval,
};
- return clone(child_func, ca.stack_ptr, CLONE_PARENT | SIGCHLD, &ca);
+ /*
+ * Since glibc 2.25 (see c579f48edba88380635ab98cb612030e3ed8691e),
+ * glibc no longer updates the TLS state containing the current process
+ * tid after clone(2). This results in stale TIDs being used when Go
+ * 1.22 and later call pthread_gettattr_np(pthread_self()), resulting
+ * in crashes on ancient glibcs and errors on newer glibcs.
+ *
+ * Luckily, because the same address is used for CLONE_PARENT_SETTID,
+ * we can poke around in glibc's internal cache by getting the address
+ * using PR_GET_TID_ADDRESS (only available in Linux >= 3.5, with
+ * CONFIG_CHECKPOINT_RESTORE=y) and then overwriting it with
+ * CLONE_CHILD_SETTID. CLONE_CHILD_CLEARTID is set to allow descendant
+ * PR_GET_TID_ADDRESS calls to work, as well as matching what glibc
+ * does in arch_fork().
+ *
+ * Yes, this is pretty horrific, but the core issue here is that we
+ * need to run Go code ("runc init") in the child after fork(), which
+ * is not allowed by glibc (see signal-safety(7)). We cannot exec to
+ * solve the problem because we are in a security critical situation
+ * here, and doing an exec would allow for container escapes (obvious
+ * issues include that the shared libraries loaded from a re-exec would
+ * come from the container, and doing an exec here would clear the bit
+ * that makes non-dumpable flags effective for userns containers with
+ * CAP_SYS_PTRACE).
+ */
+ pid_t *tid_addr = NULL;
+ if (prctl(PR_GET_TID_ADDRESS, &tid_addr) < 0)
+ /* what should we do here... */;
+ write_log(DEBUG, "nsenter clone: get_tid_address gave us %p (pthread_self=%p)", tid_addr, (void *) pthread_self());
+ if (!tid_addr || *tid_addr != gettid())
+ write_log(WARNING, "nsenter clone: glibc private tid address is wrong: *%p %d != gettid() %d", tid_addr, tid_addr ? *tid_addr : -1, gettid());
+
+ return clone(child_func, ca.stack_ptr,
+ CLONE_PARENT | CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD, &ca,
+ NULL /* parent_tid */ , NULL /* tls */ , tid_addr);
}
/* Returns the clone(2) flag for a namespace, given the name of a namespace. */
@kolyshkin wdyt?