runc Fix the error of runc doesn't work with go1.22

As the description in #4233, there is a bug in glibc, pthread_self() will return wrong info after we do clone(CLONE_PARENT) in libct/nsenter, it will cause runc can't work in go 1.22.*. So we use fork(2) to replace clone(2) in libct/nsenter, but there is a double-fork in nsenter, so we need to use PR_SET_CHILD_SUBREAPER to let runc can reap grand child process in libct/nsenter.

Fix #4233

Feb 07 '24 15:02 lifubang

go 1.22.0 error msg:

DEBU[0000]libcontainer/dmz/cloned_binary_linux.go:202 libcontainer/dmz.IsCloned() F_GET_SEALS on /proc/self/exe failed: invalid argument 
DEBU[0000]libcontainer/dmz/cloned_binary_linux.go:177 libcontainer/dmz.CloneBinary() cloning runc-dmz binary (8736 bytes)         
DEBU[0000]libcontainer/container_linux.go:537 libcontainer.(*Container).newParentProcess() runc-dmz: using runc-dmz                     
DEBU[0000] nsexec[42599]: => nsexec container setup     
DEBU[0000] nsexec-0[42599]: ~> nsexec stage-0           
DEBU[0000] nsexec-0[42599]: spawn stage-1               
DEBU[0000] nsexec-0[42599]: -> stage-1 synchronisation loop 
DEBU[0000] nsexec-1[42600]: ~> nsexec stage-1           
DEBU[0000] nsexec-1[42600]: unshare remaining namespaces 
DEBU[0000] nsexec-1[42600]: spawn stage-2               
DEBU[0000] nsexec-1[42600]: request stage-0 to forward stage-2 pid (42601) 
DEBU[0000] nsexec-0[42599]: stage-1 requested pid to be forwarded 
DEBU[0000] nsexec-0[42599]: forward stage-1 (42600) and stage-2 (42601) pids to runc 
DEBU[0000] nsexec-2[1]: ~> nsexec stage-2               
DEBU[0000] nsexec-1[42600]: signal completion to stage-0 
DEBU[0000] nsexec-1[42600]: <~ nsexec stage-1           
DEBU[0000]libcontainer/process_linux.go:457 libcontainer.(*initProcess).goCreateMountSources.func1() mount source thread: successfully running in container mntns 
DEBU[0000] nsexec-0[42599]: stage-1 complete            
DEBU[0000] nsexec-0[42599]: <- stage-1 synchronisation loop 
DEBU[0000] nsexec-0[42599]: -> stage-2 synchronisation loop 
DEBU[0000] nsexec-0[42599]: signalling stage-2 to run   
DEBU[0000] nsexec-2[1]: signal completion to stage-0    
DEBU[0000] nsexec-2[1]: <= nsexec container setup       
DEBU[0000] nsexec-2[1]: booting up go runtime ...       
DEBU[0000] nsexec-0[42599]: stage-2 complete            
DEBU[0000] nsexec-0[42599]: <- stage-2 synchronisation loop 
DEBU[0000] nsexec-0[42599]: <~ nsexec stage-0           
DEBU[0000]libcontainer/sync.go:127 libcontainer.doReadSync() reading sync                                 
DEBU[0000] sync pipe closed                             
DEBU[0000] mount source thread: closing thread: context canceled 
ERRO[0000] runc run failed: unable to start container process: error during container init: procReady not received

Feb 07 '24 16:02 lifubang

As go has released v1.22.0, so there is no 1.20.x in https://go.dev/dl/?mode=json anymore.

This can be fixed by adding &include=all (i.e. use https://go.dev/dl/?mode=json&include=all). I'll open a PR.

Feb 07 '24 21:02 kolyshkin

Interestingly, both runc 1.1.12 and runc from git HEAD built with go1.22.0 work fine on my machine (all tests are passing).

Feb 07 '24 21:02 kolyshkin

As go has released v1.22.0, so there is no 1.20.x in https://go.dev/dl/?mode=json anymore.

This can be fixed by adding &include=all (i.e. use https://go.dev/dl/?mode=json&include=all). I'll open a PR.

Sorry, I was wrong about it. We should just switch to 1.21 in .cirrus.yml which is what you do.

Feb 07 '24 21:02 kolyshkin

We also need to fix this for Go 1.22

# (in test file tests/integration/spec.bats, line 37)
#   `GO111MODULE=auto go get github.com/xeipuuv/gojsonschema' failed
# runc spec (status=0):
#
# Cloning into 'runtime-spec'...
# HEAD is now at 4fec88f merge #1219 into main
# go: go.mod file not found in current directory or any parent directory.
# 	'go get' is no longer supported outside a module.
# 	To build and install a command, use 'go install' with a version,
# 	like 'go install example.com/cmd@latest'
# 	For more information, see https://golang.org/doc/go-get-install-deprecation
# 	or run 'go help get' or 'go help install'.

I don't remember why I haven't switched to go install, guess it's not as easy as it seems.

Feb 07 '24 21:02 kolyshkin

Interestingly, both runc 1.1.12 and runc from git HEAD built with go1.22.0 work fine on my machine (all tests are passing).

It seems that cgo may be broken with clone(2) in go1.22.0? https://github.com/golang/go/issues/65625 PTAL

Feb 09 '24 05:02 lifubang

Interestingly, both runc 1.1.12 and runc from git HEAD built with go1.22.0 work fine on my machine (all tests are passing).

It seems that cgo may be broken with clone(2) in go1.22.0? golang/go#65625 PTAL

Again, I can't repro locally.

[kir@kir-tp1 cgoclone2]$ go version
go version go1.21.6 linux/amd64
[kir@kir-tp1 cgoclone2]$ go run main.go 
STAGE_PARENT
STAGE_CHILD
STAGE_INIT
This from nsexec
From main!
[kir@kir-tp1 cgoclone2]$ go1.22.0 version
go version go1.22.0 linux/amd64
[kir@kir-tp1 cgoclone2]$ go1.22.0 run main.go 
STAGE_PARENT
STAGE_CHILD
STAGE_INIT
This from nsexec
From main!

Maybe it's your kernel version @lifubang? Can you show uname -a?

Feb 09 '24 06:02 kolyshkin

@lifubang also if you can repro that (alas I can not), you can git bisect golang between 1.21.0 and 1.22.0.

Feb 09 '24 06:02 kolyshkin

Note in CI it happens with Ubuntu 20.04 but not Ubuntu 22.04. Will try to repro in a VM.

Feb 09 '24 06:02 kolyshkin

On Ubuntu 20.04, when running the binary compiled with go 1.22, I am seeing a SIGSEGV:

--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xf8} ---

Can't yet figure out what's going on there; will continue tomorrow.

Feb 09 '24 07:02 kolyshkin

@lifubang I did a bisect, here are the results: https://github.com/golang/go/issues/65625#issuecomment-1935500218

Will continue tomorrow.

Feb 09 '24 08:02 kolyshkin

It seems that cgo may be broken with clone(2) in go1.22.0? golang/go#65625

So, to summarize the investigation done there -- it's a glibc bug, in fact, two bugs:

pthread_self() returns wrong info after we do what we do in libct/nsenter
pthread_getattr_np(pthread_self(), &attr) (which Go 1.22 calls internally) does a NULL pointer dereference, so the app gets SIGABRT.

These two bugs are apparently specific to glibc used by Ubuntu 20.04 (libc6 2.31-0ubuntu9.14) and maybe also Debian 10 (libc6 2.28-10+deb10u2), as I was able to reproduce on both. With Debian 10, it even prints error from free: free(): invalid pointer, maybe due to some extra Debian-specific patches, but still gets SIGABRT.

For some reason I was unable to repro on older Fedora (F32, glibc-2.31-2.fc32, F33, glibc-2.32-10.fc33) and Debian 11 (libc6 2.31-7).

The bad news is, every version of glibc has the bug 1 above, and https://go-review.googlesource.com/c/go/+/563379 may make it so go 1.22.x will fail runc init on every version of glibc.

Meaning, we need a workaround for that. Perhaps changing runc libct/nsenter logic in some radical way, so that pthread_self works.

Feb 16 '24 03:02 kolyshkin

Meaning, we need a workaround for that. Perhaps changing runc libct/nsenter logic in some radical way, so that pthread_self works.

👍

Mar 04 '24 22:03 AkihiroSuda

Rebasing this to re-run with Go 1.22.1

Mar 28 '24 01:03 kolyshkin

Sorry @lifubang I've high-jacked your PR, needed to run it with Go 1.22.1 and added missing changes to go.sum to fix failing CI (https://github.com/opencontainers/runc/actions/runs/8460901105/job/23179867537)

Mar 28 '24 01:03 kolyshkin

OK, Go 1.22.1 makes no difference. I guess we have to disable Go 1.22 for now.

Mar 28 '24 01:03 kolyshkin

I think we should open an new stack for the child process in clone(2). Please refer to https://man7.org/linux/man-pages/man2/clone.2.html

       The stack argument specifies the location of the stack used by
       the child process.  Since the child and calling process may share
       memory, it is not possible for the child process to execute in
       the same stack as the calling process.  The calling process must
       therefore set up memory space for the child stack and pass a
       pointer to this space to clone().  Stacks grow downward on all
       processors that run Linux (except the HP PA processors), so stack
       usually points to the topmost address of the memory space set up
       for the child stack.  Note that clone() does not provide a means
       whereby the caller can inform the kernel of the size of the stack
       area.

Mar 30 '24 16:03 lifubang

Could you give some suggestions for the failure CI? Thanks.

Mar 30 '24 16:03 lifubang

The special reason of failure of rootless(with idmap) is that we forked a child process and execve /usr/bin/newuidmapin the parent process. So, we have resolved the issue of clone(CLONE_PARENT), but if add a fork, it still fail.

Mar 31 '24 09:03 lifubang

Everything is green now. Another fork and PR_SET_CHILD_SUBREAPER solution, welcome suggestions.

Apr 03 '24 17:04 lifubang

Use fork(2) to replace clone(CLONE_PARENT) will cause runc more slower, I do a test in my machine, to start 100 containers, the total used time becomes 4.135s from 3.987s, which is increased by 148 ms, so it will increase 1.48 ms to start one container.

Apr 04 '24 00:04 lifubang

I don't see how switching to fork will fix the actual issue. The problem is that pthread_self() will return stale data after a clone(2) -- can you point to the code in glibc which ensures that the data is valid?

It seems that there is no error for fork.

#define _GNU_SOURCE
#include <endian.h>
#include <errno.h>
#include <fcntl.h>
#include <grp.h>
#include <limits.h>
#include <pthread.h>
#include <sched.h>
#include <setjmp.h>
#include <signal.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

#include <sys/ioctl.h>
#include <sys/prctl.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>

#include <linux/limits.h>
#include <linux/netlink.h>
#include <linux/types.h>

#include <sched.h>

#define STAGE_SETUP  -1
#define STAGE_PARENT  0
#define STAGE_CHILD   1
#define STAGE_INIT    2

int current_stage = STAGE_SETUP;

struct clone_t {
        char stack[4096] __attribute__((aligned(16)));
        char stack_ptr[0];

        jmp_buf *env;
        int jmpval;
};

static int child_func(void *arg) __attribute__((noinline));
static int child_func(void *arg)
{
        struct clone_t *ca = (struct clone_t *)arg;
        longjmp(*ca->env, ca->jmpval);
}

static int clone_parent(jmp_buf *env, int jmpval) __attribute__((noinline));
static int clone_parent(jmp_buf *env, int jmpval)
{
        struct clone_t ca = {
                .env = env,
                .jmpval = jmpval,
        };

        pid_t pid = fork();
        if (pid == 0)
                return child_func(&ca);
        return pid;
}

static void nsexec() {
        jmp_buf env;
    current_stage = setjmp(env);
    switch (current_stage) {
                case STAGE_PARENT: {
                        printf("STAGE_PARENT\n");
                        clone_parent(&env, STAGE_CHILD);
                        exit(0);
                }
                break;
                case STAGE_CHILD: {
                        printf("STAGE_CHILD\n");
                        clone_parent(&env, STAGE_INIT);
                        exit(0);
                }
                break;
                case STAGE_INIT: {
                        printf("STAGE_INIT\n");
                }
                break;
        }

        printf("This from nsexec\n");
        return;
}

void __attribute__((constructor)) init(void) {
        nsexec();
        pthread_attr_t attr;
        int ret = pthread_getattr_np(pthread_self(), &attr);
        if (ret != 0) {
                printf("pthread_getattr_np: %s\n", strerror(ret));
                /* Try to destroy attr anyway. Bad idea, because getattr fails, but this is what Go does. */
                pthread_attr_destroy(&attr);
                abort();
        }
}

int main(void) {
        printf("Hello from main!\n");
}

Does this patch fix the C version of the buggy program?

Yes, please see the code above.

Apr 04 '24 02:04 lifubang

It seems that there is no error for fork.

https://public-inbox.org/libc-alpha/[email protected]/

Apr 04 '24 03:04 lifubang

It seems that there is no error for fork.

https://public-inbox.org/libc-alpha/[email protected]/

For those reading at home, this is https://github.com/bminor/glibc/commit/c579f48edba88380635ab98cb612030e3ed8691e. So, they removed the PID cache entirely and removed their explicit TID storage for clone.

After looking into it a bit more, the reason fork works is that glibc's fork implementation actually uses clone with CLONE_CHILD_SETTID so that the kernel will update the TLS value of the tid. See arch_fork and __tls_init_tp. (pthread-controlled threads are a little more complicated because they use CLONE_SETTLS.)

So, the issue is not with CLONE_PARENT, nor with clone(2) itself. The issue is that glibc only bothers to fill the tid field of pthread_self() if you are using fork(2). It seems strange they kept the TID cache -- if they had removed that too then clone(2) would still work...

There is a way for us to keep using clone(2) but it's quite nasty -- prctl(PR_GET_TID_ADDRESS) lets us get access to &THREAD_SELF->tid, which would let us keep using clone(2) with CLONE_CHILD_GETTID. Unfortunately that requires CONFIG_CHECKPOINT_RESTORE and is also quite ugly (and possibly fragile -- if glibc doesn't need CLONE_CHILD_SETTID anymore, this will break)...

I guess fork(2) is nicer than PR_GET_TID_ADDRESS. However it should be noted that glibc doesn't have to allow any pthread_* code to work after a fork(2) or clone(2). It's possible for a future glibc version to completely break some other part of runc init, because we are not following the rules in signal-safety(7). Just because fork(2) happens to work doesn't mean we solved the problem completely. The actual fix (as discussed in #4233) is to use execve to make sure we don't do anything to interfere with Go.

It seems that there is no error for fork.

For reference, I think the following test is better for checking the root cause of the issue (is the cached TID wrong?).

typedef struct pthread {
	/* Based on output from gdb -- this might change on different machines. */
	char __padding[720];

	pid_t tid;

	/* ... snip ... */
} pthread;

void __attribute__((constructor)) init(void)
{
		nsexec();
		pthread_t self = pthread_self();

		pthread *THREAD_SELF = (pthread *)self;
		printf("cached tid: %d ; actual tid: %d\n", THREAD_SELF->tid, gettid());

		pthread_attr_t attr;
		int ret = pthread_getattr_np(self, &attr);
		if (ret != 0) {
				printf("pthread_getattr_np: %s\n", strerror(ret));
				/* Try to destroy attr anyway. Bad idea, because getattr fails, but this is what Go does. */
				pthread_attr_destroy(&attr);
				abort();
		}
}

Apr 04 '24 06:04 cyphar

It seems like doing it with fork(2) is a better idea for the moment.

👍

I still think we need to switch to execve to make sure this is safe though...

But maybe this will cause the issues like runc-dmz if we use execve in stage-2?

Apr 04 '24 13:04 lifubang

But maybe this will cause the issues like runc-dmz if we use execve in stage-2?

We still have full capabilities at the beginning of stage-2 (both with and without user namespaces) and haven't applied any LSM labels or anything like that. I wouldn't expect there to be any issues.

Apr 05 '24 07:04 cyphar

I still think we need to switch to execve to make sure this is safe though...

I'm wondering, if doing execve of /proc/self/exec init-go or something will also serve as a replacement for runc-dmz (and, will it hit the same issues as runc-dmz, e.g. wrt selinux)?

Apr 05 '24 22:04 kolyshkin

Do you think moving the c code in stage-0 and stage-2 to golang code could fix this issue or not? I don't know whether the setjmp and longjmp could work in go or not, and it can using together with C or not? Does it worth to try?

Apr 11 '24 23:04 lifubang

and it can using together with C or not?

I think maybe no, because after clone(2), it has already in go routine, it can’t longjmp to C.

So, it seems that there is no other way to fix this issue?

Apr 11 '24 23:04 lifubang

This patch also works, while still allowing us to use CLONE_PARENT. Yes, I'm sure we agree it's not lovely, but IMHO using fork() is depending on glibc internals just as much as this is. If glibc stops using CLONE_CHILD_CLEARTID then fork() will also stop working. The only downside of this approach is that it only works with CONFIG_CHECKPOINT_RESTORE=y but I suspect most people running with containers have that enabled.

diff --git a/libcontainer/nsenter/nsexec.c b/libcontainer/nsenter/nsexec.c
index c771ac7e1165..319899bd9b71 100644
--- a/libcontainer/nsenter/nsexec.c
+++ b/libcontainer/nsenter/nsexec.c
@@ -15,6 +15,7 @@
 #include <stdbool.h>
 #include <string.h>
 #include <unistd.h>
+#include <pthread.h> /* _only_ used for pthread_self() in debug log */
 
 #include <sys/ioctl.h>
 #include <sys/prctl.h>
@@ -319,7 +320,41 @@ static int clone_parent(jmp_buf *env, int jmpval)
 		.jmpval = jmpval,
 	};
 
-	return clone(child_func, ca.stack_ptr, CLONE_PARENT | SIGCHLD, &ca);
+	/*
+	 * Since glibc 2.25 (see c579f48edba88380635ab98cb612030e3ed8691e),
+	 * glibc no longer updates the TLS state containing the current process
+	 * tid after clone(2). This results in stale TIDs being used when Go
+	 * 1.22 and later call pthread_gettattr_np(pthread_self()), resulting
+	 * in crashes on ancient glibcs and errors on newer glibcs.
+	 *
+	 * Luckily, because the same address is used for CLONE_PARENT_SETTID,
+	 * we can poke around in glibc's internal cache by getting the address
+	 * using PR_GET_TID_ADDRESS (only available in Linux >= 3.5, with
+	 * CONFIG_CHECKPOINT_RESTORE=y) and then overwriting it with
+	 * CLONE_CHILD_SETTID. CLONE_CHILD_CLEARTID is set to allow descendant
+	 * PR_GET_TID_ADDRESS calls to work, as well as matching what glibc
+	 * does in arch_fork().
+	 *
+	 * Yes, this is pretty horrific, but the core issue here is that we
+	 * need to run Go code ("runc init") in the child after fork(), which
+	 * is not allowed by glibc (see signal-safety(7)). We cannot exec to
+	 * solve the problem because we are in a security critical situation
+	 * here, and doing an exec would allow for container escapes (obvious
+	 * issues include that the shared libraries loaded from a re-exec would
+	 * come from the container, and doing an exec here would clear the bit
+	 * that makes non-dumpable flags effective for userns containers with
+	 * CAP_SYS_PTRACE).
+	 */
+	pid_t *tid_addr = NULL;
+	if (prctl(PR_GET_TID_ADDRESS, &tid_addr) < 0)
+		/* what should we do here... */;
+	write_log(DEBUG, "nsenter clone: get_tid_address gave us %p (pthread_self=%p)", tid_addr, (void *) pthread_self());
+	if (!tid_addr || *tid_addr != gettid())
+		write_log(WARNING, "nsenter clone: glibc private tid address is wrong: *%p %d != gettid() %d", tid_addr, tid_addr ? *tid_addr : -1, gettid());
+
+	return clone(child_func, ca.stack_ptr,
+		     CLONE_PARENT | CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID | SIGCHLD, &ca,
+		     NULL /* parent_tid */ , NULL /* tls */ , tid_addr);
 }
 
 /* Returns the clone(2) flag for a namespace, given the name of a namespace. */

@kolyshkin wdyt?

Apr 13 '24 15:04 cyphar