runc nsenter: overwrite glibc's internal tid cache on clone()

An alternative to #4193.

I will also take a look in a separate PR as to whether we can remove a stage from our three-stage structure (such as moving the mapping requests to the Go parent, maybe?) to remove the need to PR_SET_CHILD_SUBREAPER in #4193. If we do it that way then switching to fork() would be far less of an issue.

Since glibc 2.25, the thread-local cache of the current TID is no longer updated in the child when calling clone(2). This results in very unfortunate behaviour when Go does pthread calls using pthread_self(), which has the wrong TID stored.

The "simple" solution is to forcefully overwrite this cached value. Unfortunately (and unsurprisingly), the layout of "struct pthread" is strictly private and can change without warning.

Luckily, glibc (currently) uses CLONE_CHILD_CLEARTID for all forks (with the child_tid set to the cached &PTHREAD_SELF->tid), meaning that as long as runc is using glibc, when "runc init" is spawned the child process will have a pointer directly to the cached value we want to change. With CONFIG_CHECKPOINT_RESTORE=y kernels on Linux 3.5 and later, we can simply use prctl(PR_GET_TID_ADDRESS). For older kernels we need to memory scan the TLS structure (pthread_self() returns a pointer to the start of the structure so we can "just" scan it for a field containing the current TID and assume that it is the correct field).

Obviously this is all very horrific, and if you are reading this in the future, it almost certainly has caused some horrific bug that I did not forsee. Sorry about that. As far as I can tell, there is no other workable solution that doesn't also depend on the CLONE_CHILD_CLEARTID behaviour of glibc in some way. We cannot "just" do a re-exec after clone(2) for security reasons.

Fixes #4233 Signed-off-by: Aleksa Sarai [email protected]

Apr 14 '24 17:04 cyphar

Since glibc 2.25,

So maybe we can ifdef the kludge with something like __GLIBC_PREREQ(2, 25)

Apr 14 '24 19:04 kolyshkin

Was thinking of using a list of known offsets first, before resorting to linear scan. Can try different distros to get those offsets (I'm pretty sure there are only 2 or 3).

Yeah, I was thinking of that too but wasn't sure how variable all the structures are before PTHREAD_SELF->tid. However, looking at it again, apparently tcbhead_t's size is required to stay the same because of AddressSanitizer, so there might only be a couple of offsets we need in almost all cases. Then again, tcbhead_t is arch-specific (and the AddressSanitizer comment is only in sysdeps/x86_64/nptl/tls.h) -- though since quite a few have the same layout there's probably less than 10 we need...

512 as a upper size limit is also looking scary (withput haven't checked the actual struct pthread size).

Yeah, unfortunately we can't be sure of the size of struct pthread. On my machine the tid is at pthread_self+0x2d0 (index 180) so my initial guess of 256 felt a little small. But looking at the definitions of tcbhead_t, it seems amd64 has the largest structure of any architecture and so 180 is probably about the largest offset we could ever get.

Maybe move the kludge to a separate function, and #ifdef it out unless we're using GLIBC.

Will do.

So maybe we can ifdef the kludge with something like __GLIBC_PREREQ(2, 25)

What matters is the runtime version, so we'd need to parse gnu_get_libc_version if we really want to gate it. glibc has used CLONE_CHILD_CLEARTID since at least 2003 so the hack will work for ancient glibcs. I'm not sure if parsing gnu_get_libc_version is nicer.

Apr 15 '24 07:04 cyphar

@kolyshkin I went with writing dummy versions of the structure layouts so we can just use sizeof and offsetof. Would you prefer it just be a list of magic constants (which would be much shorter, but would need some care taken for pointer sizes for architectures where there are 32- and 64-bit versions)?

Apr 15 '24 15:04 cyphar

On musl, it seems they:

Use CLONE_CHILD_CLEARTID for some TLS magic to indicate when a thread has died, which means that PR_GET_TID_ADDRESS will return the wrong address.
Don't allow setting CLONE_CHILD_CLEARTID for the clone syscall.

So for musl we would need to do the linear scan and then set the address manually using set_tid_address(2). While Go does support musl builds, should we put any effort into making sure this hack works with musl? AFAICS the pthreads_getattr_np(pthread_self(), ...) issue will also apply to musl? @kolyshkin ?

Apr 17 '24 13:04 cyphar

So for musl we would need to do the linear scan and then set the address manually using set_tid_address(2). While Go does support musl builds, should we put any effort into making sure this hack works with musl? AFAICS the pthreads_getattr_np(pthread_self(), ...) issue will also apply to musl? @kolyshkin ?

I did some bare bone testing using the following Dockerfile (had to use alpine edge as the latest tagged version still uses go 1.21):

FROM alpine:edge
RUN apk add --no-cache git make go curl gawk iptables pkgconf python3 sshfs sudo iproute2 bats libseccomp-dev util-linux vim jq
RUN git config --global --add safe.directory /go/src/github.com/opencontainers/runc

WORKDIR /go/src/github.com/opencontainers/runc

With the above Dockerfile, I built runc and ran some tests (make all && bats tests/integration/exec.bats), for some reason it works (found some other non-related bugs, will send a PR or two). Will dig deeper later; maybe we also need to add a CI job doing that.

For now, I think, it does not make sense to fix what's not broken.

Apr 18 '24 19:04 kolyshkin

@kolyshkin It probably works because Go 1.22 still ignores the error from pthread_getattr_np and so it continues running as though the pthread request worked. The proposed patch in the original Go issue thread would cause an error to panic, but they haven't merged it yet.

I would also prefer to not merge this code if it isn't necessary, but I'm worried there is the possibility of some part of Go's runtime hitting an issue on production machines. Then again, as far as I can tell from looking into it, in practice things seem to just happen to work out.

pthread_getattr_np is used to determine the bounds of the stack (see src/runtime/cgo/gcc_stack_unix.c) for two purposes:

To handle a possible case where CGo gets called on a thread where the Go runtime hasn't run (see src/runtime/cgocall.go).
To figure out what stack size is needed for threads when the runtime spawns threads in CGo-enabled Go processes (see src/runtime/cgo/gcc_linux.c).

As far as I can tell, both uses just happen to work:

We are never in a situation where callbackUpdateSystemStack will be called, but in addition callbackUpdateSystemStack assumes that _cgo_getstackbound can fail silently (as it does on Windows) and so it has some reasonable fallbacks that it uses if the returned stacks are NULL.
pthread_create will pick reasonable defaults when called with an attribute struct that has a zero-sized stack size configured (see allocate_stack).

Now, it is entirely possible that some other aspect of Go could have issues with our clone setup. But I agree that unless they merge the PR where they will panic Go if pthread_getattr_np(pthread_self(), ...) returns an error, this patch is probably not necessary...

Apr 20 '24 06:04 cyphar

The only question remains, do we need this kludge for all Go versions of only for Go 1.22+. I tentatively vote for the latter, based on "don't fix if not broken" principle.

Can do this adding a go file like this:

//go:build go1.22

package nsenter

// #cgo CFLAGS: -DRUNC_GLIBC_TID_KLUDGE=1
import "C"

Apr 23 '24 01:04 kolyshkin

@cyphar shall we move this out of draft?

Apr 26 '24 18:04 kolyshkin

Is this still WIP?

May 09 '24 21:05 AkihiroSuda

Let's go ahead with #4292 for now.

Jun 06 '24 00:06 kolyshkin