
Using nerdctl with rootless k3s

Open hinshun opened this issue 1 year ago • 11 comments

Description

Ideally I'd like to have a single rootless stack of k3s + containerd + image builder (e.g. buildkitd). We want to use nerdctl with the rootless k3s embedded containerd.

With the following upstream contributions:

  • https://github.com/k3s-io/k3s/pull/9308
  • https://github.com/k3s-io/k3s/pull/9309

We can now set the following to point nerdctl to the rootless k3s containerd:

export ROOTLESSKIT_STATE_DIR="$HOME/.rancher/k3s/rootless"
export CONTAINERD_ADDRESS="$XDG_RUNTIME_DIR/k3s/containerd/containerd.sock"
export CONTAINERD_NAMESPACE="k8s.io"
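
Before running nerdctl against this setup, it can help to confirm that the paths those variables point at actually exist. The following is a minimal sketch, not part of nerdctl or k3s: `check_rootless_env` is a hypothetical helper name, and the `child_pid` file under the state directory is assumed to be where rootlesskit records the PID that is later handed to nsenter.

```shell
#!/bin/sh
# Sketch: check that the paths nerdctl will rely on exist.
# check_rootless_env is a hypothetical helper, not a nerdctl or k3s command.
check_rootless_env() {
    state_dir="$1"   # e.g. "$ROOTLESSKIT_STATE_DIR"
    sock="$2"        # e.g. "$CONTAINERD_ADDRESS"
    status=0
    # Assumption: rootlesskit writes the child PID to <state_dir>/child_pid.
    if [ ! -r "$state_dir/child_pid" ]; then
        echo "missing: $state_dir/child_pid"
        status=1
    fi
    # The containerd address is expected to be a Unix domain socket.
    if [ ! -S "$sock" ]; then
        echo "missing socket: $sock"
        status=1
    fi
    return "$status"
}
```

For example, `check_rootless_env "$ROOTLESSKIT_STATE_DIR" "$CONTAINERD_ADDRESS" && nerdctl image ls` would only call nerdctl once both paths are present.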

We can use several commands like nerdctl image ls successfully, but when attempting to run a container, it fails to readlink /proc/self/exe:

time="2024-02-16T07:32:44Z" level=debug msg="stateDir: /home/rootless/.rancher/k3s/rootless"
time="2024-02-16T07:32:44Z" level=debug msg="rootless parent main: executing \"/run/current-system/sw/bin/nsenter\" with [-r/ -w/home/rootless --preserve-credentials -m -n -U -t 1035 -F /run/current-system/sw/bin/nerdctl --debug-full run ghcr.io/pdtpartners/hello]"
time="2024-02-16T07:32:44Z" level=warning msg="cannot call os.Executable(), assuming the executable to be \"/run/current-system/sw/bin/nerdctl\"" error="readlink /proc/self/exe: no such file or directory"
time="2024-02-16T07:32:44Z" level=debug msg="verifying process skipped"
time="2024-02-16T07:32:46Z" level=debug msg="Failed to unmount snapshot \"/tmp/initialC3943988990\""
time="2024-02-16T07:32:46Z" level=fatal msg="readlink /proc/self/exe: no such file or directory"

I tried getting around it by patching nerdctl:

diff --git a/pkg/cmd/container/create.go b/pkg/cmd/container/create.go
index ca40bbe4..d205be26 100644
--- a/pkg/cmd/container/create.go
+++ b/pkg/cmd/container/create.go
@@ -406,7 +406,8 @@ func withBindMountHostIPC(_ context.Context, _ oci.Client, _ *containers.Contain
 func GenerateLogURI(dataStore string) (*url.URL, error) {
 	selfExe, err := os.Executable()
 	if err != nil {
-		return nil, err
+		log.L.WithError(err).Warnf("cannot call os.Executable(), assuming the executable to be %q", os.Args[0])
+		selfExe = os.Args[0]
 	}
 	args := map[string]string{
 		logging.MagicArgv1: dataStore,

It gets a little further but still has trouble with /proc/self/fd:

time="2024-02-16T07:39:21Z" level=debug msg="stateDir: /home/rootless/.rancher/k3s/rootless"
time="2024-02-16T07:39:21Z" level=debug msg="rootless parent main: executing \"/run/current-system/sw/bin/nsenter\" with [-r/ -w/home/rootless --preserve-credentials -m -n -U -t 1019 -F /run/current-system/sw/bin/nerdctl --debug-full run ghcr.io/pdtpartners/hello]"
time="2024-02-16T07:39:21Z" level=warning msg="cannot call os.Executable(), assuming the executable to be \"/run/current-system/sw/bin/nerdctl\"" error="readlink /proc/self/exe: no such file or directory"
time="2024-02-16T07:39:21Z" level=debug msg="verifying process skipped"
time="2024-02-16T07:39:24Z" level=debug msg="Failed to unmount snapshot \"/tmp/initialC3550973183\""
time="2024-02-16T07:39:24Z" level=warning msg="cannot call os.Executable(), assuming the executable to be \"/run/current-system/sw/bin/nerdctl\"" error="readlink /proc/self/exe: no such file or directory"
time="2024-02-16T07:39:24Z" level=debug msg="generated log driver: binary:///run/current-system/sw/bin/nerdctl?_NERDCTL_INTERNAL_LOGGING=%2Fvar%2Flib%2Fnerdctl%2F4a156993"
time="2024-02-16T07:39:24Z" level=debug msg="remote introspection plugin filters" filters="[type==io.containerd.snapshotter.v1, id==nix]"
time="2024-02-16T07:39:24Z" level=fatal msg="failed to open stdout fifo: couldn't stat /proc/self/fd/7: stat /proc/self/fd/7: no such file or directory"

I suspect the root cause is that rootless k3s sets up a PIDNS (see: https://github.com/k3s-io/k3s/blob/v1.29.1%2Bk3s2/pkg/rootless/rootless.go#L144), although the PIDNS is required for cgroup v2 evacuation.
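
One general way to get these errors is for the visible procfs to belong to a PID namespace that the calling process is not a member of; in that case /proc/self does not resolve, and neither do /proc/self/exe or /proc/self/fd. Whether that is what happens here is an assumption, not something confirmed in this thread. A minimal sketch of a check follows; `proc_self_check` is a hypothetical helper name, and its optional argument is the procfs mount point to inspect.

```shell
#!/bin/sh
# Sketch: report whether a procfs mount has usable "self" entries for the caller.
# proc_self_check is a hypothetical helper; argument 1 defaults to /proc.
proc_self_check() {
    proc_root="${1:-/proc}"
    status=0
    for entry in self/exe self/fd; do
        # -L also accepts a symlink whose target no longer exists.
        if [ -e "$proc_root/$entry" ] || [ -L "$proc_root/$entry" ]; then
            echo "ok: $proc_root/$entry"
        else
            echo "missing: $proc_root/$entry"
            status=1
        fi
    done
    return "$status"
}
```

Running `proc_self_check` once in a regular login shell and once inside the namespaces that nerdctl enters would show whether the two environments differ in this respect.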

Do you have any ideas? cc @AkihiroSuda

Describe the results you received and expected

Possible to run containers using nerdctl with rootless k3s containerd

What version of nerdctl are you using?

v1.7.0

hinshun avatar Feb 16 '24 07:02 hinshun

nerdctl should not (and does not) enter the PIDNS set up by rootless k3s. Do you change the PID namespace of nerdctl at any point?

fahedouch avatar Feb 16 '24 16:02 fahedouch

nerdctl should not (and does not) enter the PIDNS set up by rootless k3s. Do you change the PID namespace of nerdctl at any point?

I did not. This is failing with stock v1.7.0, my patch above was just for investigation. I see what you mean though, you’re saying for the rootless child /proc/self/exe should be available since it didn’t enter the PIDNS?

To be honest, I’m unfamiliar with the conditions under which readlink /proc/self/exe could fail. I will provide a docker run environment for reproducing this.

hinshun avatar Feb 16 '24 22:02 hinshun

you’re saying for the rootless child /proc/self/exe should be available since it didn’t enter the PIDNS?

Nope, I am saying that nerdctl does not enter the PIDNS, but the rootless child does. Here, it is a nerdctl issue, so this may not be related to the PIDNS.

fahedouch avatar Feb 17 '24 14:02 fahedouch

Nope, I am saying that nerdctl does not enter the PIDNS, but the rootless child does. Here, it is a nerdctl issue, so this may not be related to the PIDNS.

I meant the rootless child of nerdctl, so I think we're saying the same thing!

I built and pushed a Docker image to ghcr.io/pdtpartners/nix-snapshotter that reproduces the issue. The entrypoint of the image launches a non-GUI QEMU VM running NixOS, with rootless k3s in a systemd user service:

docker run --rm -it ghcr.io/pdtpartners/nix-snapshotter:rootless

nixos login: rootless # (Ctrl-a then x to quit)
Password: rootless

[rootless@nixos:~]$ nerdctl run --debug-full hello-world
DEBU[0000] stateDir: /home/rootless/.rancher/k3s/rootless
DEBU[0000] rootless parent main: executing "/run/current-system/sw/
WARN[0000] cannot call os.Executable(), assuming the executable to "
DEBU[0000] verifying process skipped
FATA[0000] readlink /proc/self/exe: no such file or directory

Would appreciate some help, so I've provided a cheat sheet:

$ echo $ROOTLESSKIT_STATE_DIR
/home/rootless/.rancher/k3s/rootless

$ echo $CONTAINERD_ADDRESS
/run/user/1000/k3s/containerd/containerd.sock

$ echo $CONTAINERD_NAMESPACE
k8s.io

# Show the rootless k3s systemd user service
$ systemctl --user status k3s

# Enter the namespaces set up by k3s's rootlesskit
# The options here match the ones nerdctl uses.
$ nsenter -r/ --preserve-credentials -m -n -U -F -t "$(cat "$ROOTLESSKIT_STATE_DIR/child_pid")"

# Note that commands that don't require nerdctl to nsenter work fine
# Just need to wait until k3s is healthy
$ kubectl get nodes
NAME    STATUS   ROLES                  AGE     VERSION
nixos   Ready    control-plane,master   9m58s   v1.27.9+k3s1

$ nerdctl image ls
REPOSITORY                          TAG                     IMAGE ID        CREATED               PLATFORM       SIZE         BLOB SIZE
rancher/klipper-helm                v0.8.2-build20230815    b0b0c4f73f23    10 minutes ago        linux/amd64    244.7 MiB    86.7 MiB
# ...

# k3s's embedded containerd state dir:
$ ls ~/.rancher/k3s/agent/containerd/

# Look at containerd logs
$ cat ~/.rancher/k3s/agent/containerd/containerd.log

# User `rootless` is a sudoer inside this QEMU VM in case you need it
$ sudo su
Password: rootless
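
The nsenter invocation in the cheat sheet passes -m, -n and -U but no PID-namespace option, so the entered process keeps its original PID namespace while switching mount namespaces. To compare PID namespaces directly, a sketch like the following could be used. `same_pidns` is a hypothetical helper name; it assumes the `ns/pid` links of both processes are readable, and whether the PID recorded in `child_pid` itself lives in the new PID namespace is also an assumption.

```shell
#!/bin/sh
# Sketch: compare the PID namespaces of two processes via their ns/pid links.
# same_pidns is a hypothetical helper; argument 3 is the procfs root (default /proc).
# Returns 0 if the namespaces match, 1 if they differ, 2 if a link is unreadable.
same_pidns() {
    proc_root="${3:-/proc}"
    a=$(readlink "$proc_root/$1/ns/pid") || return 2
    b=$(readlink "$proc_root/$2/ns/pid") || return 2
    [ "$a" = "$b" ]
}
```

For example, `same_pidns self "$(cat "$ROOTLESSKIT_STATE_DIR/child_pid")"` compares the current shell's PID namespace with that of the rootlesskit child.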

hinshun avatar Feb 18 '24 14:02 hinshun

"cgroup v2 evacuation" is quite complex, maybe k3s should just depend on k3d with rootless (Docker|Podman|nerdctl) to reimplement the rootless mode as in Usernetes Gen2

Slides: https://github.com/AkihiroSuda/AkihiroSuda/blob/master/slides/2024/20240201%20%5BHPC%20Containers%5D%20Rootless%20Containers.pdf

AkihiroSuda avatar Feb 18 '24 16:02 AkihiroSuda

Can you elaborate why “cgroup v2 evacuation” might be related to this readlink /proc/self/exe issue?

hinshun avatar Feb 18 '24 16:02 hinshun

Can you elaborate why “cgroup v2 evacuation” might be related to this readlink /proc/self/exe issue?

This doesn't seem directly related to cgroup per se, but as you mentioned in the OP this incurs unsharing PIDNS and mounting a new procfs, which seems related to /proc/self/exe errors

AkihiroSuda avatar Feb 18 '24 16:02 AkihiroSuda

Would it make sense if rootless k3s had an option to run without “cgroup v2 evacuation” & PIDNS? I’m not sure what it does, so it’s unclear to me whether that’s reasonable or not.

hinshun avatar Feb 18 '24 16:02 hinshun

Would it make sense if rootless k3s had an option to run without “cgroup v2 evacuation” & PIDNS? I’m not sure what it does, so it’s unclear to me whether that’s reasonable or not.

No, Kubernetes pods will not start then due to lack of access to cgroup

AkihiroSuda avatar Feb 18 '24 16:02 AkihiroSuda

Can you elaborate why “cgroup v2 evacuation” might be related to this readlink /proc/self/exe issue?

This doesn't seem directly related to cgroup per se, but as you mentioned in the OP this incurs unsharing PIDNS and mounting a new procfs, which seems related to /proc/self/exe errors

@AkihiroSuda I really don't understand how unsharing a new PID namespace should impact nerdctl here. nerdctl does not change its PID namespace.

fahedouch avatar Feb 18 '24 18:02 fahedouch

@AkihiroSuda @fahedouch Is there anything else I can provide to help? I don't think this is specific to nix-snapshotter; it seems to affect nerdctl <-> rootless k3s in general. I would love to have a full Docker-like UX with rootless containerd and Kubernetes.

hinshun avatar Mar 24 '24 02:03 hinshun