Using nerdctl with rootless k3s
Description
Ideally I'd like to have a single rootless stack of k3s + containerd + image builder (e.g. buildkitd). To that end, I want to use nerdctl with the containerd embedded in rootless k3s.
With the following upstream contributions:
- https://github.com/k3s-io/k3s/pull/9308
- https://github.com/k3s-io/k3s/pull/9309
We can now set the following to point nerdctl to the rootless k3s containerd:
export ROOTLESSKIT_STATE_DIR="$HOME/.rancher/k3s/rootless"
export CONTAINERD_ADDRESS="$XDG_RUNTIME_DIR/k3s/containerd/containerd.sock"
export CONTAINERD_NAMESPACE="k8s.io"
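Before running nerdctl, it may help to sanity-check that these variables point at a live rootless k3s instance. The following is only a minimal sketch and not part of k3s or nerdctl: the function names are made up for illustration, and the one thing it relies on is that RootlessKit writes a `child_pid` file into its state directory (the same file the nsenter command later in this issue reads).

```shell
#!/bin/sh
# Hypothetical helpers (not part of k3s or nerdctl).

# Succeeds if DIR/child_pid exists and is non-empty.
check_state_dir() {
    [ -s "$1/child_pid" ]
}

# Succeeds if the given path is a Unix domain socket.
check_containerd_socket() {
    [ -S "$1" ]
}

# Prints one line per check instead of aborting on the first failure.
check_k3s_rootless_env() {
    if check_state_dir "${ROOTLESSKIT_STATE_DIR:-}"; then
        echo "ok: child_pid found in $ROOTLESSKIT_STATE_DIR"
    else
        echo "missing: child_pid in ${ROOTLESSKIT_STATE_DIR:-<unset>}"
    fi
    if check_containerd_socket "${CONTAINERD_ADDRESS:-}"; then
        echo "ok: socket at $CONTAINERD_ADDRESS"
    else
        echo "missing: socket at ${CONTAINERD_ADDRESS:-<unset>}"
    fi
}
```

Calling `check_k3s_rootless_env` after the exports above should print two `ok:` lines when k3s is up; a `missing:` line suggests the service is not running or a path is wrong.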
We can use several commands like nerdctl image ls successfully, but when attempting to run a container, it fails to readlink /proc/self/exe:
time="2024-02-16T07:32:44Z" level=debug msg="stateDir: /home/rootless/.rancher/k3s/rootless"
time="2024-02-16T07:32:44Z" level=debug msg="rootless parent main: executing \"/run/current-system/sw/bin/nsenter\" with [-r/ -w/home/rootless --preserve-credentials -m -n -U -t 1035 -F /run/current-system/sw/bin/nerdctl --debug-full run ghcr.io/pdtpartners/hello]"
time="2024-02-16T07:32:44Z" level=warning msg="cannot call os.Executable(), assuming the executable to be \"/run/current-system/sw/bin/nerdctl\"" error="readlink /proc/self/exe: no such file or directory"
time="2024-02-16T07:32:44Z" level=debug msg="verifying process skipped"
time="2024-02-16T07:32:46Z" level=debug msg="Failed to unmount snapshot \"/tmp/initialC3943988990\""
time="2024-02-16T07:32:46Z" level=fatal msg="readlink /proc/self/exe: no such file or directory"
I tried getting around it by patching nerdctl:
diff --git a/pkg/cmd/container/create.go b/pkg/cmd/container/create.go
index ca40bbe4..d205be26 100644
--- a/pkg/cmd/container/create.go
+++ b/pkg/cmd/container/create.go
@@ -406,7 +406,8 @@ func withBindMountHostIPC(_ context.Context, _ oci.Client, _ *containers.Contain
func GenerateLogURI(dataStore string) (*url.URL, error) {
selfExe, err := os.Executable()
if err != nil {
- return nil, err
+ log.L.WithError(err).Warnf("cannot call os.Executable(), assuming the executable to be %q", os.Args[0])
+ selfExe = os.Args[0]
}
args := map[string]string{
logging.MagicArgv1: dataStore,
It gets a little further but still has trouble with /proc/self/fd:
time="2024-02-16T07:39:21Z" level=debug msg="stateDir: /home/rootless/.rancher/k3s/rootless"
time="2024-02-16T07:39:21Z" level=debug msg="rootless parent main: executing \"/run/current-system/sw/bin/nsenter\" with [-r/ -w/home/rootless --preserve-credentials -m -n -U -t 1019 -F /run/current-system/sw/bin/nerdctl --debug-full run ghcr.io/pdtpartners/hello]"
time="2024-02-16T07:39:21Z" level=warning msg="cannot call os.Executable(), assuming the executable to be \"/run/current-system/sw/bin/nerdctl\"" error="readlink /proc/self/exe: no such file or directory"
time="2024-02-16T07:39:21Z" level=debug msg="verifying process skipped"
time="2024-02-16T07:39:24Z" level=debug msg="Failed to unmount snapshot \"/tmp/initialC3550973183\""
time="2024-02-16T07:39:24Z" level=warning msg="cannot call os.Executable(), assuming the executable to be \"/run/current-system/sw/bin/nerdctl\"" error="readlink /proc/self/exe: no such file or directory"
time="2024-02-16T07:39:24Z" level=debug msg="generated log driver: binary:///run/current-system/sw/bin/nerdctl?_NERDCTL_INTERNAL_LOGGING=%2Fvar%2Flib%2Fnerdctl%2F4a156993"
time="2024-02-16T07:39:24Z" level=debug msg="remote introspection plugin filters" filters="[type==io.containerd.snapshotter.v1, id==nix]"
time="2024-02-16T07:39:24Z" level=fatal msg="failed to open stdout fifo: couldn't stat /proc/self/fd/7: stat /proc/self/fd/7: no such file or directory"
My guess is that the root cause is the PID namespace (PIDNS) that rootless k3s sets up (see https://github.com/k3s-io/k3s/blob/v1.29.1%2Bk3s2/pkg/rootless/rootless.go#L144), although that PIDNS is required for cgroup v2 evacuation.
Do you have any ideas? cc @AkihiroSuda
Describe the results you received and expected
Possible to run containers using nerdctl with rootless k3s containerd
What version of nerdctl are you using?
v1.7.0
nerdctl should not (does not) enter the PIDNS setup by rootless k3s. Do you change the PID namespace of nerdctl at any point?
I did not. This is failing with stock v1.7.0, my patch above was just for investigation. I see what you mean though, you’re saying for the rootless child /proc/self/exe should be available since it didn’t enter the PIDNS?
To be honest, I’m unfamiliar with the conditions under which readlink /proc/self/exe can fail. I will provide a docker run environment for reproducing this.
you’re saying for the rootless child /proc/self/exe should be available since it didn’t enter the PIDNS?
Nope, I am saying that nerdctl does not enter the PIDNS, but the rootless child does. Here it is a nerdctl issue, so this may not be related to the PIDNS.
I meant the rootless child of nerdctl, so I think we're saying the same thing!
I built and pushed a Docker image to ghcr.io/pdtpartners/nix-snapshotter that reproduces the issue. The image's entrypoint launches a non-GUI QEMU VM running NixOS, with rootless k3s as a systemd user service:
docker run --rm -it ghcr.io/pdtpartners/nix-snapshotter:rootless
nixos login: rootless # (Ctrl-a then x to quit)
Password: rootless
[rootless@nixos:~]$ nerdctl run --debug-full hello-world
DEBU[0000] stateDir: /home/rootless/.rancher/k3s/rootless
DEBU[0000] rootless parent main: executing "/run/current-system/sw/
WARN[0000] cannot call os.Executable(), assuming the executable to "
DEBU[0000] verifying process skipped
FATA[0000] readlink /proc/self/exe: no such file or directory
Would appreciate some help, so I've provided a cheat sheet:
$ echo $ROOTLESSKIT_STATE_DIR
/home/rootless/.rancher/k3s/rootless
$ echo $CONTAINERD_ADDRESS
/run/user/1000/k3s/containerd/containerd.sock
$ echo $CONTAINERD_NAMESPACE
k8s.io
# Show the rootless k3s systemd user service
$ systemctl --user status k3s
# Enter the namespaces set up by k3s's rootlesskit
# The options here match the ones nerdctl uses.
$ nsenter -r/ --preserve-credentials -m -n -U -F -t $(cat $ROOTLESSKIT_STATE_DIR/child_pid)
# Note that commands that don't require nerdctl to nsenter work fine
# Just need to wait until k3s is healthy
$ kubectl get nodes
NAME    STATUS   ROLES                  AGE     VERSION
nixos   Ready    control-plane,master   9m58s   v1.27.9+k3s1
$ nerdctl image ls
REPOSITORY              TAG                     IMAGE ID        CREATED           PLATFORM       SIZE         BLOB SIZE
rancher/klipper-helm    v0.8.2-build20230815    b0b0c4f73f23    10 minutes ago    linux/amd64    244.7 MiB    86.7 MiB
# ...
# k3s's embedded containerd state dir:
$ ls ~/.rancher/k3s/agent/containerd/
# Look at containerd logs
$ cat ~/.rancher/k3s/agent/containerd/containerd.log
# User `rootless` is a sudoer inside this QEMU VM in case you need it
$ sudo su
Password: rootless
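The nsenter invocation in the cheat sheet passes -m, -n and -U but no PID-namespace flag, so one further check that might be useful is whether a normal login shell and the RootlessKit child really sit in different PID namespaces. This is only a sketch: `pidns_of` and `same_pidns` are hypothetical helper names, `PROC_ROOT` exists purely so the helpers can be pointed at another procfs mount, and it assumes `/proc/<pid>/ns/pid` is readable by your user.

```shell
#!/bin/sh
# Hypothetical helpers for comparing PID namespaces. Run them from a
# normal login shell, before entering any namespaces.
PROC_ROOT="${PROC_ROOT:-/proc}"

# Prints the PID namespace identifier of a process, e.g. "pid:[4026531836]".
pidns_of() {
    readlink "$PROC_ROOT/$1/ns/pid"
}

# Returns 0 if both PIDs share a PID namespace, 1 if they differ,
# and 2 if either namespace link cannot be read.
same_pidns() {
    a=$(pidns_of "$1") || return 2
    b=$(pidns_of "$2") || return 2
    [ "$a" = "$b" ]
}
```

For example, `same_pidns $$ "$(cat "$ROOTLESSKIT_STATE_DIR/child_pid")"` returning 1 would be consistent with the PIDNS theory, though it would not prove it.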
"cgroup v2 evacuation" is quite complex. Maybe k3s should just depend on k3d with rootless (Docker|Podman|nerdctl) to reimplement its rootless mode, as in Usernetes Gen2:
https://github.com/AkihiroSuda/AkihiroSuda/blob/master/slides/2024/20240201%20%5BHPC%20Containers%5D%20Rootless%20Containers.pdf
Can you elaborate why “cgroup v2 evacuation” might be related to this readlink /proc/self/exe issue?
This doesn't seem directly related to cgroups per se, but as you mentioned in the OP, it involves unsharing the PIDNS and mounting a new procfs, which seems related to the /proc/self/exe errors.
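One way to probe this hypothesis might be to check, from inside the entered mount namespace, whether procfs can resolve the calling process at all. As far as I understand, a procfs instance only lists processes visible in the PID namespace it was mounted for, so a process that enters the mount namespace without also entering the PID namespace would find `/proc/self` dangling. A minimal sketch under that assumption; `proc_self_ok` is a hypothetical helper, not a nerdctl or k3s command:

```shell
#!/bin/sh
# Hypothetical diagnostic: reports whether the procfs mounted at the given
# root (default /proc) can resolve "self". Note that "self" here is the
# short-lived readlink process, which runs in the same namespaces as the
# shell that spawned it.
proc_self_ok() {
    root="${1:-/proc}"
    if target=$(readlink "$root/self/exe"); then
        echo "ok: $root/self/exe -> $target"
    else
        echo "broken: cannot readlink $root/self/exe"
        return 1
    fi
}
```

Running `proc_self_ok` once in a normal shell and once after the nsenter command from the cheat sheet should show whether the two procfs views differ.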
Would it make sense if rootless k3s had an option to run without “cgroup v2 evacuation” & PIDNS? I’m not sure what it does, so it’s unclear to me whether that’s reasonable or not.
No, Kubernetes pods would then fail to start due to lack of access to cgroups.
@AkihiroSuda I really don't understand how unsharing a new PID namespace would affect nerdctl here. nerdctl does not change its PID namespace.
@AkihiroSuda @fahedouch Is there anything else I can provide to help? I don't think this is specific to nix-snapshotter; it looks like a general nerdctl <-> rootless k3s issue. I would love to have a full Docker-like UX with rootless containerd and Kubernetes.