sysbox k8s directory mounted as nobody
Here is the situation: we are running Sysbox in GKE (to run Coder), and we have a mount for Docker backed by a PVC. Sometimes when a pod restarts, /var/lib/docker ends up being owned by nobody:nogroup in the pod:
root@coder:/# ls -lah /var/lib
drwx--x--- 12 nobody nogroup 4.0K May 6 12:30 docker
Restarting the pod a bunch of times ends up fixing the issue, but I have not been able to figure out why/how. I suspect this issue happens when the pod gets scheduled on a different node?
This is quite disruptive, as the only way out is to delete that pod and make a new one, losing the PVC and the associated data...
pod.yaml
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      coder.workspace_id: e832bafe-2d57-4d56-8e53-a807a86d0869
  strategy:
    type: Recreate
  template:
    metadata:
      annotations:
        io.kubernetes.cri-o.userns-mode: auto:size=65536
      creationTimestamp: null
      labels:
        coder.workspace_id: e832bafe-2d57-4d56-8e53-a807a86d0869
    spec:
      automountServiceAccountToken: true
      containers:
      - command:
        - sh
        - -c
        - |
          set -e

          W_USER=MYUSER

          # Add a user so that you're not developing as the `root` user
          useradd $W_USER \
            --create-home \
            --shell=/bin/bash \
            --groups=docker \
            --uid=1000 \
            --user-group
          echo "$W_USER ALL=(ALL) NOPASSWD:ALL" >>/etc/sudoers.d/nopasswd

          # Start the Coder agent as the user once systemd has started up
          # /!\ The space before EOT must match the current indenting of the terminating one!
          sudo -u $W_USER --preserve-env=CODER_AGENT_TOKEN /bin/bash -- <<-' EOT' &
          while [[ ! $(systemctl is-system-running) =~ ^(running|degraded) ]]
          do
            echo "Waiting for system to start... $(systemctl is-system-running)"
            sleep 2
          done
          #!/usr/bin/env sh
          set -eux
          # Sleep for a good long while before exiting.
          # This is to allow folks to exec into a failed workspace and poke around to
          # troubleshoot.
          waitonexit() {
            echo "=== Agent script exited with non-zero code. Sleeping 24h to preserve logs..."
            sleep 86400
          }
          trap waitonexit EXIT
          BINARY_DIR="${BINARY_DIR:-$(mktemp -d -t coder.XXXXXX)}"
          BINARY_NAME=coder
          BINARY_URL=https://coder.company.com/bin/coder-linux-amd64
          cd "$BINARY_DIR"
          # Attempt to download the coder agent.
          # This could fail for a number of reasons, many of which are likely transient.
          # So just keep trying!
          while :; do
            # Try a number of different download tools, as we do not know what we
            # will have available.
            status=""
            if command -v curl >/dev/null 2>&1; then
              curl -fsSL --compressed "${BINARY_URL}" -o "${BINARY_NAME}" && break
              status=$?
            elif command -v wget >/dev/null 2>&1; then
              wget -q "${BINARY_URL}" -O "${BINARY_NAME}" && break
              status=$?
            elif command -v busybox >/dev/null 2>&1; then
              busybox wget -q "${BINARY_URL}" -O "${BINARY_NAME}" && break
              status=$?
            else
              echo "error: no download tool found, please install curl, wget or busybox wget"
              exit 127
            fi
            echo "error: failed to download coder agent"
            echo " command returned: ${status}"
            echo "Trying again in 30 seconds..."
            sleep 30
          done

          if ! chmod +x $BINARY_NAME; then
            echo "Failed to make $BINARY_NAME executable"
            exit 1
          fi

          haslibcap2() {
            command -v setcap /dev/null 2>&1
            command -v capsh /dev/null 2>&1
          }
          printnetadminmissing() {
            echo "The root user does not have CAP_NET_ADMIN permission. " + \
              "If running in Docker, add the capability to the container for " + \
              "improved network performance."
            echo "This has security implications. See https://man7.org/linux/man-pages/man7/capabilities.7.html"
          }

          # Attempt to add CAP_NET_ADMIN to the agent binary. This allows us to increase
          # network buffers which improves network transfer speeds.
          if [ -n "${USE_CAP_NET_ADMIN:-}" ]; then
            # If running as root, we do not need to do anything.
            if [ "$(id -u)" -eq 0 ]; then
              echo "Running as root, skipping setcap"
              # Warn the user if root does not have CAP_NET_ADMIN.
              if ! capsh --has-p=CAP_NET_ADMIN; then
                printnetadminmissing
              fi

            # If not running as root, make sure we have sudo perms and the "setcap" +
            # "capsh" binaries exist.
            elif sudo -nl && haslibcap2; then
              # Make sure the root user has CAP_NET_ADMIN.
              if sudo -n capsh --has-p=CAP_NET_ADMIN; then
                sudo -n setcap CAP_NET_ADMIN=+ep ./$BINARY_NAME || true
              else
                printnetadminmissing
              fi

            # If we are not running as root, can't sudo, and "setcap" does not exist, we
            # cannot do anything.
            else
              echo "Unable to setcap agent binary. To enable improved network performance, " + \
                "give the agent passwordless sudo permissions and the \"setcap\" + \"capsh\" binaries."
              echo "This has security implications. See https://man7.org/linux/man-pages/man7/capabilities.7.html"
            fi
          fi

          export CODER_AGENT_AUTH="token"
          export CODER_AGENT_URL="https://coder.company.com/"
          exec ./$BINARY_NAME agent

           EOT

          exec /sbin/init
        env:
        - name: CODER_AGENT_TOKEN
          value: XXXXX
        - name: SYSBOX_ALLOW_TRUSTED_XATTR
          value: "FALSE"
        image: us.gcr.io/XXX/docker-image-systemd
        imagePullPolicy: IfNotPresent
        name: coder-MYUSER-0
        resources:
          limits:
            cpu: "1"
            memory: 4Gi
          requests:
            cpu: "1"
            memory: 4Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /ws
          mountPropagation: None
          name: data
          subPath: workspaces
        - mountPath: /home
          mountPropagation: None
          name: data
          subPath: home
        - mountPath: /var/lib/docker
          mountPropagation: None
          name: data
          subPath: var/lib/docker
      dnsPolicy: ClusterFirst
      enableServiceLinks: true
      hostname: coder-MYUSER-0
      restartPolicy: Always
      runtimeClassName: sysbox-runc
      schedulerName: default-scheduler
      securityContext:
        fsGroup: 1000
        fsGroupChangePolicy: OnRootMismatch
        runAsNonRoot: false
        runAsUser: 0
      shareProcessNamespace: false
      terminationGracePeriodSeconds: 30
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: coder-e832bafe-2d57-4d56-8e53-a807a86d0869
Hi @raphaelfff,
Happy to help, though you should also reach out to Coder.
we have a mount for Docker backed by a PVC, sometimes when a pod restarts, /var/lib/docker ends up being owned by nobody:nogroup in the pod
What type of PVC is it?
Also, how does findmnt look inside the pod when things work and when they don't?
I ask because the PVC is bind-mounted into the Sysbox pod, and Sysbox uses "ID-mapped-mounts" or "shiftfs" (see here) on top of that bind-mount in order for the files to show up with proper ownership inside the rootless Sysbox container. If files show up as nobody:nogroup, it means the ID-mapping or shiftfs mounts are not taking effect.
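A quick way to check whether the mapping is taking effect on that particular mount (these are standard findmnt/ls invocations; the exact output will vary with your setup) is, from inside the pod:
$ findmnt -o TARGET,SOURCE,FSTYPE,OPTIONS /var/lib/docker
$ ls -lan /var/lib/docker
In the working case you should see either "idmapped" in the mount options or a shiftfs source, and ls -lan should show numeric owners like 0 (root) rather than 65534 (nobody).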
Restarting the pod a bunch of times ends up fixing the issue, but I have not been able to figure out why/how.
Interesting ... not sure what's going on. But if you can pin it to specific K8s nodes, that's a good clue.
Happy to help, though you should also reach out to Coder.
I think Coder is out of the picture here; it's a pure Sysbox problem IMO, Coder was just general context. It's something to do with Sysbox ID mapping and the ownership Docker sets on its files (something like that).
What type of PVC is it?
It's a GKE PD (Persistent Disk).
Also, how does findmnt look inside the pod when things work and when they don't?
Atm I don't have a broken env at hand... I will update when I have one; this issue was about starting the investigation... Do you have any command you would recommend running?
Okay, another workspace broke:
$ findmnt
TARGET SOURCE FSTYPE OPTIONS
/ overlay overlay rw,relatime,lowerdir=/var/lib/containers/storage/overlay/l/4G5KAIEXKOIRDK2Q2IM7PGZ3QX:/var/lib/containers/st
├─/run tmpfs tmpfs rw,nosuid,nodev,size=13168052k,nr_inodes=819200,mode=755,uid=165536,gid=165536,inode64
│ └─/run/lock tmpfs tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,uid=165536,gid=165536,inode64
├─/sys sysfs sysfs rw,nosuid,nodev,noexec,relatime
│ ├─/sys/firmware tmpfs tmpfs ro,relatime,uid=165536,gid=165536,inode64
│ ├─/sys/fs/cgroup cgroup cgroup2 rw,nosuid,nodev,noexec,relatime
│ ├─/sys/devices/virtual sysboxfs[/sys/devices/virtual] fuse rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other
│ ├─/sys/kernel sysboxfs[/sys/kernel] fuse rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other
│ └─/sys/module/nf_conntrack/parameters
│ sysboxfs[/sys/module/nf_conntrack/parameters] fuse rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other
├─/proc proc proc rw,nosuid,nodev,noexec,relatime
│ ├─/proc/bus proc[/bus] proc ro,nosuid,nodev,noexec,relatime
│ ├─/proc/fs proc[/fs] proc ro,nosuid,nodev,noexec,relatime
│ ├─/proc/irq proc[/irq] proc ro,nosuid,nodev,noexec,relatime
│ ├─/proc/sysrq-trigger proc[/sysrq-trigger] proc ro,nosuid,nodev,noexec,relatime
│ ├─/proc/acpi tmpfs tmpfs ro,relatime,uid=165536,gid=165536,inode64
│ ├─/proc/keys devtmpfs[/null] devtmpfs rw,relatime,size=32915868k,nr_inodes=8228967,mode=755,inode64
│ ├─/proc/timer_list devtmpfs[/null] devtmpfs rw,relatime,size=32915868k,nr_inodes=8228967,mode=755,inode64
│ ├─/proc/scsi tmpfs tmpfs ro,relatime,uid=165536,gid=165536,inode64
│ ├─/proc/swaps sysboxfs[/proc/swaps] fuse rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other
│ ├─/proc/sys sysboxfs[/proc/sys] fuse rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other
│ └─/proc/uptime sysboxfs[/proc/uptime] fuse rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other
├─/dev tmpfs tmpfs rw,nosuid,size=65536k,mode=755,uid=165536,gid=165536,inode64
│ ├─/dev/mqueue mqueue mqueue rw,nosuid,nodev,noexec,relatime
│ ├─/dev/pts devpts devpts rw,nosuid,noexec,relatime,gid=165541,mode=620,ptmxmode=666
│ ├─/dev/null devtmpfs[/null] devtmpfs rw,relatime,size=32915868k,nr_inodes=8228967,mode=755,inode64
│ ├─/dev/random devtmpfs[/random] devtmpfs rw,relatime,size=32915868k,nr_inodes=8228967,mode=755,inode64
│ ├─/dev/kmsg devtmpfs[/null] devtmpfs rw,relatime,size=32915868k,nr_inodes=8228967,mode=755,inode64
│ ├─/dev/shm shm tmpfs rw,nosuid,nodev,noexec,relatime,size=65536k,inode64
│ ├─/dev/termination-log /dev/root[/var/lib/kubelet/pods/74b5abbb-e331-4002-8640-2018979ba168/containers/coder/eb185c4a]
│ │ ext4 rw,relatime,idmapped,discard,errors=remount-ro
│ ├─/dev/full devtmpfs[/full] devtmpfs rw,relatime,size=32915868k,nr_inodes=8228967,mode=755,inode64
│ ├─/dev/tty devtmpfs[/tty] devtmpfs rw,relatime,size=32915868k,nr_inodes=8228967,mode=755,inode64
│ ├─/dev/zero devtmpfs[/zero] devtmpfs rw,relatime,size=32915868k,nr_inodes=8228967,mode=755,inode64
│ └─/dev/urandom devtmpfs[/urandom] devtmpfs rw,relatime,size=32915868k,nr_inodes=8228967,mode=755,inode64
├─/etc/resolv.conf /var/lib/sysbox/shiftfs/d130821a-5714-4862-8fa5-41ce3be80f56[/resolv.conf]
│ shiftfs rw,nosuid,nodev,noexec,relatime
├─/etc/hostname /var/lib/sysbox/shiftfs/d130821a-5714-4862-8fa5-41ce3be80f56[/hostname]
│ shiftfs rw,relatime
├─/run/.containerenv /var/lib/sysbox/shiftfs/d130821a-5714-4862-8fa5-41ce3be80f56[/.containerenv]
│ shiftfs rw,relatime
├─/var/lib/docker /dev/sdb[/var/lib/docker] ext4 rw,relatime
├─/ws /dev/sdb[/workspaces] ext4 rw,relatime,idmapped
├─/home /dev/sdb[/home] ext4 rw,relatime,idmapped
├─/etc/hosts /dev/root[/var/lib/kubelet/pods/74b5abbb-e331-4002-8640-2018979ba168/etc-hosts]
│ ext4 rw,relatime,idmapped,discard,errors=remount-ro
├─/run/secrets/kubernetes.io/serviceaccount
│ /var/lib/sysbox/shiftfs/ba472717-3446-4b24-9d37-6530a72a68a3 shiftfs ro,relatime
├─/var/lib/k0s /dev/root[/var/lib/sysbox/k0s/2e900d261762e1e66a1ef8be87bed70c6767097fb145311cf4fc095d278e6ded]
│ ext4 rw,relatime,discard,errors=remount-ro
├─/var/lib/buildkit /dev/root[/var/lib/sysbox/buildkit/2e900d261762e1e66a1ef8be87bed70c6767097fb145311cf4fc095d278e6ded]
│ ext4 rw,relatime,discard,errors=remount-ro
├─/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs
│ /dev/root[/var/lib/sysbox/containerd/2e900d261762e1e66a1ef8be87bed70c6767097fb145311cf4fc095d278e6ded]
│ ext4 rw,relatime,discard,errors=remount-ro
├─/var/lib/rancher/k3s /dev/root[/var/lib/sysbox/rancher-k3s/2e900d261762e1e66a1ef8be87bed70c6767097fb145311cf4fc095d278e6ded]
│ ext4 rw,relatime,discard,errors=remount-ro
├─/var/lib/rancher/rke2 /dev/root[/var/lib/sysbox/rancher-rke2/2e900d261762e1e66a1ef8be87bed70c6767097fb145311cf4fc095d278e6ded]
│ ext4 rw,relatime,discard,errors=remount-ro
├─/var/lib/kubelet /dev/root[/var/lib/sysbox/kubelet/2e900d261762e1e66a1ef8be87bed70c6767097fb145311cf4fc095d278e6ded]
│ ext4 rw,relatime,discard,errors=remount-ro
├─/usr/src/linux-headers-5.15.0-1054-gke
│ /dev/root[/usr/src/linux-headers-5.15.0-1054-gke] ext4 ro,relatime,idmapped,discard,errors=remount-ro
├─/usr/src/linux-gke-headers-5.15.0-1054
│ /dev/root[/usr/src/linux-gke-headers-5.15.0-1054] ext4 ro,relatime,idmapped,discard,errors=remount-ro
└─/usr/lib/modules/5.15.0-1054-gke /dev/root[/usr/lib/modules/5.15.0-1054-gke] ext4 ro,relatime,idmapped,discard,errors=remount-ro
Bump @ctalledo, what is the next step?
Hi @raphaelfff,
So per your description above, this is the mount that is showing up with nobody:nogroup, correct?
├─/var/lib/docker /dev/sdb[/var/lib/docker] ext4 rw,relatime
And I can see in the pod.yaml that it's backed by a PVC.
Don't know exactly why it's showing up with nobody:nogroup (as opposed to root:root), but let me provide a bit of background to see if we can solve it.
When a Sysbox container starts, it maps the root user in the container to an unprivileged user at host level (e.g., 0 -> 100000). Furthermore, when Sysbox sees that the container has a bind-mount of a host dir into the container's /var/lib/docker, it will try to either ID-map or else chown the contents of that host dir, such that they show up with proper ownership (e.g., root:root) inside the container. When the container stops, Sysbox reverts the operation (i.e., removes the ID-map, or chowns back).
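For example, you can see the container's user-namespace mapping from inside the pod; the columns are container ID, host ID, and range (the host ID below is illustrative, though it matches the 165536 visible in your findmnt output):
$ cat /proc/self/uid_map
         0     165536      65536
Any file whose host-level owner falls outside that mapped range shows up as nobody:nogroup inside the container, which is consistent with what you're seeing.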
This process must be failing somehow. In the past, before ID-map mounts were supported in the kernel, Sysbox would use chown and the process would sometimes fail (or be too slow) if the host dir had too many files (which is sometimes the case on /var/lib/docker mounts).
With ID-mapped mounts it's much better (no more chowning), but it requires "overlayfs-over-ID-mapped-mounts" support, which landed in kernel 5.19+.
Questions:
- What kernel version do your K8s nodes have? Ideally they would be 5.19+.
- Can you provide the output of the sysbox-mgr log (journalctl -u sysbox-mgr)? If that log shows "shifting uids at ...", then it's using the chown operation instead of ID-mapped mounts, which could point to the problem (a quick way to check is sketched below).
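Something along these lines should surface the relevant entries (the exact log wording may differ across Sysbox versions, so treat the pattern as approximate):
$ journalctl -u sysbox-mgr | grep -iE "shifting uids|idmap|shiftfs"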
Also: make sure the host dir (PVC) that is mounted into the pod's /var/lib/docker is only mounted into one such pod at a time (i.e., /var/lib/docker can't be shared simultaneously by multiple pods with docker engines inside).
That's all that comes to mind.
Again, I think Coder should be helping you here too (even if it turns out to be a Sysbox problem).
Thanks for your answer.
Here are some more deets:
$ uname -r
5.15.0-1054-gke
So that means it's not actually benefiting from ID-mapped mounts...
- Logs are gone; I'll provide them when the issue happens again.
One question: since /var/lib/docker is on a volume, it may be mounted on node A on day one, and node B on day two. If the chown failed, I guess when starting on node B it would have the wrong UID/GID, and that could cause the nobody?
The mount is ReadWriteOnce and the strategy is set to Recreate; that should mean only a single pod would be able to read/write.
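For reference, this is how I check it (claim name taken from the pod spec above):
$ kubectl get pvc coder-e832bafe-2d57-4d56-8e53-a807a86d0869 -o jsonpath='{.spec.accessModes}'
["ReadWriteOnce"]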
Hi @raphaelfff,
So that means it's not actually benefiting from ID-mapped mounts...
Correct, at least not for the volumes mounted at /var/lib/docker. That means it must be using chown, which is not ideal (it can be quite slow depending on the size of /var/lib/docker, which can grow to several GBs over time; and if it takes too long, the pod start or stop can time out and then we end up with inconsistent user/group-IDs in the files ... not good).
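A quick way to gauge whether you're in that territory (plain du plus Docker's own accounting, run inside the workspace):
$ du -sh /var/lib/docker
$ docker system df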
One question: since /var/lib/docker is on a volume, it may be mounted on node A on day one, and node B on day two. If the chown failed, I guess when starting on node B it would have the wrong UID/GID, and that could cause the nobody?
That's exactly right ... which is why chown is not a good solution (though it was the only one before ID-mapped-mounts on overlayfs appeared in kernel 5.19+).
Sounds like you need K8s nodes with kernel 5.19+ in order for this to work reliably. With ID-mapped mounts, the "chown" is basically instant (it's done via user-ID/group-ID mapping in the kernel), so it works much better.
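To see at a glance whether any of your nodes already run a new enough kernel, something like this should work:
$ kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion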
The mount is ReadWriteOnce and the strategy is set to Recreate; that should mean only a single pod would be able to read/write.
OK that's perfect.
I'm gonna have to wait for GKE to upgrade their Ubuntu containerd image kernel... Can you think of a way this nobody issue can be fixed manually? (Mounting the volume into another pod and running some chown?)
Hi @raphaelfff,
Apologies for the belated response.
I'm gonna have to wait for GKE to upgrade their Ubuntu containerd image kernel...
Yes, I am afraid that's the only option; I am amazed GKE is still on kernel 5.15 when Linux is at 6.9 already (!). You would think they would at least offer an option to customize the kernel, but there isn't an easy one as far as I can tell.
Can you think of a way this nobody issue can be fixed manually? (Mounting the volume into another pod and running some chown?)
The problem with chown failing usually occurs when the contents of /var/lib/docker grow to several GBs. So maybe a poor man's solution is to keep that below 1 GB by running docker system prune periodically(?).
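For example, a root cron entry inside the workspace along these lines (the daily schedule and the 7-day "until" filter are just placeholders to illustrate the idea):
# /etc/cron.d/docker-prune: prune unused Docker data (images, containers, build cache) daily at 03:00
0 3 * * * root docker system prune -af --filter "until=168h" >>/var/log/docker-prune.log 2>&1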