rootless: support Google Container-Optimized OS (Fix ` Options:[rbind ro]}]: operation not permitted` errors)
The Dockerfile already declares VOLUME /home/user/.local/share/buildkit by default, but that default VOLUME does not work with rootless mode on Google's Container-Optimized OS because it is mounted with nosuid,nodev.
So the volume has to be mounted explicitly as an emptyDir volume.
Tested with GKE Autopilot 1.24.3-gke.200 (kernel 5.10.123+, containerd 1.6.6).
Fix #879
Thanks to Andrew Grigorev (@ei-grad) and Ben Cressey (@bcressey).
Interesting! This doesn't actually fix the issue on Bottlerocket; the emptyDir mount still has the problematic nosuid,nodev flags:
/dev/nvme1n1p1 on /home/user/.local/share/buildkit type ext4 (rw,seclabel,nosuid,nodev,noatime)
It's great that it works on GKE and GCOS though. I wonder if it's because the backing directory for emptyDir mounts there is a bind mount that's been remounted with dev,suid. Rather than a change here, that would point to the need for a corresponding fix in Bottlerocket so this works as expected.
@AkihiroSuda any chance you could check your GCOS host (via findmnt -o target,vfs-options or mount) to see if either /var/lib/kubelet or the pod-specific kubernetes.io~empty-dir volume is mounted with different options?
> This doesn't actually fix the issue on Bottlerocket; the emptyDir mount still has the problematic nosuid,nodev flags:
> /dev/nvme1n1p1 on /home/user/.local/share/buildkit type ext4 (rw,seclabel,nosuid,nodev,noatime)
Thanks for the info 👀 , removed Bottlerocket from the PR description.
> any chance you could check your GCOS host (via findmnt -o target,vfs-options or mount) to see if either /var/lib/kubelet or the pod-specific kubernetes.io~empty-dir volume is mounted with different options?
With emptyDir: /dev/sda1 on /home/user/.local/share/buildkit type ext4 (rw,relatime,commit=30)
$ kubectl exec buildkitd -- mount
W0909 17:28:16.661963 2250 gcp.go:119] WARNING: the gcp auth plugin is deprecated in v1.22+, unavailable in v1.26+; use gcloud instead.
To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
overlay on / type overlay (rw,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/208/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/207/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/206/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/205/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/204/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/203/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/209/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/209/work)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev type tmpfs (rw,nosuid,size=65536k,mode=755)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
sysfs on /sys type sysfs (ro,nosuid,nodev,noexec,relatime)
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (ro,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/blkio type cgroup (ro,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (ro,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (ro,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/pids type cgroup (ro,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/freezer type cgroup (ro,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/memory type cgroup (ro,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/devices type cgroup (ro,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/rdma type cgroup (ro,nosuid,nodev,noexec,relatime,rdma)
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)
/dev/sda1 on /etc/hosts type ext4 (rw,relatime,commit=30)
/dev/sda1 on /dev/termination-log type ext4 (rw,relatime,commit=30)
/dev/sda1 on /etc/hostname type ext4 (rw,nosuid,nodev,relatime,commit=30)
/dev/sda1 on /etc/resolv.conf type ext4 (rw,nosuid,nodev,relatime,commit=30)
tmpfs on /run/secrets/kubernetes.io/serviceaccount type tmpfs (ro,relatime,size=2097152k)
/dev/sda1 on /home/user/.local/share/buildkit type ext4 (rw,relatime,commit=30)
proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/fs type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/irq type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sysrq-trigger type proc (ro,nosuid,nodev,noexec,relatime)
tmpfs on /proc/acpi type tmpfs (ro,relatime)
tmpfs on /proc/kcore type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/keys type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/timer_list type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/scsi type tmpfs (ro,relatime)
tmpfs on /sys/firmware type tmpfs (ro,relatime)
Without emptyDir: /dev/sda1 on /home/user/.local/share/buildkit type ext4 (rw,nosuid,nodev,relatime,commit=30)
$ kubectl exec buildkitd-bad -- mount
W0909 17:31:03.192574 2257 gcp.go:119] WARNING: the gcp auth plugin is deprecated in v1.22+, unavailable in v1.26+; use gcloud instead.
To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
overlay on / type overlay (rw,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/210/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/209/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/208/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/207/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/206/fs:/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/205/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/211/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/211/work)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev type tmpfs (rw,nosuid,size=65536k,mode=755)
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666)
mqueue on /dev/mqueue type mqueue (rw,nosuid,nodev,noexec,relatime)
sysfs on /sys type sysfs (ro,nosuid,nodev,noexec,relatime)
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (ro,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/freezer type cgroup (ro,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (ro,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (ro,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/devices type cgroup (ro,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/rdma type cgroup (ro,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/memory type cgroup (ro,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/blkio type cgroup (ro,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/pids type cgroup (ro,nosuid,nodev,noexec,relatime,pids)
/dev/sda1 on /etc/hosts type ext4 (rw,relatime,commit=30)
/dev/sda1 on /dev/termination-log type ext4 (rw,relatime,commit=30)
/dev/sda1 on /etc/hostname type ext4 (rw,nosuid,nodev,relatime,commit=30)
/dev/sda1 on /etc/resolv.conf type ext4 (rw,nosuid,nodev,relatime,commit=30)
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)
tmpfs on /run/secrets/kubernetes.io/serviceaccount type tmpfs (ro,relatime,size=2097152k)
/dev/sda1 on /home/user/.local/share/buildkit type ext4 (rw,nosuid,nodev,relatime,commit=30)
proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/fs type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/irq type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sys type proc (ro,nosuid,nodev,noexec,relatime)
proc on /proc/sysrq-trigger type proc (ro,nosuid,nodev,noexec,relatime)
tmpfs on /proc/acpi type tmpfs (ro,relatime)
tmpfs on /proc/kcore type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/keys type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/timer_list type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /proc/scsi type tmpfs (ro,relatime)
tmpfs on /sys/firmware type tmpfs (ro,relatime)
For the long-term solution, we will have to copy this logic into containerd's mount package:
https://github.com/moby/moby/blob/v20.10.17/daemon/oci_linux.go#L420-L470
// Get the set of mount flags that are set on the mount that contains the given
// path and are locked by CL_UNPRIVILEGED. This is necessary to ensure that
// bind-mounting "with options" will not fail with user namespaces, due to
// kernel restrictions that require user namespace mounts to preserve
// CL_UNPRIVILEGED locked flags.
func getUnprivilegedMountFlags(path string) ([]string, error) {
Can we merge this? (w/ https://github.com/docker/buildx/pull/1310)