EPERM returned during inner-container initialization with tmpfs bind-mounts
The integration testcase shown further below fails when executed as part of the test-shell-systemd makefile target. With the proper adjustments, the problem can also be reproduced in similar setups (see below for more details). However, two conditions must be satisfied for the problem to occur:
- We must be in userns-remap mode (shiftless).
- The affected mount resources (see below) must initially be mounted over a tmpfs file-system within the sys-container.
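To check the second condition in a given setup, the backing file-system of a path can be looked up in /proc/self/mountinfo. The following is a minimal sketch, not taken from sysbox or runc: the helper name fsTypeFor is hypothetical, and /tmp/chrootdir is simply the path used by this testcase. It ignores the octal escaping that mountinfo applies to mount points containing spaces.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// fsTypeFor returns the file-system type of the mount that contains path,
// given the text of a mountinfo file (format described in proc(5)).
// It selects the longest mount point that is a path prefix of path.
// Hypothetical helper for illustration; not part of sysbox or runc.
func fsTypeFor(mountinfo, path string) string {
	best, fsType := "", ""
	for _, line := range strings.Split(mountinfo, "\n") {
		// Fields before " - " describe the mount; fields after it
		// describe the file-system (type, source, super options).
		parts := strings.SplitN(line, " - ", 2)
		if len(parts) != 2 {
			continue
		}
		pre, post := strings.Fields(parts[0]), strings.Fields(parts[1])
		if len(pre) < 5 || len(post) < 1 {
			continue
		}
		mnt := pre[4] // fifth field is the mount point
		covers := mnt == "/" || path == mnt || strings.HasPrefix(path, mnt+"/")
		// ">=" lets a later entry stacked on the same mount point win.
		if covers && len(mnt) >= len(best) {
			best, fsType = mnt, post[0]
		}
	}
	return fsType
}

func main() {
	data, err := os.ReadFile("/proc/self/mountinfo")
	if err != nil {
		fmt.Println("cannot read mountinfo:", err)
		return
	}
	// Prints "tmpfs" when the directory lives on a tmpfs mount.
	fmt.Println(fsTypeFor(string(data), "/tmp/chrootdir"))
}
```

Running this inside the sysbox-test container and inside the sys container should show whether /tmp/chrootdir is tmpfs-backed at each level.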
The failing testcase in question validates the proper operation of the mount-hardening feature in scenarios with inner containers [ scenario-7: unshare() + pivot-root() ].
As part of the setup, this testcase creates a series of files/folders in the sysbox-test priv container, which are bind-mounted into the L1 sys-container, and then ultimately bind-mounted into the inner (L2) container too.
See the list of bind-mounted resources as seen from the sysbox-test container:
root@sysbox-test:~/nestybox/sysbox# ls -lrt /tmp/chrootdir/
total 0
drwxr-xr-x 2 root root 40 Mar 1 19:49 ro_dir
drwxr-xr-x 2 root root 40 Mar 1 19:49 masked_dir
-rw-r--r-- 1 root root 0 Mar 1 19:49 ro_file
-rw-r--r-- 1 root root 0 Mar 1 19:49 masked_file
See again the list of bind-mounted resources but this time as seen from within the sys container:
root@b3a645ad6b53:/# ls -lrt /tmp/chrootdir/
total 0
crw-rw-rw- 1 root root 1, 3 Mar 1 19:52 masked_file
drwxr-xr-x 2 nobody nogroup 40 Mar 1 19:54 ro_dir
drwxrwxrwt 2 root root 40 Mar 1 19:54 masked_dir
-rw-r--r-- 1 nobody nogroup 0 Mar 1 19:54 ro_file
See below the output dumped by the failing testcase right after launching the inner container:
root@sysbox-test:~/nestybox/sysbox# bats -t tests/syscall/mount/mount-immutables-unshare-pivot.bats
1..11
not ok 1 immutable mount *can* be unmounted -- unshare(mnt) + pivot()
# (in test file tests/syscall/mount/mount-immutables-unshare-pivot.bats, line 61)
# `[ "$status" -eq 0 ]' failed
# docker run --runtime=sysbox-runc -d --rm -v /tmp/chrootdir/ro_dir:/tmp/chrootdir/ro_dir:ro -v /tmp/chrootdir/ro_file:/tmp/chrootdir/ro_file:ro --mount type=tmpfs,destination=/tmp/chrootdir/masked_dir -v /dev/null:/tmp/chrootdir/masked_file ghcr.io/nestybox/ubuntu-bionic-docker-dbg tail -f /dev/null (status=0):
# 014a1a6d4ffa61a1d32a3f9d535537d4a5759b7707291b3917342ec956bd161d
# docker ps --format {{.ID}} (status=0):
# 014a1a6d4ffa
# docker exec -d 014a1a6d4ffa sh -c dockerd > /var/log/dockerd.log 2>&1 (status=0):
#
# docker exec 014a1a6d4ffa sh -c docker run --privileged -d --name inner -v /tmp/chrootdir/ro_dir:/tmp/chrootdir/ro_dir:ro -v /tmp/chrootdir/ro_file:/tmp/chrootdir/ro_file:ro --mount type=tmpfs,destination=/tmp/chrootdir/masked_dir -v /dev/null:/tmp/chrootdir/masked_file ghcr.io/nestybox/ubuntu:latest tail -f /dev/null (status=125):
# Unable to find image 'ghcr.io/nestybox/ubuntu:latest' locally
# latest: Pulling from nestybox/ubuntu
# 83ee3a23efb7: Pulling fs layer
# db98fc6f11f0: Pulling fs layer
# f611acd52c6c: Pulling fs layer
# db98fc6f11f0: Verifying Checksum
# db98fc6f11f0: Download complete
# f611acd52c6c: Verifying Checksum
# f611acd52c6c: Download complete
# 83ee3a23efb7: Download complete
# 83ee3a23efb7: Pull complete
# db98fc6f11f0: Pull complete
# f611acd52c6c: Pull complete
# Digest: sha256:3093096ee188f8ff4531949b8f6115af4747ec1c58858c091c8cb4579c39cc4e
# Status: Downloaded newer image for ghcr.io/nestybox/ubuntu:latest
# 1fd0afd79dae61d0e2004161bea0793c4f52491353ed847ba0ee9005aadd3593
# docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"/tmp/chrootdir/ro_dir\\\" to rootfs \\\"/var/lib/docker/overlay2/548ecfc97977d639dbfd71fde3e9295da16d18cdada6a529b8d95a18a7c9378c/merged\\\" at \\\"/var/lib/docker/overlay2/548ecfc97977d639dbfd71fde3e9295da16d18cdada6a529b8d95a18a7c9378c/merged/tmp/chrootdir/ro_dir\\\" caused \\\"operation not permitted\\\"\"": unknown.
root@sysbox-test:~/nestybox/sysbox#
The failure in the inner OCI runc happens during the "ro_dir" bind-mount creation, as part of the L2 container initialization, specifically during the remount() call:
    case "cgroup":
        if cgroups.IsCgroup2UnifiedMode() {
            if err := mountCgroupV2(m, rootfs, mountLabel, enableCgroupns); err != nil {
                return err
            }
        } else {
            if err := mountCgroupV1(m, rootfs, mountLabel, enableCgroupns); err != nil {
                return err
            }
        }
        if m.Flags&unix.MS_RDONLY != 0 {
            // remount cgroup root as readonly
            mcgrouproot := &configs.Mount{
                Source:      m.Destination,
                Device:      "bind",
                Destination: m.Destination,
                Flags:       defaultMountFlags | unix.MS_RDONLY | unix.MS_BIND,
            }
            if err := remount(mcgrouproot, rootfs); err != nil {
                return err
            }
        }
    func remount(m *configs.Mount, rootfs string) error {
        var (
            dest = m.Destination
        )
        if !strings.HasPrefix(dest, rootfs) {
            dest = filepath.Join(rootfs, dest)
        }
        // <<< the EPERM is returned by this mount() call
        return unix.Mount(m.Source, dest, m.Device, uintptr(m.Flags|unix.MS_REMOUNT), "")
    }
Details of the oci runc reproducing the problem:
# runc -v
runc version 1.0.0-rc10
commit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
spec: 1.0.1-dev
Two important aspects to highlight:
- The EPERM error is not generated by sysbox-fs: the operation being processed is a RO remount of "/tmp/chrootdir/ro_dir", which sysbox-fs' mount-hardening logic skips. The error is returned by the kernel itself once sysbox-fs lets the remount() call through.
- The problem is not reproduced when the folder (/tmp) hosting the resources that are bind-mounted into the sys-container and the inner container is a regular folder on the priv-container's rootfs, i.e. when it is not a tmpfs resource.