EPERM returned during inner-container initialization with tmpfs bind-mounts

Open rodnymolina opened this issue 4 years ago • 0 comments

The integration testcase shown further below fails when executed as part of the test-shell-systemd makefile target. With the proper adjustments, the problem can also be reproduced in similar setups (see below for more details). However, two conditions must be satisfied for the problem to occur:

  • We must be in userns-remap mode (shiftless).
  • The affected mount resources (see below) must initially be mounted over a tmpfs filesystem within the sys-container.
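
The second condition can be checked from inside the sys-container by looking up which mount backs a given path. The Go sketch below is illustrative only and is not part of Sysbox or its test suite: `fsTypeFor` is a hypothetical helper, and it assumes the /proc/self/mountinfo layout described in proc(5) (mount point in the fifth field, filesystem type right after the " - " separator).

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// fsTypeFor returns the filesystem type of the mount that contains path,
// given the text of a mountinfo file. Hypothetical helper for illustration.
// Escaped characters in mount points (e.g. \040 for a space) are not decoded.
func fsTypeFor(mountinfo, path string) (string, bool) {
	best, fstype := "", ""
	for _, line := range strings.Split(mountinfo, "\n") {
		// Fields before " - " describe the mount, fields after it the filesystem.
		parts := strings.SplitN(line, " - ", 2)
		if len(parts) != 2 {
			continue
		}
		left := strings.Fields(parts[0])
		right := strings.Fields(parts[1])
		if len(left) < 5 || len(right) < 1 {
			continue
		}
		mp := left[4]
		covers := mp == "/" || path == mp || strings.HasPrefix(path, mp+"/")
		// Prefer the longest mount point; on ties the later (topmost) mount wins.
		if covers && len(mp) >= len(best) {
			best, fstype = mp, right[0]
		}
	}
	return fstype, best != ""
}

func main() {
	path := "/tmp/chrootdir/ro_dir" // path taken from this report
	if len(os.Args) > 1 {
		path = os.Args[1]
	}
	data, err := os.ReadFile("/proc/self/mountinfo")
	if err != nil {
		fmt.Println("cannot read mountinfo:", err)
		return
	}
	if fs, ok := fsTypeFor(string(data), path); ok {
		fmt.Printf("%s is backed by %s\n", path, fs)
	} else {
		fmt.Printf("no mount found for %s\n", path)
	}
}
```

Comparing the reported type in the failing setup against a setup where /tmp is a regular folder on the rootfs would show whether the tmpfs condition is in play.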

The failing testcase in question validates the proper operation of the mount-hardening feature in scenarios with inner containers [ scenario-7: unshare() + pivot-root() ].

As part of the setup, this testcase creates a series of files/folders in the sysbox-test priv container, which are bind-mounted into the L1 sys-container, and then ultimately bind-mounted into the inner (L2) container too.

See the list of bind-mounted resources as seen from the sysbox-test container:

root@sysbox-test:~/nestybox/sysbox# ls -lrt /tmp/chrootdir/
total 0
drwxr-xr-x 2 root root 40 Mar  1 19:49 ro_dir
drwxr-xr-x 2 root root 40 Mar  1 19:49 masked_dir
-rw-r--r-- 1 root root  0 Mar  1 19:49 ro_file
-rw-r--r-- 1 root root  0 Mar  1 19:49 masked_file

See again the list of bind-mounted resources but this time as seen from within the sys container:

root@b3a645ad6b53:/# ls -lrt /tmp/chrootdir/
total 0
crw-rw-rw- 1 root   root    1, 3 Mar  1 19:52 masked_file
drwxr-xr-x 2 nobody nogroup   40 Mar  1 19:54 ro_dir
drwxrwxrwt 2 root   root      40 Mar  1 19:54 masked_dir
-rw-r--r-- 1 nobody nogroup    0 Mar  1 19:54 ro_file

See below the output dumped by the failing testcase right after launching the inner container:

root@sysbox-test:~/nestybox/sysbox# bats -t tests/syscall/mount/mount-immutables-unshare-pivot.bats
1..11
not ok 1 immutable mount *can* be unmounted -- unshare(mnt) + pivot()
# (in test file tests/syscall/mount/mount-immutables-unshare-pivot.bats, line 61)
#   `[ "$status" -eq 0 ]' failed
# docker run --runtime=sysbox-runc -d --rm -v /tmp/chrootdir/ro_dir:/tmp/chrootdir/ro_dir:ro -v /tmp/chrootdir/ro_file:/tmp/chrootdir/ro_file:ro --mount type=tmpfs,destination=/tmp/chrootdir/masked_dir -v /dev/null:/tmp/chrootdir/masked_file ghcr.io/nestybox/ubuntu-bionic-docker-dbg tail -f /dev/null (status=0):
# 014a1a6d4ffa61a1d32a3f9d535537d4a5759b7707291b3917342ec956bd161d
# docker ps --format {{.ID}} (status=0):
# 014a1a6d4ffa
# docker exec -d 014a1a6d4ffa sh -c dockerd > /var/log/dockerd.log 2>&1 (status=0):
#
# docker exec 014a1a6d4ffa sh -c docker run --privileged -d --name inner -v /tmp/chrootdir/ro_dir:/tmp/chrootdir/ro_dir:ro -v /tmp/chrootdir/ro_file:/tmp/chrootdir/ro_file:ro --mount type=tmpfs,destination=/tmp/chrootdir/masked_dir -v /dev/null:/tmp/chrootdir/masked_file ghcr.io/nestybox/ubuntu:latest tail -f /dev/null (status=125):
# Unable to find image 'ghcr.io/nestybox/ubuntu:latest' locally
# latest: Pulling from nestybox/ubuntu
# 83ee3a23efb7: Pulling fs layer
# db98fc6f11f0: Pulling fs layer
# f611acd52c6c: Pulling fs layer
# db98fc6f11f0: Verifying Checksum
# db98fc6f11f0: Download complete
# f611acd52c6c: Verifying Checksum
# f611acd52c6c: Download complete
# 83ee3a23efb7: Download complete
# 83ee3a23efb7: Pull complete
# db98fc6f11f0: Pull complete
# f611acd52c6c: Pull complete
# Digest: sha256:3093096ee188f8ff4531949b8f6115af4747ec1c58858c091c8cb4579c39cc4e
# Status: Downloaded newer image for ghcr.io/nestybox/ubuntu:latest
# 1fd0afd79dae61d0e2004161bea0793c4f52491353ed847ba0ee9005aadd3593
# docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"rootfs_linux.go:58: mounting \\\"/tmp/chrootdir/ro_dir\\\" to rootfs \\\"/var/lib/docker/overlay2/548ecfc97977d639dbfd71fde3e9295da16d18cdada6a529b8d95a18a7c9378c/merged\\\" at \\\"/var/lib/docker/overlay2/548ecfc97977d639dbfd71fde3e9295da16d18cdada6a529b8d95a18a7c9378c/merged/tmp/chrootdir/ro_dir\\\" caused \\\"operation not permitted\\\"\"": unknown.
root@sysbox-test:~/nestybox/sysbox#

The failure in the inner OCI runc happens while creating the "ro_dir" bind-mount during L2 container initialization, specifically in the remount() call:

        case "cgroup":
                if cgroups.IsCgroup2UnifiedMode() {
                        if err := mountCgroupV2(m, rootfs, mountLabel, enableCgroupns); err != nil {
                                return err
                        }
                } else {
                        if err := mountCgroupV1(m, rootfs, mountLabel, enableCgroupns); err != nil {
                                return err
                        }
                }
                if m.Flags&unix.MS_RDONLY != 0 {
                        // remount cgroup root as readonly
                        mcgrouproot := &configs.Mount{
                                Source:      m.Destination,
                                Device:      "bind",
                                Destination: m.Destination,
                                Flags:       defaultMountFlags | unix.MS_RDONLY | unix.MS_BIND,
                        }
                        if err := remount(mcgrouproot, rootfs); err != nil {
                                return err
                        }
                }

func remount(m *configs.Mount, rootfs string) error {
        var (
                dest = m.Destination
        )
        if !strings.HasPrefix(dest, rootfs) {
                dest = filepath.Join(rootfs, dest)
        }
        // <<< the EPERM is returned by this call
        return unix.Mount(m.Source, dest, m.Device, uintptr(m.Flags|unix.MS_REMOUNT), "")
}

Details of the oci runc reproducing the problem:

# runc -v
runc version 1.0.0-rc10
commit: dc9208a3303feef5b3839f4323d9beb36df0a9dd
spec: 1.0.1-dev

Two important aspects to highlight:

  • The EPERM error is not generated by sysbox-fs: we are processing a read-only remount operation ("/tmp/chrootdir/ro_dir"), which sysbox-fs' mount-hardening logic skips. The error is returned by the kernel itself once sysbox-fs lets the remount() call through.

  • The problem does not reproduce when the folder (/tmp) hosting the resources that are bind-mounted into the sys-container and the inner container is a regular folder on the priv-container's rootfs, i.e. when it is not a tmpfs resource.
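
Since the EPERM comes from the kernel and only shows up when the source lives on tmpfs, one thing worth checking (this report does not confirm it as the cause) is whether the remount would clear mount flags that are locked for the container's user namespace. Per mount_namespaces(7), flags such as nosuid, nodev, noexec and ro can be locked on mounts inherited from a more privileged namespace, and an attempt to clear them fails with EPERM. The Go sketch below is a hypothetical diagnostic, not code from runc or Sysbox: `clearedFlags` and `withExisting` are made-up names, the flag values are the Linux MS_* constants spelled out by hand, and the "existing" flags in `main` are an assumed example rather than values observed in this setup.

```go
package main

import "fmt"

// Linux mount flag values (MS_* in <sys/mount.h>), written out so the
// sketch does not depend on a platform-specific package.
const (
	msRdonly  = 0x1
	msNosuid  = 0x2
	msNodev   = 0x4
	msNoexec  = 0x8
	msRemount = 0x20
	msBind    = 0x1000
)

// Flags that can be locked across user namespaces (atime flags omitted).
var lockable = []struct {
	name string
	bit  uint64
}{
	{"ro", msRdonly},
	{"nosuid", msNosuid},
	{"nodev", msNodev},
	{"noexec", msNoexec},
}

// clearedFlags lists the lockable flags that are set on the existing mount
// but absent from the flags requested for the remount.
func clearedFlags(existing, requested uint64) []string {
	var out []string
	for _, f := range lockable {
		if existing&f.bit != 0 && requested&f.bit == 0 {
			out = append(out, f.name)
		}
	}
	return out
}

// withExisting returns the requested flags plus every lockable flag that is
// already set on the mount, so that the remount clears none of them.
func withExisting(existing, requested uint64) uint64 {
	for _, f := range lockable {
		requested |= existing & f.bit
	}
	return requested
}

func main() {
	existing := uint64(msNosuid | msNodev)             // assumed example, not measured
	requested := uint64(msBind | msRdonly | msRemount) // read-only bind remount
	fmt.Println("flags the remount would clear:", clearedFlags(existing, requested))
	fmt.Printf("flags preserving existing ones: %#x\n", withExisting(existing, requested))
}
```

In a real check, the existing flags would come from statfs(2) on the mount point (the f_flags field, whose ST_RDONLY, ST_NOSUID, ST_NODEV and ST_NOEXEC bits share the values used above). An empty list from `clearedFlags` would rule this explanation out.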

rodnymolina, Mar 02 '21 08:03