runc

EPERM mounting sysfs with rootless/userns container

Open maleadt opened this issue 3 years ago • 10 comments

I'm trying out runc to get simple unprivileged containerized execution, but I'm having issues mounting sysfs:

"mounts": [
    {
        "destination": "/sys",
        "type": "sysfs",
        "source": "sysfs",
        "options": [
            "nosuid",
            "noexec",
            "nodev"
        ]
    }
]
❯ runc run test
ERRO[0000] runc run failed: unable to start container process: error during container init: error mounting "sysfs" to rootfs at "/sys": mount sysfs:/sys (via /proc/self/fd/7), flags: 0xe: operation not permitted
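
For readers puzzling over the `flags: 0xe` part of that error: it is the bitmask handed to mount(2). A minimal sketch of decoding it, assuming the standard Linux flag values from <sys/mount.h> (the helper name `decode_mount_flags` is made up for illustration and is not part of runc):

```python
# Linux mount(2) flag bits, as defined in <sys/mount.h>.
MOUNT_FLAGS = {
    0x1: "MS_RDONLY",
    0x2: "MS_NOSUID",
    0x4: "MS_NODEV",
    0x8: "MS_NOEXEC",
}


def decode_mount_flags(flags):
    """Return the names of the known flag bits that are set in `flags`."""
    return [name for bit, name in sorted(MOUNT_FLAGS.items()) if flags & bit]


print(decode_mount_flags(0xe))  # ['MS_NOSUID', 'MS_NODEV', 'MS_NOEXEC']
```

So the flags match the three options in the config (nosuid, nodev, noexec); the mount request itself is what gets refused, not some unexpected extra flag.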

Meanwhile, crun manages fine:

❯ crun run test
root@test:~# mount | grep sysfs
sys on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
Full config:
{
    "ociVersion": "1.0.1",
    "platform": {
        "os": "linux",
        "arch": "amd64"
    },
    "root": {
        "path": "/home/tim/Julia/depot/artifacts/4d66e139e0bcfdfa5ec6a8942a938e754e17860f",
        "readonly": true
    },
    "mounts": [
        {
            "destination": "/proc",
            "type": "proc",
            "source": "proc"
        },
        {
            "destination": "/dev",
            "type": "tmpfs",
            "source": "tmpfs",
            "options": [
                "nosuid",
                "strictatime",
                "mode=755",
                "size=65536k"
            ]
        },
        {
            "destination": "/dev/pts",
            "type": "devpts",
            "source": "devpts",
            "options": [
                "nosuid",
                "noexec",
                "newinstance",
                "ptmxmode=0666",
                "mode=0620"
            ]
        },
        {
            "destination": "/dev/shm",
            "type": "tmpfs",
            "source": "shm",
            "options": [
                "nosuid",
                "noexec",
                "nodev",
                "mode=1777",
                "size=65536k"
            ]
        },
        {
            "destination": "/dev/mqueue",
            "type": "mqueue",
            "source": "mqueue",
            "options": [
                "nosuid",
                "noexec",
                "nodev"
            ]
        },
        {
            "destination": "/sys",
            "type": "sysfs",
            "source": "sysfs",
            "options": [
                "nosuid",
                "noexec",
                "nodev"
            ]
        },
        {
            "destination": "/sys/fs/cgroup",
            "type": "cgroup",
            "source": "cgroup",
            "options": [
                "nosuid",
                "noexec",
                "nodev",
                "relatime",
                "ro"
            ]
        }
    ],
    "process": {
        "terminal": true,
        "cwd": "/root",
        "env": [
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "TERM=xterm"
        ],
        "args": [
            "/bin/bash", "--login"
        ],
        "rlimits": [
            {
                "type": "RLIMIT_NOFILE",
                "hard": 1024,
                "soft": 1024
            }
        ],
        "capabilities": {
            "bounding": [
                "CAP_AUDIT_WRITE",
                "CAP_KILL",
                "CAP_NET_BIND_SERVICE"
            ],
            "permitted": [
                    "CAP_AUDIT_WRITE",
                    "CAP_KILL",
                    "CAP_NET_BIND_SERVICE"
                ],
            "inheritable": [
                    "CAP_AUDIT_WRITE",
                    "CAP_KILL",
                    "CAP_NET_BIND_SERVICE"
                ],
            "effective": [
                "CAP_AUDIT_WRITE",
                "CAP_KILL"
            ],
            "ambient": [
                "CAP_NET_BIND_SERVICE"
            ]
        },
        "noNewPrivileges": true
    },
    "user": {
        "uid": 0,
        "gid": 0
    },
    "hostname": "test",
    "linux": {
        "resources": {
            "devices": [
                {
                    "allow": false,
                    "access": "rwm"
                }
            ]
        },
        "namespaces": [
            {
                "type": "pid"
            },
            {
                "type": "ipc"
            },
            {
                "type": "uts"
            },
            {
                "type": "mount"
            },
            {
                "type": "user"
            },
            {
                "type": "cgroup"
            }
        ],
        "uidMappings": [
            {
                "containerID": 0,
                "hostID": 1000,
                "size": 1
            }
        ],
        "gidMappings": [
            {
                "containerID": 0,
                "hostID": 1000,
                "size": 1
            }
        ],
        "devices": null
    }
}

Bind-mounting /sys instead works around the issue:

"mounts": [
    {
        "destination": "/sys",
        "type": "none",
        "source": "/sys",
        "options": [
            "rbind",
            "nosuid",
            "noexec",
            "nodev",
            "ro"
        ]
    }
]
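
If you want to apply this workaround to an existing config.json programmatically, something along these lines should do. This is only a sketch: `bind_mount_sys` is a hypothetical helper name, and it simply swaps the sysfs entry for the bind-mount entry shown above. Keep in mind that bind-mounting the host /sys may have security implications, so treat this as a convenience for experiments rather than a recommendation.

```python
import copy


def bind_mount_sys(config):
    """Return a copy of an OCI config whose sysfs mount on /sys is
    replaced by a read-only recursive bind mount of the host /sys."""
    new_config = copy.deepcopy(config)  # leave the caller's dict untouched
    for mount in new_config.get("mounts") or []:
        if mount.get("type") == "sysfs" and mount.get("destination") == "/sys":
            mount["type"] = "none"
            mount["source"] = "/sys"
            mount["options"] = ["rbind", "nosuid", "noexec", "nodev", "ro"]
    return new_config


example = {
    "mounts": [
        {"destination": "/proc", "type": "proc", "source": "proc"},
        {
            "destination": "/sys",
            "type": "sysfs",
            "source": "sysfs",
            "options": ["nosuid", "noexec", "nodev"],
        },
    ]
}
patched = bind_mount_sys(example)
print(patched["mounts"][1]["type"], patched["mounts"][1]["source"])
```

Loading and saving the file would be the usual `json.load` / `json.dump` around this call.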

maleadt, Nov 29 '22 15:11

I vaguely remember that this depends on the kernel version: some kernels (mistakenly) deny this mount.

Two possible solutions are:

  1. Upgrade the kernel
  2. Do not use rootless+userns+sysfs (lack of /sys might be OK for some containers).

I am not sure what the implications of bind-mounting the host /sys are, so I would not recommend doing that (at least not without some security analysis first).
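
To check whether a given config uses the combination from option 2, a rough sketch like the following could help. The function name `uses_userns_with_sysfs` is invented for this example, and the check only flags the combination (a user namespace plus a mount of type sysfs); it does not predict whether a particular kernel will actually refuse the mount.

```python
def uses_userns_with_sysfs(config):
    """True if the OCI config requests a user namespace and also
    contains a mount of type "sysfs"."""
    linux = config.get("linux") or {}
    namespaces = linux.get("namespaces") or []
    has_userns = any(ns.get("type") == "user" for ns in namespaces)
    mounts = config.get("mounts") or []
    has_sysfs = any(m.get("type") == "sysfs" for m in mounts)
    return has_userns and has_sysfs


config = {
    "mounts": [{"destination": "/sys", "type": "sysfs", "source": "sysfs"}],
    "linux": {"namespaces": [{"type": "mount"}, {"type": "user"}]},
}
print(uses_userns_with_sysfs(config))  # True
```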

kolyshkin, Nov 30 '22 01:11

Now,

  1. This is not a runc bug (but rather a kernel bug)
  2. There's nothing runc can do about this (there's no easy workaround, and bind-mounting /sys is questionable)

Based on these two points, I am closing this as not-a-bug.

Let me know if you feel differently.

kolyshkin, Nov 30 '22 01:11

> There's nothing runc can do about this (there's no easy workaround, and bind-mounting /sys is questionable)

But crun manages fine? I'm unfamiliar with the exact logic taking care of mounting sysfs, but this seems to indicate that there is a way to deal with this from the runtime's side.

Also, I'm happy to upgrade my kernel, but I'm using 5.15 -- the latest LTS -- which isn't exactly ancient. It's still what e.g. Ubuntu 22.04 is using/supporting for the next 5 years or so.

maleadt, Nov 30 '22 06:11

Also, this reproduces on kernel 6.0.10 (Arch Linux)...

maleadt, Nov 30 '22 07:11

OK, please tell us how to repro this (what is your environment and the steps to repro) and we'll take a look.

kolyshkin, Dec 01 '22 01:12

> OK, please tell us how to repro this (what is your environment and the steps to repro) and we'll take a look.

There's not much more to it than what I've reported here:

  • full config (as simple as possible) in the opening post
  • rootfs is a minimal Debian rootfs, but this also reproduces with, e.g., an Arch Linux rootfs (I tested with https://github.com/JuliaCI/PkgEval.jl/releases/download/v0.1/arch-devel-20220628.tar.xz)
  • host is Arch Linux, running 6.0.10-arch2-1 (i.e. the official kernel)
  • runc is the latest static binary from GitHub releases
./runc.amd64 run test
ERRO[0000] runc run failed: unable to start container process: error during container init: error mounting "sysfs" to rootfs at "/sys": mount sysfs:/sys (via /proc/self/fd/7), flags: 0xe: operation not permitted

maleadt, Dec 02 '22 11:12

This is not a runc bug; the kernel denies this mount. That is right.

Why can crun mount sysfs?

Because when it is in a user namespace, crun bind-mounts /sys instead of mounting sysfs.

https://github.com/containers/crun/blob/2700598aa9df55945d09084ca035e1d140bc7f73/src/libcrun/linux.c#L1084
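
To illustrate the idea, here is one way a runtime could structure such a fallback. This is a sketch under assumptions and not crun's actual code: the thread does not say whether crun tries the sysfs mount first, `mount_sys` is a made-up name, and the `mount` callable is supplied by the caller (it should raise OSError on failure) so the logic can be exercised without privileges.

```python
import errno


def mount_sys(mount, in_userns):
    """Try a fresh sysfs mount; if it fails with EPERM while in a user
    namespace, fall back to a recursive bind mount of the existing /sys.

    `mount(source, target, fstype, options)` is a caller-supplied callable
    that raises OSError on failure. Returns "sysfs" or "bind".
    """
    options = ["nosuid", "noexec", "nodev"]
    try:
        mount("sysfs", "/sys", "sysfs", options)
        return "sysfs"
    except OSError as exc:
        # Only EPERM inside a user namespace triggers the fallback.
        if exc.errno != errno.EPERM or not in_userns:
            raise
    mount("/sys", "/sys", "none", ["rbind"] + options)
    return "bind"


# Stand-in for the real mount call: refuses sysfs like the kernel does here.
calls = []


def fake_mount(source, target, fstype, options):
    calls.append((source, fstype))
    if fstype == "sysfs":
        raise OSError(errno.EPERM, "operation not permitted")


print(mount_sys(fake_mount, in_userns=True))  # bind
```

Outside a user namespace the EPERM is re-raised unchanged, so a genuinely misconfigured privileged container still fails loudly.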

g0dA, Dec 06 '22 16:12

I see; thanks!

maleadt, Dec 06 '22 19:12

https://github.com/containers/crun/commit/6785cefbdf982c97a5552c9ce7017b0e8309c291

I guess we should do the same in runc.

kolyshkin, Dec 08 '22 19:12

Note that runc spec --rootless generates a spec which has /sys as a bind mount. I guess that is why we never saw this error. The code was added by #744 (specifically, commit d04cbc49d2ae4488a566eab86102c398522aaf14).

I think we still have to support replacing a proper /sys mount with a bind mount because crun does it.

kolyshkin, Jan 04 '23 00:01