EPERM mounting sysfs with rootless/userns container
I'm trying out runc to get simple unprivileged containerized execution, but am having issues mounting sysfs:
"mounts": [
  {
    "destination": "/sys",
    "type": "sysfs",
    "source": "sysfs",
    "options": [
      "nosuid",
      "noexec",
      "nodev"
    ]
  }
]
❯ runc run test
ERRO[0000] runc run failed: unable to start container process: error during container init: error mounting "sysfs" to rootfs at "/sys": mount sysfs:/sys (via /proc/self/fd/7), flags: 0xe: operation not permitted
Meanwhile, crun manages fine:
❯ crun run test
root@test:~# mount | grep sysfs
sys on /sys type sysfs (rw,nosuid,nodev,noexec,relatime)
Full config
{
  "ociVersion": "1.0.1",
  "platform": {
    "os": "linux",
    "arch": "amd64"
  },
  "root": {
    "path": "/home/tim/Julia/depot/artifacts/4d66e139e0bcfdfa5ec6a8942a938e754e17860f",
    "readonly": true
  },
  "mounts": [
    {
      "destination": "/proc",
      "type": "proc",
      "source": "proc"
    },
    {
      "destination": "/dev",
      "type": "tmpfs",
      "source": "tmpfs",
      "options": [
        "nosuid",
        "strictatime",
        "mode=755",
        "size=65536k"
      ]
    },
    {
      "destination": "/dev/pts",
      "type": "devpts",
      "source": "devpts",
      "options": [
        "nosuid",
        "noexec",
        "newinstance",
        "ptmxmode=0666",
        "mode=0620"
      ]
    },
    {
      "destination": "/dev/shm",
      "type": "tmpfs",
      "source": "shm",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "mode=1777",
        "size=65536k"
      ]
    },
    {
      "destination": "/dev/mqueue",
      "type": "mqueue",
      "source": "mqueue",
      "options": [
        "nosuid",
        "noexec",
        "nodev"
      ]
    },
    {
      "destination": "/sys",
      "type": "sysfs",
      "source": "sysfs",
      "options": [
        "nosuid",
        "noexec",
        "nodev"
      ]
    },
    {
      "destination": "/sys/fs/cgroup",
      "type": "cgroup",
      "source": "cgroup",
      "options": [
        "nosuid",
        "noexec",
        "nodev",
        "relatime",
        "ro"
      ]
    }
  ],
  "process": {
    "terminal": true,
    "cwd": "/root",
    "env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
      "TERM=xterm"
    ],
    "args": [
      "/bin/bash", "--login"
    ],
    "rlimits": [
      {
        "type": "RLIMIT_NOFILE",
        "hard": 1024,
        "soft": 1024
      }
    ],
    "capabilities": {
      "bounding": [
        "CAP_AUDIT_WRITE",
        "CAP_KILL",
        "CAP_NET_BIND_SERVICE"
      ],
      "permitted": [
        "CAP_AUDIT_WRITE",
        "CAP_KILL",
        "CAP_NET_BIND_SERVICE"
      ],
      "inheritable": [
        "CAP_AUDIT_WRITE",
        "CAP_KILL",
        "CAP_NET_BIND_SERVICE"
      ],
      "effective": [
        "CAP_AUDIT_WRITE",
        "CAP_KILL"
      ],
      "ambient": [
        "CAP_NET_BIND_SERVICE"
      ]
    },
    "noNewPrivileges": true
  },
  "user": {
    "uid": 0,
    "gid": 0
  },
  "hostname": "test",
  "linux": {
    "resources": {
      "devices": [
        {
          "allow": false,
          "access": "rwm"
        }
      ]
    },
    "namespaces": [
      {
        "type": "pid"
      },
      {
        "type": "ipc"
      },
      {
        "type": "uts"
      },
      {
        "type": "mount"
      },
      {
        "type": "user"
      },
      {
        "type": "cgroup"
      }
    ],
    "uidMappings": [
      {
        "containerID": 0,
        "hostID": 1000,
        "size": 1
      }
    ],
    "gidMappings": [
      {
        "containerID": 0,
        "hostID": 1000,
        "size": 1
      }
    ],
    "devices": null
  }
}
Bind-mounting /sys instead works around the issue:
"mounts": [
  {
    "destination": "/sys",
    "type": "none",
    "source": "/sys",
    "options": [
      "rbind",
      "nosuid",
      "noexec",
      "nodev",
      "ro"
    ]
  }
]
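For anyone who wants to apply this workaround to an existing config.json without editing it by hand, here is a minimal sketch. It assumes the spec has already been loaded into a Python dict (for example with `json.load`). The helper name `bind_mount_sys` is made up for illustration, and the bind-mount options are simply the ones from the snippet above.

```python
import copy

# Options copied from the bind-mount workaround shown above.
BIND_SYS_OPTIONS = ["rbind", "nosuid", "noexec", "nodev", "ro"]


def bind_mount_sys(spec):
    """Return a copy of an OCI spec dict in which a sysfs mount on /sys
    is replaced by a recursive, read-only bind mount of the host /sys.

    Hypothetical helper for illustration; not part of runc or crun.
    """
    new_spec = copy.deepcopy(spec)  # leave the caller's dict untouched
    for mount in new_spec.get("mounts") or []:
        if mount.get("type") == "sysfs" and mount.get("destination") == "/sys":
            mount["type"] = "none"
            mount["source"] = "/sys"
            mount["options"] = list(BIND_SYS_OPTIONS)
    return new_spec


if __name__ == "__main__":
    import json

    example = {
        "mounts": [
            {
                "destination": "/sys",
                "type": "sysfs",
                "source": "sysfs",
                "options": ["nosuid", "noexec", "nodev"],
            }
        ]
    }
    print(json.dumps(bind_mount_sys(example), indent=2))
```

All other mounts are left as they are, so the result can be written back with `json.dump` and used as a drop-in replacement for the original config.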
I vaguely remember that this depends on the kernel version: some kernels (mistakenly) deny this mount.
Two possible solutions are:
- Upgrade the kernel
- Do not use rootless+userns+sysfs (lack of /sys might be OK for some containers).
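The second option can be sketched as a small transformation of the spec. This is only an illustration: `drop_sys_mounts` is a made-up name, the spec is assumed to be loaded into a Python dict, and also dropping mounts nested under /sys (such as /sys/fs/cgroup) is an assumption of this sketch rather than something runc requires.

```python
import copy


def drop_sys_mounts(spec):
    """Return a copy of an OCI spec dict without any mount at /sys or below.

    Hypothetical helper for illustration. Removing nested mounts such as
    /sys/fs/cgroup is a choice made for this sketch.
    """
    new_spec = copy.deepcopy(spec)  # leave the caller's dict untouched

    def under_sys(mount):
        dest = str(mount.get("destination", ""))
        return dest == "/sys" or dest.startswith("/sys/")

    new_spec["mounts"] = [m for m in (new_spec.get("mounts") or [])
                          if not under_sys(m)]
    return new_spec
```

Whether a container still works without /sys depends on what runs inside it, so this is worth testing against the actual workload.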
I am not sure what the implications of bind-mounting the host /sys are, so I would not recommend doing that (without doing some security analysis first, that is).
Now,
- This is not a runc bug (but rather a kernel bug)
- There's nothing runc can do about this (there's no easy workaround, and bind-mounting /sys is questionable)
Based on these two points, I am closing this as not-a-bug.
Let me know if you feel differently.
> There's nothing runc can do about this (there's no easy workaround, and bind-mounting /sys is questionable)
But crun manages fine? I'm unfamiliar with the exact logic taking care of mounting sysfs, but this seems to indicate that there is a way to deal with this from the runtime's side.
Also, I'm happy to upgrade my kernel, but I'm using 5.15 -- the latest LTS -- which isn't exactly ancient. It's still what e.g. Ubuntu 22.04 is using/supporting for the next 5 years or so.
Also, this reproduces on kernel 6.0.10 (Arch Linux)...
OK, please tell us how to repro this (what is your environment and the steps to repro) and we'll take a look.
> OK, please tell us how to repro this (what is your environment and the steps to repro) and we'll take a look.
There's not much more to it than what I've reported here:
- full config (as simple as possible) in the opening post
- rootfs is a minimal Debian rootfs, but this also reproduces with, e.g., an Arch Linux rootfs (I tested with https://github.com/JuliaCI/PkgEval.jl/releases/download/v0.1/arch-devel-20220628.tar.xz)
- host is Arch Linux, running 6.0.10-arch2-1 (i.e. the official kernel)
- runc is the latest static binary from GitHub releases
./runc.amd64 run test
ERRO[0000] runc run failed: unable to start container process: error during container init: error mounting "sysfs" to rootfs at "/sys": mount sysfs:/sys (via /proc/self/fd/7), flags: 0xe: operation not permitted
This is not a runc bug. The kernel denies this mount, and that is correct.
Why can crun mount sysfs?
Because when running in a user namespace, crun bind-mounts /sys instead of mounting sysfs:
https://github.com/containers/crun/blob/2700598aa9df55945d09084ca035e1d140bc7f73/src/libcrun/linux.c#L1084
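The idea behind that code could be sketched roughly as follows. This is not crun's actual implementation (which is written in C). `mount_sys` and the injected `do_mount` callable are hypothetical names, and the try-then-fall-back ordering and the bind-mount options are assumptions made for illustration.

```python
import errno


def mount_sys(do_mount, in_userns):
    """Mount /sys, falling back to a bind mount of the host /sys.

    `do_mount(source, target, fstype, options)` is a hypothetical callable
    standing in for mount(2); it is expected to raise OSError on failure.
    Returns "sysfs" or "bind" depending on which mount succeeded.
    """
    try:
        do_mount("sysfs", "/sys", "sysfs", ["nosuid", "noexec", "nodev"])
        return "sysfs"
    except OSError as err:
        # Only fall back on EPERM inside a user namespace;
        # any other failure is re-raised to the caller.
        if err.errno != errno.EPERM or not in_userns:
            raise
    # Assumed fallback: recursively bind-mount the host /sys.
    do_mount("/sys", "/sys", "none", ["rbind", "nosuid", "noexec", "nodev"])
    return "bind"
```

Passing the mount operation in as a callable keeps the sketch runnable without privileges; a real runtime would call mount(2) directly at that point.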
I see; thanks!
https://github.com/containers/crun/commit/6785cefbdf982c97a5552c9ce7017b0e8309c291
We should do the same for runc I guess
Note that runc spec --rootless generates a spec which has /sys as a bind mount. I guess that is why we never saw this error. The code was added by #744 (specifically, commit d04cbc49d2ae4488a566eab86102c398522aaf14).
I think we still have to support replacing a proper /sys mount with a bind mount because crun does it.