runtime-spec icon indicating copy to clipboard operation
runtime-spec copied to clipboard

Cgroup (namespace) Delegation

Open sargun opened this issue 7 years ago • 6 comments

Now that cgroups can be delegated, it would be nice to have the ability to specify that in the config: http://lkml.iu.edu/hypermail/linux/kernel/1801.1/01070.html

It'd be nice to stay which cgroups should be delegated, and which shouldn't. Right now, all of /sys/fs/cgroup gets bind mounted.

sargun avatar Jan 27 '18 07:01 sargun

nsdelegate is a bit of a weird design which doesn't really fit with the "container config" purpose of the runtime-spec. It only applies to cgroupv2 and it applies to all cgroups in the v2 hierarchy.

In addition, the way you set up delegation is that you have to set the nsdelegate option on the root cgroup2 mountpoint -- which is usually set up by the init system (systemd in most cases). So from the perspective of a container runtime, nsdelegate is a pre-configured setup of the system.

However, I do agree that we should have cgroup namespace support. Once we have that then we can do a "real" cgroup2 mount. But as has been discussed in other cgroup2 discussion threads, the key cgroups we need for containers (devices and freezer) are not available in cgroup2 and thus we cannot make the wholesale switch -- which will lead to a lot of issues with management of both trees.

Christian Brauner of the LXC project gave a very good talk on what the precise complications are with a hybrid cgroup setup for container runtimes and why this is quite complicated to do.

cyphar avatar Jan 28 '18 16:01 cyphar

@cyphar devices was merged into master, based on BPF filters. So, that's done. I've talked to Tejun, and it sounds like Freezer is underway, although it will not be the same approach. It seems like all we really use Freezer for (that's critical) is to terminate all processes within the container, no? This could just as easily be done by setting pids.max to 0, and walking the hierarchy, or terminating pid 1 of the pid namespace.

I agree, that mixing them is an absolute mess, and I wouldn't push that strategy at all.

sargun avatar Jan 28 '18 22:01 sargun

On Sun, Jan 28, 2018 at 08:28:05AM -0800, Aleksa Sarai wrote:

However, I do agree that we should have cgroup namespace support.

Don't we have that since your #397?

wking avatar Jan 28 '18 23:01 wking

@wking That should come in https://github.com/opencontainers/runc/pull/1184

sargun avatar Jan 29 '18 06:01 sargun

It seems like all we really use Freezer for (that's critical) is to terminate all processes within the container, no? This could just as easily be done by setting pids.max to 0, and walking the hierarchy, or terminating pid 1 of the pid namespace.

We also expose freezer with runc pause. This is only critical if the plan is for users to actually go and use cgroupv2-only right now (underneath Docker for instance -- which would break). We could also do the pids.max thing (though I'm unclear if you'd want this to be a temporary workaround or permanent). The reason we don't do that for cgroupv1 is because pids was added more recently than freezer -- and when I got the pids controller merged we already had the freezer implementation for runc.

I still am not sure about how cgroupv2 will work in practice with subtree_control (should runc be writing to the subtree_control all the way up the cgroup tree -- or only the cgroups we've created? Does that mean that we should fail if a controller wasn't enabled by the admin?). There are several administrative questions that I feel are more complicated with cgroupv2 that should be considered when implementing it.

cyphar avatar Feb 02 '18 05:02 cyphar

This was fixed via https://github.com/opencontainers/runtime-spec/pull/1123, right? :pray:

tianon avatar Mar 02 '22 23:03 tianon