`--cgroups-per-qos` doesn't work
kubelet logs:

```
server.go:509] --cgroups-per-qos enabled, but --cgroup-root was not specified. defaulting to /
kubelet[216]: error: failed to run Kubelet: invalid configuration: cgroup-root "/" doesn't exist: <nil>
```
https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md https://github.com/haveatry/kubernetes/wiki/resource-qos
current workaround:
`--enforce-node-allocatable=` and `--cgroups-per-qos=false` in `scripts/bootstrap.sh#L43-L44`
I do not think this is an issue anymore. Closing.
@blixtra this is not fixed though; we just disable it: https://github.com/kinvolk/kube-spawn/blob/master/etc/kube_20-kubeadm-extra-args.conf#L5
Shouldn't we track this somehow?
Now that https://github.com/kubernetes-incubator/rktlet/pull/124 is implemented, I wanted to test it under kube-spawn with the QoS cgroups re-enabled. I need to specify `--cgroup-root=` with the cgroup path of the systemd-nspawn container. It should be a path like `/machine.slice/machine-kube\x2dspawn\x2d0.scope`. I would like to make a couple of changes:
- get rid of the `-` (dashes) in the machine name, so it would be `kubespawn0`, `kubespawn1` etc. That would simplify the escaping in the cgroup paths and it would be easier to write `--cgroup-root=`.
- make `cmd/kube-spawn/setup.go:writeKubeadmExtraArgs()` specific to the node number: each node would need to be passed a different kubeadm parameter for the cgroup. Is it possible?
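To illustrate why the dashes matter, here is a minimal Go sketch (with hypothetical helper names) of how machined derives the scope cgroup path from a machine name. It only covers the dash case relevant here, not systemd's full unit-name escaping rules:

```go
package main

import (
	"fmt"
	"strings"
)

// systemdEscape is a minimal sketch of systemd's unit-name escaping,
// covering only the case relevant here: "-" in a machine name becomes
// "\x2d" in the scope unit name (and thus in the cgroup path).
func systemdEscape(name string) string {
	return strings.ReplaceAll(name, "-", `\x2d`)
}

// cgroupRootFor builds the cgroup path of the machine's scope unit,
// as machined names it under /machine.slice.
func cgroupRootFor(machine string) string {
	return fmt.Sprintf("/machine.slice/machine-%s.scope", systemdEscape(machine))
}

func main() {
	fmt.Println(cgroupRootFor("kube-spawn-0")) // needs \x2d escaping
	fmt.Println(cgroupRootFor("kubespawn0"))   // literal path
}
```

With dashes, the resulting path contains `\x2d` sequences that need careful quoting on the kubeadm/Kubelet command line; without dashes, the path is literal.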
https://github.com/kinvolk/kube-spawn/pull/144 is now merged, so the cgroup paths on each node are:
- `/machine.slice/machine-kubespawn0.scope`
- `/machine.slice/machine-kubespawn1.scope`
- `/machine.slice/machine-kubespawn2.scope`
etc.
I see different options to go forward; none of them is a perfect solution:
1. kube-spawn to pass a different `--cgroup-root=` on each node
It is not so easy to do because:
- the kubeadm parameters are currently the same for all nodes, so kube-spawn will need to be refactored. See https://github.com/kinvolk/kube-spawn/commit/c426113feb28d0a1d352990e52585814bc0dfcb6 for pointers.
- when the machine is started as a systemd service (e.g. `systemd-run` + `systemd-nspawn --keep-unit`), systemd-nspawn creates a sub-cgroup called "payload" (since systemd v226, see code). So kube-spawn would need to know about this systemd-internal implementation detail in order to pass `--cgroup-root=/machine.slice/machine-kubespawn0.scope/payload`.
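As a sketch of what option 1 amounts to (a hypothetical helper, not existing kube-spawn code), the per-node flag would be built roughly like this, appending the `payload` suffix when the machine runs with `--keep-unit`:

```go
package main

import "fmt"

// kubeletCgroupRoot is a hypothetical helper sketching option 1: build
// a per-node --cgroup-root= flag. When the machine is started with
// systemd-run + systemd-nspawn --keep-unit, systemd-nspawn puts the
// container payload into a "payload" sub-cgroup (since systemd v226),
// so that suffix has to be appended.
func kubeletCgroupRoot(node int, keepUnit bool) string {
	root := fmt.Sprintf("/machine.slice/machine-kubespawn%d.scope", node)
	if keepUnit {
		root += "/payload" // systemd-nspawn internal detail
	}
	return "--cgroup-root=" + root
}

func main() {
	for node := 0; node < 3; node++ {
		fmt.Println(kubeletCgroupRoot(node, true))
	}
}
```

The refactoring difficulty is not this string, but plumbing a per-node value through code paths that currently pass identical kubeadm parameters to all nodes.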
2. Use cgroup namespaces
Currently, kube-spawn disables cgroup namespaces by setting the environment variable `SYSTEMD_NSPAWN_USE_CGNS=0`. If we enable cgroup namespaces, the machine will have the illusion of being at the root of the cgroup tree and we would not need to pass `--cgroup-root=` to the Kubelet.
However, cgroup namespaces were disabled for a reason: the machine needs read-write access to the different cgroup controllers (mem, cpu etc.) but systemd-nspawn only mounts them read-only. To work around that, kube-spawn bind-mounts `/sys/fs/cgroup` from the host with the `--bind` parameter. For this bind mount to work (and be correctly synchronised with `/proc/$pid/cgroup` in the container), cgroup namespaces need to be disabled.
A solution could be to add an option in systemd-nspawn to keep the cgroupfs in read-write mode.
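For reference, the current workaround can be sketched roughly like this (a hypothetical helper for illustration; the real invocation lives in kube-spawn's source):

```go
package main

import (
	"fmt"
	"os/exec"
)

// nspawnCommand is a hypothetical helper sketching the current wiring:
// cgroup namespaces are disabled via SYSTEMD_NSPAWN_USE_CGNS=0 and the
// host cgroupfs is bind-mounted so the machine gets read-write access
// to the controllers.
func nspawnCommand(machine string) *exec.Cmd {
	cmd := exec.Command("systemd-nspawn",
		"--machine", machine,
		"--bind", "/sys/fs/cgroup", // host cgroupfs, read-write
	)
	// Disable cgroup namespaces so the bind mount stays consistent
	// with /proc/$pid/cgroup inside the container.
	cmd.Env = append(cmd.Env, "SYSTEMD_NSPAWN_USE_CGNS=0")
	return cmd
}

func main() {
	fmt.Println(nspawnCommand("kubespawn0").Args)
}
```

An nspawn option to keep the cgroupfs read-write would let us drop both the environment variable and the bind mount.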
3. Use a better cgroup root default in Kubelet
Whenever `--cgroup-root=` is not specified, the Kubelet uses `/` as the default. The Kubelet could be patched to use a better default: it could look at `/proc/1/cgroup` and take the root cgroup from there. But since systemd itself lives in `init.scope` (since systemd v226), the Kubelet would need to strip that suffix. That could be considered systemd-specific code.
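A sketch of that heuristic, assuming the systemd named-hierarchy line of `/proc/1/cgroup` is passed in (the helper name is hypothetical, this is not Kubelet code):

```go
package main

import (
	"fmt"
	"strings"
)

// cgroupRootFromProc sketches the default option 3 proposes: given the
// systemd hierarchy line of /proc/1/cgroup ("ID:controllers:path"),
// return PID 1's cgroup with the "/init.scope" suffix stripped, since
// systemd itself lives in init.scope (since systemd v226).
func cgroupRootFromProc(line string) string {
	parts := strings.SplitN(line, ":", 3)
	if len(parts) != 3 {
		return "/" // fall back to the current Kubelet default
	}
	return strings.TrimSuffix(parts[2], "/init.scope")
}

func main() {
	fmt.Println(cgroupRootFromProc(
		"1:name=systemd:/machine.slice/machine-kubespawn0.scope/init.scope"))
}
```

The `init.scope` stripping is exactly the systemd-specific part that might be unwelcome upstream.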
4. Hardcode the cgroup root and only test on 1-node clusters
For the purpose of a quick test, kube-spawn could just hardcode the cgroup root `/machine.slice/machine-kubespawn0.scope/payload` and we could only test on a 1-node cluster.
Option 4 is not trivial either: I hard-coded:
`--cgroup-root=/machine.slice/machine-kubespawn0.scope`
But then the Kubelet fails to start:
```
[pid 10531] stat("/sys/fs/cgroup/hugetlb/machine.slice/machine-machine_kubespawn0.scope.slice", 0xc420afb078) = -1 ENOENT (No such file or directory)
[pid 10531] write(2, "error: failed to run Kubelet: invalid configuration: cgroup-root \"/machine.slice/machine-kubespawn0.scope\" doesn't exist: unable to find data for container /\n", 158) = 158
[pid 10531] exit_group(1) = ?
```
Some problems:
- it fails on `hugetlb` and does not continue further, but systemd-nspawn does not create sub-cgroups for the machine on this controller (`perf_event`, `cpuset`, `net_cls`, `net_prio` and `freezer` will have the same problem)
- it seems to interpret the cgroup root as some kind of systemd slice (see the `.scope.slice` suffix), which makes it unsuitable for filtering on the machine cgroup subtree.