`--cgroups-per-qos` doesn't work
kubelet logs:

```
server.go:509] --cgroups-per-qos enabled, but --cgroup-root was not specified. defaulting to /
kubelet[216]: error: failed to run Kubelet: invalid configuration: cgroup-root "/" doesn't exist: <nil>
```
https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node-allocatable.md https://github.com/haveatry/kubernetes/wiki/resource-qos
current workaround:
`--enforce-node-allocatable=` and `--cgroups-per-qos=false` in `scripts/bootstrap.sh#L43-L44`
I do not think this is an issue anymore. Closing.
@blixtra this is not fixed though; we just disable it: https://github.com/kinvolk/kube-spawn/blob/master/etc/kube_20-kubeadm-extra-args.conf#L5
Shouldn't we track this somehow?
Now that https://github.com/kubernetes-incubator/rktlet/pull/124 is implemented, I wanted to test it under kube-spawn with the QoS cgroups re-enabled. I need to specify `--cgroup-root=` with the cgroup path of the systemd-nspawn container. It should be a path like `/machine.slice/machine-kube\x2dspawn\x2d0.scope`. I would like to make a couple of changes:
- get rid of the `-` (dashes) in the machine name, so it would be `kubespawn0`, `kubespawn1` etc. That would simplify the escaping in the cgroup paths and it would be easier to write `--cgroup-root=`.
- make `cmd/kube-spawn/setup.go:writeKubeadmExtraArgs()` specific to the node number: each node would need to be passed a different kubeadm parameter for the cgroup. Is it possible?
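To illustrate why the dashes matter, here is a minimal Go sketch (with hypothetical helper names) of how machined derives the scope cgroup path from a machine name. It only covers the dash case relevant here, not systemd's full unit-name escaping rules:

```go
package main

import (
	"fmt"
	"strings"
)

// systemdEscape is a minimal sketch of systemd's unit-name escaping,
// covering only the case relevant here: "-" in a machine name becomes
// "\x2d" in the scope unit name (and thus in the cgroup path).
func systemdEscape(name string) string {
	return strings.ReplaceAll(name, "-", `\x2d`)
}

// cgroupRootFor builds the cgroup path of the machine's scope unit,
// as machined names it under /machine.slice.
func cgroupRootFor(machine string) string {
	return fmt.Sprintf("/machine.slice/machine-%s.scope", systemdEscape(machine))
}

func main() {
	fmt.Println(cgroupRootFor("kube-spawn-0")) // needs \x2d escaping
	fmt.Println(cgroupRootFor("kubespawn0"))   // literal path
}
```

With dashes, the resulting path contains `\x2d` sequences that need careful quoting on the kubeadm/Kubelet command line; without dashes, the path is literal.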
https://github.com/kinvolk/kube-spawn/pull/144 is now merged, so the cgroup paths on each node are:
- `/machine.slice/machine-kubespawn0.scope`
- `/machine.slice/machine-kubespawn1.scope`
- `/machine.slice/machine-kubespawn2.scope`
etc.
I see different options to go forward; none of them is a perfect solution:
1. kube-spawn to pass a different `--cgroup-root=` on each node
It is not so easy to do because:
- the kubeadm parameters are currently the same for all nodes, so kube-spawn will need to be refactored. See https://github.com/kinvolk/kube-spawn/commit/c426113feb28d0a1d352990e52585814bc0dfcb6 for pointers.
- when the machine is started as a systemd service (e.g. `systemd-run` + `systemd-nspawn --keep-unit`), systemd-nspawn creates a sub-cgroup called "payload" (since systemd v226, see code). So kube-spawn would need to know about this systemd-internal implementation detail in order to pass `--cgroup-root=/machine.slice/machine-kubespawn0.scope/payload`.
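As a sketch of what option 1 amounts to (a hypothetical helper, not existing kube-spawn code), the per-node flag would be built roughly like this, appending the `payload` suffix when the machine runs with `--keep-unit`:

```go
package main

import "fmt"

// kubeletCgroupRoot is a hypothetical helper sketching option 1: build
// a per-node --cgroup-root= flag. When the machine is started with
// systemd-run + systemd-nspawn --keep-unit, systemd-nspawn puts the
// container payload into a "payload" sub-cgroup (since systemd v226),
// so that suffix has to be appended.
func kubeletCgroupRoot(node int, keepUnit bool) string {
	root := fmt.Sprintf("/machine.slice/machine-kubespawn%d.scope", node)
	if keepUnit {
		root += "/payload" // systemd-nspawn internal detail
	}
	return "--cgroup-root=" + root
}

func main() {
	for node := 0; node < 3; node++ {
		fmt.Println(kubeletCgroupRoot(node, true))
	}
}
```

The refactoring difficulty is not this string, but plumbing a per-node value through code paths that currently pass identical kubeadm parameters to all nodes.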
2. Use cgroup namespaces
Currently, kube-spawn disables cgroup namespaces by setting the environment variable `SYSTEMD_NSPAWN_USE_CGNS=0`. If we enable cgroup namespaces, the machine will have the illusion of being at the root of the cgroup tree and we would not need to pass `--cgroup-root=` to the Kubelet.
However, cgroup namespaces were disabled for a reason: the machine needs read-write access to the different cgroup controllers (mem, cpu etc.) but systemd-nspawn only mounts them read-only. To work around that, kube-spawn bind-mounts `/sys/fs/cgroup` from the host with the `--bind` parameter. For this bind mount to work (and be correctly synchronised with `/proc/$pid/cgroup` in the container), cgroup namespaces need to be disabled.
A solution could be to add an option in systemd-nspawn to keep the cgroupfs in read-write mode.
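For reference, the current workaround can be sketched roughly like this (a hypothetical helper for illustration; the real invocation lives in kube-spawn's source):

```go
package main

import (
	"fmt"
	"os/exec"
)

// nspawnCommand is a hypothetical helper sketching the current wiring:
// cgroup namespaces are disabled via SYSTEMD_NSPAWN_USE_CGNS=0 and the
// host cgroupfs is bind-mounted so the machine gets read-write access
// to the controllers.
func nspawnCommand(machine string) *exec.Cmd {
	cmd := exec.Command("systemd-nspawn",
		"--machine", machine,
		"--bind", "/sys/fs/cgroup", // host cgroupfs, read-write
	)
	// Disable cgroup namespaces so the bind mount stays consistent
	// with /proc/$pid/cgroup inside the container.
	cmd.Env = append(cmd.Env, "SYSTEMD_NSPAWN_USE_CGNS=0")
	return cmd
}

func main() {
	fmt.Println(nspawnCommand("kubespawn0").Args)
}
```

An nspawn option to keep the cgroupfs read-write would let us drop both the environment variable and the bind mount.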
3. Use a better cgroup root default in Kubelet
Whenever `--cgroup-root=` is not specified, the Kubelet uses `/` as the default. The Kubelet could be patched to use a better default: it could look at `/proc/1/cgroup` and take the root cgroup from there. But since systemd itself lives in `init.scope` (since systemd v226), the Kubelet would need to strip that suffix. That could be considered systemd-specific code.
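A sketch of that heuristic, assuming the systemd named-hierarchy line of `/proc/1/cgroup` is passed in (the helper name is hypothetical, this is not Kubelet code):

```go
package main

import (
	"fmt"
	"strings"
)

// cgroupRootFromProc sketches the default option 3 proposes: given the
// systemd hierarchy line of /proc/1/cgroup ("ID:controllers:path"),
// return PID 1's cgroup with the "/init.scope" suffix stripped, since
// systemd itself lives in init.scope (since systemd v226).
func cgroupRootFromProc(line string) string {
	parts := strings.SplitN(line, ":", 3)
	if len(parts) != 3 {
		return "/" // fall back to the current Kubelet default
	}
	return strings.TrimSuffix(parts[2], "/init.scope")
}

func main() {
	fmt.Println(cgroupRootFromProc(
		"1:name=systemd:/machine.slice/machine-kubespawn0.scope/init.scope"))
}
```

The `init.scope` stripping is exactly the systemd-specific part that might be unwelcome upstream.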
4. Hardcode the cgroup root and only test on 1-node clusters
For the purpose of a quick test, kube-spawn could just hardcode the cgroup root `/machine.slice/machine-kubespawn0.scope/payload` and we could only test on a 1-node cluster.
Option 4 is not trivial either: I hard-coded:
`--cgroup-root=/machine.slice/machine-kubespawn0.scope`
But then the Kubelet fails to start:
```
[pid 10531] stat("/sys/fs/cgroup/hugetlb/machine.slice/machine-machine_kubespawn0.scope.slice", 0xc420afb078) = -1 ENOENT (No such file or directory)
[pid 10531] write(2, "error: failed to run Kubelet: invalid configuration: cgroup-root \"/machine.slice/machine-kubespawn0.scope\" doesn't exist: unable to find data for container /\n", 158) = 158
[pid 10531] exit_group(1) = ?
```
Some problems:
- it fails on `hugetlb` and does not continue further, but systemd-nspawn does not create sub-cgroups for the machine on this controller (`perf_event`, `cpuset`, `net_cls`, `net_prio` and `freezer` will have the same problem)
- it seems to interpret the cgroup root as some kind of systemd slice (see the `.scope.slice` suffix), which makes it unsuitable for filtering on the machine cgroup subtree.