
optimize cgroups settings for node reserved

shelmingsong opened this pull request 3 years ago • 7 comments

What type of PR is this?

/kind feature

What this PR does / why we need it:

If the cgroup driver is systemd and kube_reserved or system_reserved is set:

  • Specify kubeReservedCgroup or systemReservedCgroup in kubelet-config.v1beta1.yaml.j2
  • Pre-create the required cgroup directories before the kubelet service starts
  • Specify the cgroup slice in the containerd service

The cgroup setup and hierarchy follow this article.

After the setup, the cgroups hierarchy is as follows:

/ (Cgroups Root)
├── kubepods.slice
│   ├── ...
│   ├── kubepods-besteffort.slice
│   ├── kubepods-burstable.slice
│   └── ...
├── kube.slice
│   ├── ...
│   ├── containerd.service
│   ├── kubelet.service
│   └── ...
├── system.slice
│   └── ...
└── ...
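
For illustration, here is a minimal sketch of the kubelet configuration this hierarchy maps to. The resource amounts are only examples and the real file is rendered from kubelet-config.v1beta1.yaml.j2 based on the inventory variables, so treat this as an assumption about the shape of the result, not the exact output:

```yaml
# Illustrative fragment only; sizes are examples, not kubespray defaults.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
# Resources set aside for kubelet and the container runtime.
kubeReserved:
  cpu: 200m
  memory: 512Mi
# Resources set aside for OS daemons (sshd, journald, ...).
systemReserved:
  cpu: 200m
  memory: 512Mi
# Parent cgroups holding those daemons; with the systemd driver these
# must exist as kube.slice and system.slice (see the hierarchy above).
kubeReservedCgroup: /kube.slice
systemReservedCgroup: /system.slice
# Enforce the reservations in addition to the pods-level limit.
enforceNodeAllocatable:
  - pods
  - kube-reserved
  - system-reserved
```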

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

optimize cgroups settings for node reserved

shelmingsong avatar Aug 23 '22 02:08 shelmingsong

Hi @shelmingsong. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Aug 23 '22 02:08 k8s-ci-robot

Hi @shelmingsong , thank you for submitting this.

Before we accept the change, could you please explain how this is an optimisation? The design document is from the 1.7 days (2018), and the kubelet cgroup driver already takes care of setting up the hierarchy. Adding extra logic to the systemd units seems counter-intuitive and prone to clashing with future versions of Kubernetes, so I would personally like to understand what benefit we get by adding this.

/hold

cristicalin avatar Aug 23 '22 06:08 cristicalin

Hi @cristicalin , thank you for your reply.

Let me describe the reasons for these changes in detail.

  1. Specify kubeReservedCgroup or systemReservedCgroup in kubelet-config.v1beta1.yaml.j2

When enforceNodeAllocatable in kubelet-config.v1beta1.yaml contains kube-reserved or system-reserved, but kubeReservedCgroup or systemReservedCgroup is not specified, enforceNodeAllocatable will not take effect.

This can be seen in this official Kubernetes document: [enforcing-node-allocatable](https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#enforcing-node-allocatable)

> Optionally, kubelet can be made to enforce kube-reserved and system-reserved by specifying kube-reserved & system-reserved values in the same flag. Note that to enforce kube-reserved or system-reserved, --kube-reserved-cgroup or --system-reserved-cgroup needs to be specified respectively.

  2. Pre-create the cgroup directories before the kubelet service starts

kubeReservedCgroup and systemReservedCgroup will not take effect if we do not create the cgroup directories required by the kubelet in advance (a rough sketch of this step follows this list). This can be seen in the following two official Kubernetes documents:

  • [kube-reserved](https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#kube-reserved)

> Note that Kubelet does not create --kube-reserved-cgroup if it doesn't exist. Kubelet will fail if an invalid cgroup is specified. With systemd cgroup driver, you should follow a specific pattern for the name of the cgroup you define: the name should be the value you set for --kube-reserved-cgroup, with .slice appended.

  • [system-reserved](https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#system-reserved)

> Note that kubelet does not create --system-reserved-cgroup if it doesn't exist. kubelet will fail if an invalid cgroup is specified. With systemd cgroup driver, you should follow a specific pattern for the name of the cgroup you define: the name should be the value you set for --system-reserved-cgroup, with .slice appended.

  3. Specify the cgroup slice in the containerd service

If we do not specify a cgroup slice in the containerd service, the containerd cgroup is created under system.slice by default, which is not what the kube-reserved setup expects (a sketch of such a drop-in also follows below).
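
To make step 2 concrete, here is a rough Ansible-style sketch of pre-creating the kube-reserved parent cgroup as a kube.slice unit. This only illustrates the idea under the assumption that kubeReservedCgroup is /kube.slice; the unit path and task names are illustrative, not the exact tasks in this PR:

```yaml
# Sketch only: make sure the cgroup referenced by kubeReservedCgroup
# (/kube.slice with the systemd driver) exists before kubelet starts.
- name: Install kube.slice unit
  ansible.builtin.copy:
    dest: /etc/systemd/system/kube.slice
    content: |
      [Unit]
      Description=Parent slice for kubelet and the container runtime
    mode: "0644"

- name: Start kube.slice so the cgroup directory is created
  ansible.builtin.systemd:
    name: kube.slice
    state: started
    daemon_reload: true
```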
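And for step 3, a similarly hedged sketch that pins the containerd service into kube.slice through a systemd drop-in. The drop-in file name and the handler name are assumptions made for the example:

```yaml
# Sketch only: move containerd's cgroup from the default system.slice
# into kube.slice via a systemd drop-in.
- name: Create containerd drop-in directory
  ansible.builtin.file:
    path: /etc/systemd/system/containerd.service.d
    state: directory
    mode: "0755"

- name: Pin containerd into kube.slice
  ansible.builtin.copy:
    dest: /etc/systemd/system/containerd.service.d/10-cgroup-slice.conf
    content: |
      [Service]
      Slice=kube.slice
    mode: "0644"
  notify: Restart containerd   # hypothetical handler name
```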

shelmingsong avatar Aug 24 '22 02:08 shelmingsong

Hi @shelmingsong , please rebase this PR on the latest state of the master branch to be able to pass the CI tests.

cristicalin avatar Aug 30 '22 05:08 cristicalin

Hi @cristicalin , thank you for reminding me. I rebased, and the CI tests have now passed

shelmingsong avatar Aug 31 '22 02:08 shelmingsong

Great work @shelmingsong, thank you!

/approve
/ok-to-test

cristicalin avatar Aug 31 '22 13:08 cristicalin

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cristicalin, shelmingsong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Aug 31 '22 13:08 k8s-ci-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Dec 25 '22 07:12 k8s-triage-robot

/remove-lifecycle stale

shelmingsong avatar Dec 28 '22 09:12 shelmingsong

> /remove-lifecycle stale

@shelmingsong Wow sorry I guess we missed this one, I'll do my best to take the time and review it, big one 😅

floryut avatar Dec 29 '22 09:12 floryut

> /remove-lifecycle stale
>
> @shelmingsong Wow sorry I guess we missed this one, I'll do my best to take the time and review it, big one 😅

it's okay (^▽^)

shelmingsong avatar Dec 30 '22 15:12 shelmingsong

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cristicalin, floryut, shelmingsong

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • ~~OWNERS~~ [cristicalin,floryut]

Approvers can indicate their approval by writing /approve in a comment. Approvers can cancel approval by writing /approve cancel in a comment.

k8s-ci-robot avatar Dec 30 '22 16:12 k8s-ci-robot

Hello everyone, haven't these changes added some misconfiguration in kubelet.env? If kube_reserved is set to false (which is the default), kubelet starts with the flag --runtime-cgroups=/kube.slice/{{ container_manager }}.service, but the runtime (for example, Docker used with cri-dockerd) still runs under system.slice. Maybe I'm wrong and missed something, so I just want to point this out.

mirex05 avatar Feb 10 '23 05:02 mirex05

I think there are two separate functions at play here:

  1. Resource reservation, configured by KubeReserved and SystemReserved. This is like a k8s resource request. It is very important for cluster stability: it ensures that pods cannot consume all node resources and leaves enough for the daemons running on the node. For scheduling purposes, the total resources on the node are simply reduced by the kube-reserved and system-reserved amounts, and the remaining node allocatable amount can be used by pods.
  2. Resource enforcement, configured by kubeReservedCgroup and systemReservedCgroup. This is like a k8s resource limit. Applying this can be dangerous because it would kill the critical daemons if they exceed the specified amount of resources (though that may also depend on the value of enforceNodeAllocatable?).

Note from the docs, the kubeReservedCgroup setting is optional. It is not required for reservation; it is for enforcement. You can (and should) set KubeReserved without kubeReservedCgroup.

> To optionally enforce kubeReserved on kubernetes system daemons, specify the parent control group for kube daemons as the value for kubeReservedCgroup setting, and add kube-reserved to enforceNodeAllocatable

Before this PR, the default behaviour was that resource reservation was applied (via KubeReserved) and enforcement was not (the optional kubeReservedCgroup parameter was not applied). This is the best/safest/most resilient default behaviour. After this PR, however, resource reservation disappears by default (since kube_reserved defaults to false), so pods can fully consume node resources, causing important node daemons to die.

> If we do not specify a cgroup slice in the containerd service, the containerd cgroup is created under system.slice by default, which is not what the kube-reserved setup expects.

I don't think that is true, because the safe, standard way to reserve resources for Kubernetes daemons is to define KubeReserved resources without specifying the optional and unnecessary kubeReservedCgroup setting. After this PR that is no longer possible, and there are no kube-reserved resources by default.
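
For comparison, this is what I mean by reservation without enforcement; a minimal sketch with example sizes, not a recommended configuration:

```yaml
# Reservation only: node allocatable is reduced for the scheduler,
# but no cgroup limit is applied to the daemons themselves.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd
kubeReserved:
  cpu: 100m
  memory: 256Mi
systemReserved:
  cpu: 100m
  memory: 256Mi
# No kubeReservedCgroup / systemReservedCgroup and no kube-reserved /
# system-reserved entries here, so only the pods cgroup is capped.
enforceNodeAllocatable:
  - pods
```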

I spent many hours trying to figure this out; if I misunderstood or made an incorrect assumption, I would be happy for a clarification.

Anyway, can someone please re-open https://github.com/kubernetes-sigs/kubespray/issues/9692 to address this? @cristicalin @yankay? Thanks!

rptaylor avatar Jul 09 '24 06:07 rptaylor

I have tried to fix this in https://github.com/kubernetes-sigs/kubespray/pull/11367.

rptaylor avatar Jul 09 '24 06:07 rptaylor