enhancements icon indicating copy to clipboard operation
enhancements copied to clipboard

KEP-4471: kubeadm: make a control-plane's kubelet talk to the local API Server on kubeadm join.

Open chrischdi opened this issue 1 year ago • 13 comments

  • One-line PR description: adding new KEP
  • Issue link: #4471
  • Other comments:

chrischdi avatar Feb 08 '24 09:02 chrischdi

cc @neolit123 @fabriziopandini @sbueringer :-) PTAL

chrischdi avatar Feb 08 '24 09:02 chrischdi

FYI: I'm on PTO the next days and will be back on 15th. Will start addressing feedback then :-)

chrischdi avatar Feb 08 '24 18:02 chrischdi

Sounds reasonable overall, but Fabrizio and Lubomir are way more familiar with the kubeadm details

sbueringer avatar Feb 09 '24 10:02 sbueringer

Pointing the control-plane's kubelet to the local kube-apiserver may cause some stability issues. When kube-apiserver fails, the local kubelet will crash, and the node will become NotReady, the entire control-plane node cannot handle the new Pod. However, using LB will not cause the entire control-plane node to be NotReady, just some components can not work well. The second point is that upgrading the kubelet binary version seems to be user control, and kubeadm only modifies the kubelet configurations and does not immediately update the binary version of kubelet. https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/#upgrade-kubelet-and-kubectl

When kubeadm join is executed, version skew does not occur, and that only occurs during the upgrade phase.

SataQiu avatar Feb 25 '24 03:02 SataQiu

Pointing the control-plane's kubelet to the local kube-apiserver may cause some stability issues. When kube-apiserver fails, the local kubelet will crash, and the node will become NotReady, the entire control-plane node cannot handle the new Pod.

I agree, I will add that information. One thought: in case of using kubeadm, kube-apiserver is run as static pod anyway, so shouldn't the kubelet be able to bring it up again?

An improvement may be to only use the local endpoint during TLS Bootstrap and make kubeadm change the config afterwards back to the load balancer 🤔

However, using LB will not cause the entire control-plane node to be NotReady, just some components can not work well. The second point is that upgrading the kubelet binary version seems to be user control, and kubeadm only modifies the kubelet configurations and does not immediately update the binary version of kubelet. kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/#upgrade-kubelet-and-kubectl

When kubeadm join is executed, version skew does not occur, and that only occurs during the upgrade phase.

In case of immutable upgrades (e.g. Cluster API) where you update the whole image and the image contains the binary versions: when correctly following the version skew you would have to (example: v1.x -> v1.y):

  • have one image A which has v1.y kubeadm + v1.x kubelet
  • have one image B which has v1.y kubeadm + v1.y kubelet

And upgrade via:

  1. rolling update control-plane nodes to image A
  2. rolling update control-plane nodes to image B

Everyone nowadays skips step 1 with CAPI (because normally it works).

chrischdi avatar Feb 26 '24 07:02 chrischdi

Pointing the control-plane's kubelet to the local kube-apiserver may cause some stability issues. When kube-apiserver fails, the local kubelet will crash, and the node will become NotReady, the entire control-plane node cannot handle the new Pod.

I agree, I will add that information. One thought: in case of using kubeadm, kube-apiserver is run as static pod anyway, so shouldn't the kubelet be able to bring it up again?

this is something that i mentioned/asked on the kubeadm office hours zoom call we had. i think the new way is less HA, and less resilient to apiserver component faliures.

currently with a kubelet pointing at the LB if a local apiserver fails (e.g. due to an internal apiserver bug) the kubelet will try to find a healthy kube-apiserver and the Node will remain healthy, while in parallel it will try to restart the local filing pod. with the new way if the apiserver fails the node will become not ready.

An improvement may be to only use the local endpoint during TLS Bootstrap and make kubeadm change the config afterwards back to the load balancer 🤔

bootstrap to local apiserver makes sense. but if it then points to the LB it can still violate the skew policy, which according to the prior discussions from api-machinery can cause problems. we haven't seen such problems yet, mainly because of k8s kubelet / apiserver stability.

i was thinking that maybe we can let the user control this behavior with an option in the config API. i personally don't like this idea, but it can be added in v1beta4 if people vote +1 for that.

alternatively, the default behavior can be - point to local, but they can manually edit to point to LB if they want more HA.

neolit123 avatar Feb 26 '24 08:02 neolit123

@SataQiu @pacoxu any more comments here?

i was thinking that maybe we can let the user control this behavior with an option in the config API. i personally don't like this idea, but it can be added in v1beta4 if people vote +1 for that.

still not a fan of making it configurable as pointing to the LB can violate kubelet / apiserver skew, but to make it configurable by kubeadm we can do one of these things:

  1. add a new patchtarget kubeletkubeconfig. problem with this is that the user will likely only use this patch target for changing server and it will be a rare use case.
  2. add clusterconfiguration.pointKubeletAtControlPlaneEndpoint false by default, still a rare use case.
  3. document somewhere (maybe in the kubeadm HA docs at k/website) that users can edit a node kubelet.conf kubeconfig file to point to LB and kubeadm or kubelet will not modify the contents of this file for e.g. cert rotation. i.e. the server value will persist.

my vote goes to 3. WDYT?

EDIT: and yes we should still have a FG while the feature graduates.

neolit123 avatar Apr 29 '24 06:04 neolit123

also vote to 3 When designing the kubeadm upgrade apply phases, we should support skipping the kubelet upgrade phase. Then users can upgrade the kubelet version after ensuring that all the control-plane apiserver instances have been upgraded.

Just like kubeadm upgrade node --skip-phases kubelet-config

FYI: It seems that kubeadm just configures kubelet, but not restart the kubelet service. We can state in the documentation that please restart the kubelet service in turn after all the control-plane instances have been upgraded.

SataQiu avatar Apr 29 '24 10:04 SataQiu

also vote to 3

👍

When designing the kubeadm upgrade apply phases, we should support skipping the kubelet upgrade phase. Then users can upgrade the kubelet version after ensuring that all the control-plane apiserver instances have been upgraded.

upgrade does not touch "etc/kubernetes/kubelet.conf" today, so if the user modifies the "server" in there it will persist.

FYI: It seems that kubeadm just configures kubelet, but not restart the kubelet service. We can state in the documentation that please restart the kubelet service in turn after all the control-plane instances have been upgraded.

yes, a kubelet restart will be required. they have to do it one time after changing the value of "server"

neolit123 avatar Apr 29 '24 11:04 neolit123

3. document somewhere (maybe in the kubeadm HA docs at k/website) that users can edit a node kubelet.conf kubeconfig file to point to LB and kubeadm or kubeadm will not modify the contents of this file for e.g. cert rotation. i.e. the server value will persist.

IIUC, kubelet.conf change is a behavior change or a bugfix.

  • Need we notice cluster-api or other kubeadm adapters about this?

vote to 2 and 3. 3 may be the right direction, but 2 is also a good choice to me.

pacoxu avatar Apr 30 '24 08:04 pacoxu

  1. document somewhere (maybe in the kubeadm HA docs at k/website) that users can edit a node kubelet.conf kubeconfig file to point to LB and kubeadm or kubeadm will not modify the contents of this file for e.g. cert rotation. i.e. the server value will persist.

IIUC, kubelet.conf change is a behavior change or a bugfix.

we can say it's a bug fix for potential skew problems that introduces a change in behavior.

  • Need we notice cluster-api or other kubeadm adapters about this?

vote to 2 and 3. 3 may be the right direction, but 2 is also a good choice to me.

we could just document it as our first choice (3), but if users really need better flexibility, we can think about alternatives like 2 or something else.

neolit123 avatar Apr 30 '24 09:04 neolit123

Need we notice cluster-api or other kubeadm adapters about this? vote to 2 and 3. 3 may be the right direction, but 2 is also a good choice to me. we could just document it as our first choice (3), but if users really need better flexibility, we can think about alternatives like 2 or something else.

3 is ok to go. +1 2 can be a later choice if many users really want that.

pacoxu avatar Apr 30 '24 10:04 pacoxu

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chrischdi, neolit123

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar May 07 '24 14:05 k8s-ci-robot

/lgtm

neolit123 avatar May 07 '24 16:05 neolit123