Consider adding more KCP checks on machine status
What would you like to be added (User Story)?
As an operator, I need as much information as possible about the status of my control plane machines
Detailed Description
KCP implements health checks for the following components on the machines it controls: API server, scheduler, controller manager, and etcd.
We should consider adding more components to this list, e.g. kube-proxy and the kubelet (Node). This could provide a useful signal to catch problems during upgrades, e.g. when the control plane comes up because it runs as static pods, but other "regular" pods scheduled on the control plane node do not come up.
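For illustration only, a rough sketch of what one such additional check could look like, assuming a controller-runtime client pointed at the workload cluster; the function name and wiring are hypothetical and don't reflect actual KCP internals:

```go
package health

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// nodeIsReady reports whether the Node backing a control plane Machine
// has a Ready condition with status True (hypothetical helper).
func nodeIsReady(ctx context.Context, workloadClient client.Client, nodeName string) (bool, error) {
	node := &corev1.Node{}
	if err := workloadClient.Get(ctx, client.ObjectKey{Name: nodeName}, node); err != nil {
		return false, fmt.Errorf("getting Node %s: %w", nodeName, err)
	}
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status == corev1.ConditionTrue, nil
		}
	}
	// No Ready condition reported yet: treat the Node as not ready.
	return false, nil
}
```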
Anything else you would like to add?
No response
Label(s) to be applied
/kind feature
/area provider/control-plane-kubeadm
/triage accepted
I think this is important. When we discovered the issue behind https://github.com/kubernetes-sigs/cluster-api/pull/10947, KCP rolled through the entire control plane, and during the upgrade only static pods came up (i.e. kube-proxy and the CNI didn't); not even the Nodes became Ready.
I think this is pretty dangerous behavior; in the worst case it makes the entire control plane unavailable. So it would be great to have some more checks to safeguard against this.
/help
This requires a little bit of research to figure out what kind of checks can be done (and we have to be careful that we don't break existing working upgrade flows).
@sbueringer: This request has been marked as needing help from a contributor.
Guidelines
Please ensure that the issue body includes answers to the following questions:
- Why are we solving this issue?
- To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
- Does this issue have zero to low barrier of entry?
- How can the assignee reach out to you for help?
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.
In response to this:
/help
This requires a little bit of research to figure out what kind of checks can be done (and we have to be careful that we don't break existing working upgrade flows).
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Deprioritize it with /priority important-longterm or /priority backlog
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/triage accepted
Still something worth investigating. But we should be careful to maintain a sane balance between what we observe and the number of un-cached API calls to the remote cluster.
This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Deprioritize it with /priority important-longterm or /priority backlog
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/triage accepted
/priority important-longterm
Hi @sbueringer @fabriziopandini, this is a pretty interesting issue and I would like to work on it.
Is this still considered for investigation? If yes, can you please share any pointers on getting started? Thanks!
@Amulyam24 sorry, but I'm not sure when I'll have time to research what kind of checks can be done (and we have to be careful that we don't break existing working upgrade flows). If someone else can step in before me, feel free to go ahead.
I faced this issue earlier this week: my control plane got rolled out entirely although the CNI was not ready (the image couldn't be pulled because quay.io was down). Luckily I noticed it and could pause the upgrade in time, but that's definitely a very dangerous behavior.
I don't recall seeing this in the past; it could be because the conditions evolved over time.
I personally think it's better to block an upgrade by mistake (false positive) rather than the opposite.
From what I see in the conditions reported by Machines, the Ready condition looks to be the most appropriate, as it summarizes what's reported in other conditions (e.g. infra not ready, CNI not initialized, etc.).
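A minimal sketch of what gating on that condition could look like, using the existing Cluster API conditions utilities; the helper name is made up, and whether Ready (rather than a dedicated new condition) is the right signal is exactly what needs research:

```go
package health

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// machineLooksHealthy is a hypothetical gate: treat a control plane Machine as
// healthy only if its aggregated Ready condition is True.
func machineLooksHealthy(m *clusterv1.Machine) bool {
	return conditions.IsTrue(m, clusterv1.ReadyCondition)
}
```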
I also want to improve this. I'm mostly concerned about potentially breaking users by introducing checks that can lead to deadlocks, so we have to make sure that we get this right.
Another data point: I updated the KubeadmControlPlane resource, and KCP rolled out new Machines. However, I made a mistake in the configuration, and on the new Machines, the API server could not authenticate with my CNI.
This meant that every new API server was accepting requests and reaching etcd, so it was considered healthy by kubeadm and by KCP. However, none of the new API servers could reach any mutating or validating webhooks.
When the API server cannot reach the Cluster API webhooks, users cannot create or update Cluster API resources.
To fix my configuration mistake, I needed to update the KubeadmControlPlane. Because the API servers could not reach the webhooks, I could not.
(I was able to recover from this situation by deleting the Cluster API webhook configuration resources, and updating KubeadmControlPlane, but bypassing webhooks like this is dangerous, and should be avoided.)
In light of this, I think we should take into consideration the idea that a control plane Machine is healthy if and only if the API server can reach the Cluster API mutating/validating webhooks.
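One way such a check might be probed (illustrative only, not an agreed design, and it assumes the webhooks tolerate dry-run requests): issue a dry-run update through the new API server, so the request has to traverse the mutating/validating webhooks without persisting anything; an unreachable webhook would surface as an error.

```go
package health

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// webhooksReachable is a hypothetical probe: it re-applies an existing Cluster
// object with DryRunAll so the admission chain (including the Cluster API
// webhooks) is exercised, but no change is persisted.
func webhooksReachable(ctx context.Context, c client.Client, cluster *clusterv1.Cluster) error {
	return c.Update(ctx, cluster.DeepCopy(), client.DryRunAll)
}
```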
I'd like to work on this.
I'm mostly concerned about potentially breaking users by introducing checks that can lead to deadlocks.
Let's build a shared understanding of what deadlocks are possible.
I'll draft a diagram of the dependencies.