Consider adding more KCP checks on machine status
What would you like to be added (User Story)?
As an operator, I need as much information as possible about the status of my control plane machines
Detailed Description
KCP implements health checks for the following components on the machines it controls: API server, scheduler, controller manager, and etcd.
We should consider adding more components to this list, e.g. kube-proxy and the kubelet (Node). This could provide a useful signal to catch problems during upgrades, e.g. when the control plane comes up because it runs as static pods, but other "regular" pods scheduled on the control plane node do not come up.
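For illustration only, a rough sketch of what one such additional check could look like, assuming a controller-runtime client pointed at the workload cluster; the function name and wiring are hypothetical and don't reflect actual KCP internals:

```go
package health

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// nodeIsReady reports whether the Node backing a control plane Machine
// has a Ready condition with status True (hypothetical helper).
func nodeIsReady(ctx context.Context, workloadClient client.Client, nodeName string) (bool, error) {
	node := &corev1.Node{}
	if err := workloadClient.Get(ctx, client.ObjectKey{Name: nodeName}, node); err != nil {
		return false, fmt.Errorf("getting Node %s: %w", nodeName, err)
	}
	for _, cond := range node.Status.Conditions {
		if cond.Type == corev1.NodeReady {
			return cond.Status == corev1.ConditionTrue, nil
		}
	}
	// No Ready condition reported yet: treat the Node as not ready.
	return false, nil
}
```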
Anything else you would like to add?
No response
Label(s) to be applied
/kind feature
/area provider/control-plane-kubeadm
/triage accepted
I think this is important. When we discovered the issue behind https://github.com/kubernetes-sigs/cluster-api/pull/10947, KCP rolled through the entire control plane, and during the upgrade only static pods came up (i.e. kube-proxy and the CNI didn't); not even the Nodes became Ready.
I think this is pretty dangerous behavior; in the worst case it makes the entire control plane unavailable. So it would be great to have some more checks to safeguard against this.
/help
This requires a little bit of research to figure out what kind of checks can be done (and we have to be careful that we don't break existing working upgrade flows).
@sbueringer: This request has been marked as needing help from a contributor.
Guidelines
Please ensure that the issue body includes answers to the following questions:
- Why are we solving this issue?
- To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
- Does this issue have zero to low barrier of entry?
- How can the assignee reach out to you for help?
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.
In response to this:
/help
This requires a little bit of research to figure out what kind of checks can be done (and we have to be careful that we don't break existing working upgrade flows).
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Deprioritize it with /priority important-longterm or /priority backlog
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/triage accepted
Still something worth investigating. But we should be careful to maintain a sane balance between what we observe and the number of un-cached API calls to the remote cluster.
This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Deprioritize it with /priority important-longterm or /priority backlog
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/triage accepted
/priority important-longterm
Hi @sbueringer @fabriziopandini, this is a pretty interesting issue and I would like to work on it.
Is this still considered for investigation? If yes, can you please share any pointers on getting started? Thanks!
@Amulyam24 sorry, but I'm not sure when I'll have time to research what kind of checks can be done (and we have to be careful that we don't break existing working upgrade flows). If someone else can step in before me, feel free to go ahead.
I faced this issue earlier this week: my control plane got rolled out entirely although the CNI was not ready (the image couldn't be pulled because quay.io was down). Luckily I noticed it and could pause the upgrade in time, but that's definitely a very dangerous behavior.
I don't recall seeing this in the past; it could be because the conditions evolved over time.
I personally think it's better to block an upgrade by mistake (false positive) rather than the opposite.
From what I see in the conditions reported by Machines, the Ready condition looks to be the most appropriate, as it summarizes what's reported in other conditions (e.g. infra not ready, CNI not initialized, etc.).
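A minimal sketch of what gating on that condition could look like, using the existing Cluster API conditions utilities; the helper name is made up, and whether Ready (rather than a dedicated new condition) is the right signal is exactly what needs research:

```go
package health

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// machineLooksHealthy is a hypothetical gate: treat a control plane Machine as
// healthy only if its aggregated Ready condition is True.
func machineLooksHealthy(m *clusterv1.Machine) bool {
	return conditions.IsTrue(m, clusterv1.ReadyCondition)
}
```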
I also want to improve this. I'm mostly concerned about potentially breaking users by introducing checks that can lead to deadlocks, so we have to make sure that we get this right.
Another data point: I updated the KubeadmControlPlane resource, and KCP rolled out new Machines. However, I made a mistake in the configuration, and on the new Machines, the API server could not authenticate with my CNI.
This meant that every new API server was accepting requests and reaching etcd, so it was considered healthy by kubeadm and by KCP. However, none of the new API servers could reach any mutating or validating webhooks.
When the API server cannot reach the Cluster API webhooks, users cannot create or update Cluster API resources.
To fix my configuration mistake, I needed to update the KubeadmControlPlane. Because the API servers could not reach the webhooks, I could not.
(I was able to recover from this situation by deleting the Cluster API webhook configuration resources, and updating KubeadmControlPlane, but bypassing webhooks like this is dangerous, and should be avoided.)
In light of this, I think we should take into consideration the idea that a control plane Machine is healthy if and only if the API server can reach the Cluster API mutating/validating webhooks.
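One way such a check might be probed (illustrative only, not an agreed design, and it assumes the webhooks tolerate dry-run requests): issue a dry-run update through the new API server, so the request has to traverse the mutating/validating webhooks without persisting anything; an unreachable webhook would surface as an error.

```go
package health

import (
	"context"

	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// webhooksReachable is a hypothetical probe: it re-applies an existing Cluster
// object with DryRunAll so the admission chain (including the Cluster API
// webhooks) is exercised, but no change is persisted.
func webhooksReachable(ctx context.Context, c client.Client, cluster *clusterv1.Cluster) error {
	return c.Update(ctx, cluster.DeepCopy(), client.DryRunAll)
}
```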
I'd like to work on this.
I'm mostly concerned about potentially breaking users by introducing checks that can lead to deadlocks.
Let's build a shared understanding of what deadlocks are possible.
I'll draft a diagram of the dependencies.