cluster-api
🐛 Delete out of date machines with unhealthy control plane component conditions when rolling out KCP
What this PR does / why we need it:
Fixes https://github.com/kubernetes-sigs/cluster-api/issues/10093
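The core idea is to make KCP prefer deleting outdated control plane Machines that are already reporting unhealthy control plane component conditions. Below is a minimal Go sketch of such a filter, assuming it would sit next to the existing filters in util/collections; the identifiers (HasUnhealthyControlPlaneComponents, controlPlaneComponentConditions) are illustrative and not necessarily the ones added by this PR.

```go
package sketch

import (
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/collections"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// controlPlaneComponentConditions are the per-Machine conditions KCP sets for
// the static pods and the etcd member running on a control plane node.
var controlPlaneComponentConditions = []clusterv1.ConditionType{
	controlplanev1.MachineAPIServerPodHealthyCondition,
	controlplanev1.MachineControllerManagerPodHealthyCondition,
	controlplanev1.MachineSchedulerPodHealthyCondition,
	controlplanev1.MachineEtcdPodHealthyCondition,
	controlplanev1.MachineEtcdMemberHealthyCondition,
}

// HasUnhealthyControlPlaneComponents returns a filter matching Machines that
// report at least one control plane component condition with status False.
func HasUnhealthyControlPlaneComponents() collections.Func {
	return func(machine *clusterv1.Machine) bool {
		if machine == nil {
			return false
		}
		for _, t := range controlPlaneComponentConditions {
			if conditions.IsFalse(machine, t) {
				return true
			}
		}
		return false
	}
}
```

A filter like this can then be combined with the existing "needs rollout" selection when KCP picks the next Machine to scale down.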
Test
- Create a CAPI cluster with 1 control plane (CP) machine and 1 worker machine.
kubectl get cluster,kcp,machine -n default | grep hw-sks-test-unhealthy-cp
cluster.cluster.x-k8s.io/hw-sks-test-unhealthy-cp Provisioned 8m2s
kubeadmcontrolplane.controlplane.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane hw-sks-test-unhealthy-cp true true 1 1 1 0 8m2s v1.25.15
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane-9nbvm hw-sks-test-unhealthy-cp hw-sks-test-unhealthy-cp-controlplane-9nbvm elf://f688b268-0f09-4e3b-bcfe-b8cda710ab6e Running 7m58s v1.25.15
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-node-5769b6799cxkxcg9-5jtzp hw-sks-test-unhealthy-cp hw-sks-test-unhealthy-cp-node-l7ml5 elf://3d3a4a5e-ec32-4b49-a30a-52b00f972282 Running 8m1s v1.25.15
- Update the KCP by adding a space between two cipher suite names in
KCP.spec.kubeadmConfigSpec.clusterConfiguration.apiServer.extraArgs.tls-cipher-suites.
kubectl get cluster,kcp,machine -n default -l cluster.x-k8s.io/cluster-name=hw-sks-test-unhealthy-cp
NAME PHASE AGE VERSION
cluster.cluster.x-k8s.io/hw-sks-test-unhealthy-cp Provisioned 18m
NAME CLUSTER INITIALIZED API SERVER AVAILABLE REPLICAS READY UPDATED UNAVAILABLE AGE VERSION
kubeadmcontrolplane.controlplane.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane hw-sks-test-unhealthy-cp true true 2 2 1 0 18m v1.25.15
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane-9nbvm hw-sks-test-unhealthy-cp hw-sks-test-unhealthy-cp-controlplane-9nbvm elf://f688b268-0f09-4e3b-bcfe-b8cda710ab6e Running 17m v1.25.15
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane-lwm6b hw-sks-test-unhealthy-cp hw-sks-test-unhealthy-cp-controlplane-lwm6b elf://ebacfd9a-fd18-488f-959a-35a4fe2275fe Running 7m14s v1.25.15
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-node-5769b6799cxkxcg9-5jtzp hw-sks-test-unhealthy-cp hw-sks-test-unhealthy-cp-node-l7ml5 elf://3d3a4a5e-ec32-4b49-a30a-52b00f972282 Running 17m v1.25.15
- The new CP node then becomes Ready, but its kube-apiserver cannot start, so the APIServerPodHealthy condition is set to False on the new CP Machine and the KCP never becomes Ready.
kubectl get machine hw-sks-test-unhealthy-cp-controlplane-lwm6b -n default -ojson | jq '.status.conditions'
[
{
"lastTransitionTime": "2024-02-26T10:20:01Z",
"status": "True",
"type": "Ready"
},
{
"lastTransitionTime": "2024-02-26T10:14:41Z",
"message": "CrashLoopBackOff",
"reason": "PodFailed",
"severity": "Error",
"status": "False",
"type": "APIServerPodHealthy"
},
{
"lastTransitionTime": "2024-02-26T10:12:11Z",
"status": "True",
"type": "BootstrapReady"
},
{
"lastTransitionTime": "2024-02-26T10:14:06Z",
"status": "True",
"type": "ControllerManagerPodHealthy"
},
{
"lastTransitionTime": "2024-02-26T10:14:09Z",
"status": "True",
"type": "EtcdMemberHealthy"
},
{
"lastTransitionTime": "2024-02-26T10:14:07Z",
"status": "True",
"type": "EtcdPodHealthy"
},
{
"lastTransitionTime": "2024-02-26T10:20:01Z",
"status": "True",
"type": "InfrastructureReady"
},
{
"lastTransitionTime": "2024-02-26T10:14:24Z",
"status": "True",
"type": "NodeHealthy"
},
{
"lastTransitionTime": "2024-02-26T10:15:24Z",
"status": "True",
"type": "SchedulerPodHealthy"
}
]
kubectl get -n default kcp hw-sks-test-unhealthy-cp-controlplane -ojson | jq '.status.conditions'
[
{
"lastTransitionTime": "2024-02-26T10:12:12Z",
"message": "Rolling 1 replicas with outdated spec (1 replicas up to date)",
"reason": "RollingUpdateInProgress",
"severity": "Warning",
"status": "False",
"type": "Ready"
},
{
"lastTransitionTime": "2024-02-26T10:03:01Z",
"status": "True",
"type": "Available"
},
{
"lastTransitionTime": "2024-02-26T10:01:27Z",
"status": "True",
"type": "CertificatesAvailable"
},
{
"lastTransitionTime": "2024-02-26T10:14:10Z",
"message": "Following machines are reporting control plane errors: hw-sks-test-unhealthy-cp-controlplane-lwm6b",
"reason": "ControlPlaneComponentsUnhealthy",
"severity": "Error",
"status": "False",
"type": "ControlPlaneComponentsHealthy"
},
{
"lastTransitionTime": "2024-02-26T10:14:10Z",
"status": "True",
"type": "EtcdClusterHealthy"
},
{
"lastTransitionTime": "2024-02-26T10:01:48Z",
"status": "True",
"type": "MachinesCreated"
},
{
"lastTransitionTime": "2024-02-26T10:26:33Z",
"status": "True",
"type": "MachinesReady"
},
{
"lastTransitionTime": "2024-02-26T10:12:12Z",
"message": "Rolling 1 replicas with outdated spec (1 replicas up to date)",
"reason": "RollingUpdateInProgress",
"severity": "Warning",
"status": "False",
"type": "MachinesSpecUpToDate"
},
{
"lastTransitionTime": "2024-02-26T10:12:12Z",
"message": "Scaling down control plane to 1 replicas (actual 2)",
"reason": "ScalingDown",
"severity": "Warning",
"status": "False",
"type": "Resized"
}
]
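As a side note, the stuck state shown in the two condition dumps above can also be detected programmatically from the KCP object with CAPI's conditions util. The snippet below is only an illustration and assumes kcp is a *controlplanev1.KubeadmControlPlane that has already been fetched.

```go
package sketch

import (
	"fmt"

	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1beta1"
	"sigs.k8s.io/cluster-api/util/conditions"
)

// reportControlPlaneComponentErrors prints the message of the
// ControlPlaneComponentsHealthy condition when it is False, i.e. the
// "Following machines are reporting control plane errors: ..." summary
// shown in the dump above.
func reportControlPlaneComponentErrors(kcp *controlplanev1.KubeadmControlPlane) {
	if conditions.IsFalse(kcp, controlplanev1.ControlPlaneComponentsHealthyCondition) {
		fmt.Println(conditions.GetMessage(kcp, controlplanev1.ControlPlaneComponentsHealthyCondition))
	}
}
```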
- Update the KCP again: remove the space previously added in
KCP.spec.kubeadmConfigSpec.clusterConfiguration.apiServer.extraArgs.tls-cipher-suites and also delete one of the cipher suites, so that the current configuration still differs from the one originally created.
- KCP first deletes the CP Machine that is abnormal and outdated (see the scale-down sketch after the output below).
kubectl get machine -n default -l cluster.x-k8s.io/cluster-name=hw-sks-test-unhealthy-cp
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane-9nbvm hw-sks-test-unhealthy-cp hw-sks-test-unhealthy-cp-controlplane-9nbvm elf://f688b268-0f09-4e3b-bcfe-b8cda710ab6e Running 34m v1.25.15
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane-lwm6b hw-sks-test-unhealthy-cp hw-sks-test-unhealthy-cp-controlplane-lwm6b elf://ebacfd9a-fd18-488f-959a-35a4fe2275fe Deleting 24m v1.25.15
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-node-5769b6799cxkxcg9-5jtzp hw-sks-test-unhealthy-cp hw-sks-test-unhealthy-cp-node-l7ml5 elf://3d3a4a5e-ec32-4b49-a30a-52b00f972282 Running 34m v1.25.15
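The scale-down preference exercised in this step can be sketched as follows. needsRollout and hasUnhealthyControlPlaneComponents are placeholder filters (the first sketch above shows one possible shape for the latter); this is not the PR's exact helper, just the intended ordering: outdated and unhealthy Machines first, then the remaining outdated Machines.

```go
package sketch

import (
	"sigs.k8s.io/cluster-api/util/collections"
)

// machinesToScaleDown returns the preferred deletion candidates among the
// given control plane Machines: outdated Machines reporting unhealthy control
// plane components first, then the remaining outdated Machines, and only if
// nothing is outdated the full set.
func machinesToScaleDown(
	machines collections.Machines,
	needsRollout collections.Func,
	hasUnhealthyControlPlaneComponents collections.Func,
) collections.Machines {
	outdated := machines.Filter(needsRollout)
	if outdated.Len() == 0 {
		// Nothing is outdated; fall back to the full set.
		return machines
	}
	// Prefer outdated Machines that are also reporting unhealthy control plane
	// components, so a broken new replica does not block the rollout forever.
	if unhealthy := outdated.Filter(hasUnhealthyControlPlaneComponents); unhealthy.Len() > 0 {
		return unhealthy
	}
	return outdated
}
```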
- Then a new CP Machine is created.
kubectl get machine -n default -l cluster.x-k8s.io/cluster-name=hw-sks-test-unhealthy-cp
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
hw-sks-test-unhealthy-cp-controlplane-9nbvm hw-sks-test-unhealthy-cp hw-sks-test-unhealthy-cp-controlplane-9nbvm elf://f688b268-0f09-4e3b-bcfe-b8cda710ab6e Running 37m v1.25.15
hw-sks-test-unhealthy-cp-controlplane-d8ffd hw-sks-test-unhealthy-cp hw-sks-test-unhealthy-cp-controlplane-d8ffd elf://955ce3f7-3fde-4119-a885-09fc3ccd4e6e Running 2m9s v1.25.15
hw-sks-test-unhealthy-cp-node-5769b6799cxkxcg9-5jtzp hw-sks-test-unhealthy-cp hw-sks-test-unhealthy-cp-node-l7ml5 elf://3d3a4a5e-ec32-4b49-a30a-52b00f972282 Running 37m v1.25.15
- Finally, KCP deletes the CP Machine that is Ready but outdated.
kubectl get machine -n default -l cluster.x-k8s.io/cluster-name=hw-sks-test-unhealthy-cp -w
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
hw-sks-test-unhealthy-cp-controlplane-9nbvm hw-sks-test-unhealthy-cp hw-sks-test-unhealthy-cp-controlplane-9nbvm elf://f688b268-0f09-4e3b-bcfe-b8cda710ab6e Deleting 40m v1.25.15
hw-sks-test-unhealthy-cp-controlplane-d8ffd hw-sks-test-unhealthy-cp hw-sks-test-unhealthy-cp-controlplane-d8ffd elf://955ce3f7-3fde-4119-a885-09fc3ccd4e6e Running 4m37s v1.25.15
hw-sks-test-unhealthy-cp-node-5769b6799cxkxcg9-5jtzp hw-sks-test-unhealthy-cp hw-sks-test-unhealthy-cp-node-l7ml5 elf://3d3a4a5e-ec32-4b49-a30a-52b00f972282 Running 40m v1.25.15
- The Cluster, KCP, and Machines are all Ready.
kubectl get cluster,kcp,machine -n default -l cluster.x-k8s.io/cluster-name=hw-sks-test-unhealthy-cp
NAME PHASE AGE VERSION
cluster.cluster.x-k8s.io/hw-sks-test-unhealthy-cp Provisioned 41m
NAME CLUSTER INITIALIZED API SERVER AVAILABLE REPLICAS READY UPDATED UNAVAILABLE AGE VERSION
kubeadmcontrolplane.controlplane.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane hw-sks-test-unhealthy-cp true true 1 1 1 0 41m v1.25.15
NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane-d8ffd hw-sks-test-unhealthy-cp hw-sks-test-unhealthy-cp-controlplane-d8ffd elf://955ce3f7-3fde-4119-a885-09fc3ccd4e6e Running 6m18s v1.25.15
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-node-5769b6799cxkxcg9-5jtzp hw-sks-test-unhealthy-cp hw-sks-test-unhealthy-cp-node-l7ml5 elf://3d3a4a5e-ec32-4b49-a30a-52b00f972282 Running 41m v1.25.15
/area provider/control-plane-kubeadm
Hi @Levi080513. Thanks for your PR.
I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.
Once the patch is verified, the new status will be reflected by the ok-to-test label.
I understand the commands that are listed here.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/assign
cc @vincepri @enxebre for an opinion
LGTM! Waiting for CAPI team to take a look. Thanks!
LGTM label has been added.
/approve
/approve cancel
/assign @fabriziopandini
@fabriziopandini It would be great if you have time to review the PR. 😄
Thank you, nice fix!!
Ready from my side
/lgtm
/assign @fabriziopandini
LGTM label has been added.
Please squash the commits: commit 1 can be the proposal update, commit 2 the code changes.
I would just squash the commits on merge via tide. Avoids potential mistakes during squash. (but fine either way for me of course)
Thanks @sbueringer @neolit123.
+1 on this if it works well, and the PR title should then be the final squashed commit message? The commit and review history is also preserved.
I would just squash the commits on merge via tide. Avoids potential mistakes during squash.
/assign @fabriziopandini
Great work and thank you for taking care of all our comments, appreciated /lgtm
LGTM label has been added.
Thank you very much!!
/approve
/cherry-pick release-1.7
@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.7 in a new PR and assign it to you.
In response to this:
/cherry-pick release-1.7
/cherry-pick release-1.6
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: sbueringer
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [sbueringer]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.6 in a new PR and assign it to you.
In response to this:
/cherry-pick release-1.6
/cherry-pick release-1.5
@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.5 in a new PR and assign it to you.
In response to this:
/cherry-pick release-1.5
Let's see if we get lucky with the automated cherry-picks
@sbueringer: new pull request created: #10421
In response to this:
/cherry-pick release-1.7
@sbueringer: #10196 failed to apply on top of branch "release-1.6":
Applying: Prioritize deletion of abnormal outdated CP machines when scaling down KCP
Using index info to reconstruct a base tree...
M controlplane/kubeadm/internal/control_plane.go
M controlplane/kubeadm/internal/controllers/remediation_test.go
M controlplane/kubeadm/internal/controllers/scale.go
M controlplane/kubeadm/internal/controllers/scale_test.go
M docs/proposals/20191017-kubeadm-based-control-plane.md
M util/collections/machine_filters.go
Falling back to patching base and 3-way merge...
Auto-merging util/collections/machine_filters.go
CONFLICT (content): Merge conflict in util/collections/machine_filters.go
Auto-merging docs/proposals/20191017-kubeadm-based-control-plane.md
Auto-merging controlplane/kubeadm/internal/controllers/scale_test.go
Auto-merging controlplane/kubeadm/internal/controllers/scale.go
Auto-merging controlplane/kubeadm/internal/controllers/remediation_test.go
Auto-merging controlplane/kubeadm/internal/control_plane.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Prioritize deletion of abnormal outdated CP machines when scaling down KCP
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
In response to this:
/cherry-pick release-1.6
@sbueringer: #10196 failed to apply on top of branch "release-1.5":
Applying: Prioritize deletion of abnormal outdated CP machines when scaling down KCP
Using index info to reconstruct a base tree...
M controlplane/kubeadm/internal/control_plane.go
M controlplane/kubeadm/internal/controllers/remediation_test.go
M controlplane/kubeadm/internal/controllers/scale.go
M controlplane/kubeadm/internal/controllers/scale_test.go
M docs/proposals/20191017-kubeadm-based-control-plane.md
M util/collections/machine_filters.go
Falling back to patching base and 3-way merge...
Auto-merging util/collections/machine_filters.go
CONFLICT (content): Merge conflict in util/collections/machine_filters.go
Auto-merging docs/proposals/20191017-kubeadm-based-control-plane.md
Auto-merging controlplane/kubeadm/internal/controllers/scale_test.go
Auto-merging controlplane/kubeadm/internal/controllers/scale.go
Auto-merging controlplane/kubeadm/internal/controllers/remediation_test.go
Auto-merging controlplane/kubeadm/internal/control_plane.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Prioritize deletion of abnormal outdated CP machines when scaling down KCP
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
In response to this:
/cherry-pick release-1.5
Thanks a lot @sbueringer @fabriziopandini for your review and patience! The auto cherry-pick failed for release-1.6 and release-1.5. My team member @Levi080513 can create a new PR for release-1.6 separately if needed.
Sounds good! Feel free to go ahead with the cherry-pick(s), also to 1.5 if you want to; if you don't need it there, it's fine to only cherry-pick into 1.6.