
🐛 Delete out of date machines with unhealthy control plane component conditions when rolling out KCP

Open Levi080513 opened this pull request 1 year ago • 8 comments

What this PR does / why we need it:


Fix https://github.com/kubernetes-sigs/cluster-api/issues/10093
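The change, per the PR title and the cherry-pick logs further down (which touch util/collections/machine_filters.go and controlplane/kubeadm/internal/controllers/scale.go), makes KCP prefer deleting outdated CP machines whose control-plane component conditions are unhealthy. A minimal sketch of that selection logic, using simplified stand-in types rather than the real clusterv1 API:

```go
package main

import "fmt"

// Condition and Machine are simplified stand-ins for the clusterv1 types.
type Condition struct {
	Type   string
	Status string // "True" or "False"
}

type Machine struct {
	Name       string
	Conditions []Condition
}

// controlPlaneConditionTypes are the component health conditions KCP sets on CP machines.
var controlPlaneConditionTypes = []string{
	"APIServerPodHealthy",
	"ControllerManagerPodHealthy",
	"SchedulerPodHealthy",
	"EtcdPodHealthy",
	"EtcdMemberHealthy",
}

// hasUnhealthyControlPlaneComponent reports whether any component condition is False.
func hasUnhealthyControlPlaneComponent(m Machine) bool {
	for _, c := range m.Conditions {
		for _, t := range controlPlaneConditionTypes {
			if c.Type == t && c.Status == "False" {
				return true
			}
		}
	}
	return false
}

// pickMachineForScaleDown prefers an unhealthy machine among the outdated set,
// so a broken replacement is removed before a still-working outdated machine.
func pickMachineForScaleDown(outdated []Machine) Machine {
	for _, m := range outdated {
		if hasUnhealthyControlPlaneComponent(m) {
			return m
		}
	}
	return outdated[0]
}

func main() {
	healthy := Machine{Name: "cp-9nbvm", Conditions: []Condition{{Type: "APIServerPodHealthy", Status: "True"}}}
	broken := Machine{Name: "cp-lwm6b", Conditions: []Condition{{Type: "APIServerPodHealthy", Status: "False"}}}
	// The unhealthy outdated machine is chosen first.
	fmt.Println(pickMachineForScaleDown([]Machine{healthy, broken}).Name) // cp-lwm6b
}
```

This is only an illustration of the prioritization; the actual filter names and signatures live in the PR's diff.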

Test

  1. Create a CAPI cluster with 1 control-plane (CP) Machine and 1 worker Machine.
kubectl get cluster,kcp,machine -n default | grep hw-sks-test-unhealthy-cp
cluster.cluster.x-k8s.io/hw-sks-test-unhealthy-cp   Provisioned   8m2s   
kubeadmcontrolplane.controlplane.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane   hw-sks-test-unhealthy-cp   true          true                   1          1       1         0             8m2s   v1.25.15
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane-9nbvm             hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-controlplane-9nbvm   elf://f688b268-0f09-4e3b-bcfe-b8cda710ab6e   Running        7m58s   v1.25.15
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-node-5769b6799cxkxcg9-5jtzp    hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-node-l7ml5           elf://3d3a4a5e-ec32-4b49-a30a-52b00f972282   Running        8m1s    v1.25.15

  2. Update the KCP by adding a space between two cipher names in KCP.spec.kubeadmConfigSpec.clusterConfiguration.apiServer.extraArgs.tls-cipher-suites to trigger a rollout.
kubectl get cluster,kcp,machine -n default -l cluster.x-k8s.io/cluster-name=hw-sks-test-unhealthy-cp
NAME                                                PHASE         AGE   VERSION
cluster.cluster.x-k8s.io/hw-sks-test-unhealthy-cp   Provisioned   18m   

NAME                                                                                      CLUSTER                    INITIALIZED   API SERVER AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE   VERSION
kubeadmcontrolplane.controlplane.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane   hw-sks-test-unhealthy-cp   true          true                   2          2       1         0             18m   v1.25.15

NAME                                                                            CLUSTER                    NODENAME                                      PROVIDERID                                   PHASE     AGE     VERSION
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane-9nbvm            hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-controlplane-9nbvm   elf://f688b268-0f09-4e3b-bcfe-b8cda710ab6e   Running   17m     v1.25.15
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane-lwm6b            hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-controlplane-lwm6b   elf://ebacfd9a-fd18-488f-959a-35a4fe2275fe   Running   7m14s   v1.25.15
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-node-5769b6799cxkxcg9-5jtzp   hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-node-l7ml5           elf://3d3a4a5e-ec32-4b49-a30a-52b00f972282   Running   17m     v1.25.15
  3. The new CP Node becomes Ready, but its API server cannot start, so the APIServerPodHealthy condition is set to False on the CP Machine and the KCP never becomes Ready.
kubectl get machine hw-sks-test-unhealthy-cp-controlplane-lwm6b -n default -ojson | jq '.status.conditions'
[
  {
    "lastTransitionTime": "2024-02-26T10:20:01Z",
    "status": "True",
    "type": "Ready"
  },
  {
    "lastTransitionTime": "2024-02-26T10:14:41Z",
    "message": "CrashLoopBackOff",
    "reason": "PodFailed",
    "severity": "Error",
    "status": "False",
    "type": "APIServerPodHealthy"
  },
  {
    "lastTransitionTime": "2024-02-26T10:12:11Z",
    "status": "True",
    "type": "BootstrapReady"
  },
  {
    "lastTransitionTime": "2024-02-26T10:14:06Z",
    "status": "True",
    "type": "ControllerManagerPodHealthy"
  },
  {
    "lastTransitionTime": "2024-02-26T10:14:09Z",
    "status": "True",
    "type": "EtcdMemberHealthy"
  },
  {
    "lastTransitionTime": "2024-02-26T10:14:07Z",
    "status": "True",
    "type": "EtcdPodHealthy"
  },
  {
    "lastTransitionTime": "2024-02-26T10:20:01Z",
    "status": "True",
    "type": "InfrastructureReady"
  },
  {
    "lastTransitionTime": "2024-02-26T10:14:24Z",
    "status": "True",
    "type": "NodeHealthy"
  },
  {
    "lastTransitionTime": "2024-02-26T10:15:24Z",
    "status": "True",
    "type": "SchedulerPodHealthy"
  }
]

kubectl get -n default kcp hw-sks-test-unhealthy-cp-controlplane -ojson | jq '.status.conditions'
[
  {
    "lastTransitionTime": "2024-02-26T10:12:12Z",
    "message": "Rolling 1 replicas with outdated spec (1 replicas up to date)",
    "reason": "RollingUpdateInProgress",
    "severity": "Warning",
    "status": "False",
    "type": "Ready"
  },
  {
    "lastTransitionTime": "2024-02-26T10:03:01Z",
    "status": "True",
    "type": "Available"
  },
  {
    "lastTransitionTime": "2024-02-26T10:01:27Z",
    "status": "True",
    "type": "CertificatesAvailable"
  },
  {
    "lastTransitionTime": "2024-02-26T10:14:10Z",
    "message": "Following machines are reporting control plane errors: hw-sks-test-unhealthy-cp-controlplane-lwm6b",
    "reason": "ControlPlaneComponentsUnhealthy",
    "severity": "Error",
    "status": "False",
    "type": "ControlPlaneComponentsHealthy"
  },
  {
    "lastTransitionTime": "2024-02-26T10:14:10Z",
    "status": "True",
    "type": "EtcdClusterHealthy"
  },
  {
    "lastTransitionTime": "2024-02-26T10:01:48Z",
    "status": "True",
    "type": "MachinesCreated"
  },
  {
    "lastTransitionTime": "2024-02-26T10:26:33Z",
    "status": "True",
    "type": "MachinesReady"
  },
  {
    "lastTransitionTime": "2024-02-26T10:12:12Z",
    "message": "Rolling 1 replicas with outdated spec (1 replicas up to date)",
    "reason": "RollingUpdateInProgress",
    "severity": "Warning",
    "status": "False",
    "type": "MachinesSpecUpToDate"
  },
  {
    "lastTransitionTime": "2024-02-26T10:12:12Z",
    "message": "Scaling down control plane to 1 replicas (actual 2)",
    "reason": "ScalingDown",
    "severity": "Warning",
    "status": "False",
    "type": "Resized"
  }
]
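The ControlPlaneComponentsHealthy condition above (including its exact message) is an aggregate of the per-machine component conditions shown earlier. A simplified, hypothetical sketch of such an aggregation (not KCP's actual implementation):

```go
package main

import (
	"fmt"
	"strings"
)

type condition struct {
	Type   string
	Status string
}

type machine struct {
	Name       string
	Conditions []condition
}

// summarizeControlPlaneComponents rolls the per-machine "*Healthy" component
// conditions up into a single status plus a message naming unhealthy machines,
// mirroring the ControlPlaneComponentsHealthy condition shown above.
func summarizeControlPlaneComponents(machines []machine) (status, message string) {
	var unhealthy []string
	for _, m := range machines {
		for _, c := range m.Conditions {
			if strings.HasSuffix(c.Type, "Healthy") && c.Status == "False" {
				unhealthy = append(unhealthy, m.Name)
				break
			}
		}
	}
	if len(unhealthy) == 0 {
		return "True", ""
	}
	return "False", "Following machines are reporting control plane errors: " + strings.Join(unhealthy, ", ")
}

func main() {
	status, msg := summarizeControlPlaneComponents([]machine{
		{Name: "hw-sks-test-unhealthy-cp-controlplane-9nbvm",
			Conditions: []condition{{Type: "APIServerPodHealthy", Status: "True"}}},
		{Name: "hw-sks-test-unhealthy-cp-controlplane-lwm6b",
			Conditions: []condition{{Type: "APIServerPodHealthy", Status: "False"}}},
	})
	fmt.Println(status)
	fmt.Println(msg)
}
```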

  4. Update the KCP again: remove the spaces previously added to KCP.spec.kubeadmConfigSpec.clusterConfiguration.apiServer.extraArgs.tls-cipher-suites and also drop one of the suites, so that the current configuration differs from the one originally created.
  5. KCP first deletes the CP Machine that is both unhealthy and outdated.
kubectl get machine -n default -l cluster.x-k8s.io/cluster-name=hw-sks-test-unhealthy-cp
NAME                                                                            CLUSTER                    NODENAME                                      PROVIDERID                                   PHASE      AGE   VERSION
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane-9nbvm            hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-controlplane-9nbvm   elf://f688b268-0f09-4e3b-bcfe-b8cda710ab6e   Running    34m   v1.25.15
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane-lwm6b            hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-controlplane-lwm6b   elf://ebacfd9a-fd18-488f-959a-35a4fe2275fe   Deleting   24m   v1.25.15
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-node-5769b6799cxkxcg9-5jtzp   hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-node-l7ml5           elf://3d3a4a5e-ec32-4b49-a30a-52b00f972282   Running    34m   v1.25.15

  6. KCP then creates a new CP Machine.
 kubectl get machine -n default -l cluster.x-k8s.io/cluster-name=hw-sks-test-unhealthy-cp 
NAME                                                   CLUSTER                    NODENAME                                      PROVIDERID                                   PHASE     AGE    VERSION
hw-sks-test-unhealthy-cp-controlplane-9nbvm            hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-controlplane-9nbvm   elf://f688b268-0f09-4e3b-bcfe-b8cda710ab6e   Running   37m    v1.25.15
hw-sks-test-unhealthy-cp-controlplane-d8ffd            hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-controlplane-d8ffd   elf://955ce3f7-3fde-4119-a885-09fc3ccd4e6e   Running   2m9s   v1.25.15
hw-sks-test-unhealthy-cp-node-5769b6799cxkxcg9-5jtzp   hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-node-l7ml5           elf://3d3a4a5e-ec32-4b49-a30a-52b00f972282   Running   37m    v1.25.15
  7. Finally, KCP deletes the Machine that is Ready but outdated.
kubectl get machine -n default -l cluster.x-k8s.io/cluster-name=hw-sks-test-unhealthy-cp -w
NAME                                                   CLUSTER                    NODENAME                                      PROVIDERID                                   PHASE      AGE     VERSION
hw-sks-test-unhealthy-cp-controlplane-9nbvm            hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-controlplane-9nbvm   elf://f688b268-0f09-4e3b-bcfe-b8cda710ab6e   Deleting   40m     v1.25.15
hw-sks-test-unhealthy-cp-controlplane-d8ffd            hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-controlplane-d8ffd   elf://955ce3f7-3fde-4119-a885-09fc3ccd4e6e   Running    4m37s   v1.25.15
hw-sks-test-unhealthy-cp-node-5769b6799cxkxcg9-5jtzp   hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-node-l7ml5           elf://3d3a4a5e-ec32-4b49-a30a-52b00f972282   Running    40m     v1.25.15

  8. The Cluster, KCP, and Machines are all Ready.
kubectl get cluster,kcp,machine -n default -l cluster.x-k8s.io/cluster-name=hw-sks-test-unhealthy-cp
NAME                                                PHASE         AGE   VERSION
cluster.cluster.x-k8s.io/hw-sks-test-unhealthy-cp   Provisioned   41m   

NAME                                                                                      CLUSTER                    INITIALIZED   API SERVER AVAILABLE   REPLICAS   READY   UPDATED   UNAVAILABLE   AGE   VERSION
kubeadmcontrolplane.controlplane.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane   hw-sks-test-unhealthy-cp   true          true                   1          1       1         0             41m   v1.25.15

NAME                                                                            CLUSTER                    NODENAME                                      PROVIDERID                                   PHASE     AGE     VERSION
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-controlplane-d8ffd            hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-controlplane-d8ffd   elf://955ce3f7-3fde-4119-a885-09fc3ccd4e6e   Running   6m18s   v1.25.15
machine.cluster.x-k8s.io/hw-sks-test-unhealthy-cp-node-5769b6799cxkxcg9-5jtzp   hw-sks-test-unhealthy-cp   hw-sks-test-unhealthy-cp-node-l7ml5           elf://3d3a4a5e-ec32-4b49-a30a-52b00f972282   Running   41m     v1.25.15

/area provider/control-plane-kubeadm

Levi080513 avatar Feb 26 '24 10:02 Levi080513

Hi @Levi080513. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Feb 26 '24 10:02 k8s-ci-robot

/assign

cc @vincepri @enxebre for an opinion

fabriziopandini avatar Feb 29 '24 13:02 fabriziopandini

LGTM! Waiting for CAPI team to take a look. Thanks!

jessehu avatar Mar 01 '24 02:03 jessehu

LGTM label has been added.

Git tree hash: 2166dbedad86ce72d7e21481bf3b8e73d5003793

k8s-ci-robot avatar Mar 07 '24 15:03 k8s-ci-robot

/approve

vincepri avatar Mar 08 '24 14:03 vincepri

/approve cancel

/assign @fabriziopandini

vincepri avatar Mar 08 '24 14:03 vincepri

@fabriziopandini It would be great if you have time to review the PR. 😄

Levi080513 avatar Mar 13 '24 12:03 Levi080513

Thank you, nice fix!!

Ready from my side

/lgtm

/assign @fabriziopandini

sbueringer avatar Apr 05 '24 14:04 sbueringer

LGTM label has been added.

Git tree hash: faded8dcc0035c243e0158dca4601d3c3aab60da

k8s-ci-robot avatar Apr 05 '24 14:04 k8s-ci-robot

> squash the commits please. 1 can be updating proposal, 2 can be the code changes.

I would just squash the commits on merge via tide. Avoids potential mistakes during squash. (but fine either way for me of course)

sbueringer avatar Apr 05 '24 14:04 sbueringer

Thanks @sbueringer @neolit123.
+1 on this if it works well; in that case, would the PR title become the final squashed commit message? The commit and review history would also be preserved.

> I would just squash the commits on merge via tide. Avoids potential mistakes during squash.

jessehu avatar Apr 06 '24 00:04 jessehu

/assign @fabriziopandini

sbueringer avatar Apr 08 '24 18:04 sbueringer

Great work and thank you for taking care of all our comments, appreciated /lgtm

fabriziopandini avatar Apr 11 '24 09:04 fabriziopandini

LGTM label has been added.

Git tree hash: 9734518eccbaf837f4c5e0ba1127232c4bbca143

k8s-ci-robot avatar Apr 11 '24 09:04 k8s-ci-robot

Thank you very much!!

/approve

sbueringer avatar Apr 11 '24 11:04 sbueringer

/cherry-pick release-1.7

sbueringer avatar Apr 11 '24 11:04 sbueringer

@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.7 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/cherry-pick release-1.6

sbueringer avatar Apr 11 '24 11:04 sbueringer

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sbueringer

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot avatar Apr 11 '24 11:04 k8s-ci-robot

@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.6 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

/cherry-pick release-1.5

sbueringer avatar Apr 11 '24 11:04 sbueringer

@sbueringer: once the present PR merges, I will cherry-pick it on top of release-1.5 in a new PR and assign it to you.

In response to this:

/cherry-pick release-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Let's see if we get lucky with the automated cherry-picks

sbueringer avatar Apr 11 '24 11:04 sbueringer

@sbueringer: new pull request created: #10421

In response to this:

/cherry-pick release-1.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer: #10196 failed to apply on top of branch "release-1.6":

Applying: Prioritize deletion of abnormal outdated CP machines when scaling down KCP
Using index info to reconstruct a base tree...
M	controlplane/kubeadm/internal/control_plane.go
M	controlplane/kubeadm/internal/controllers/remediation_test.go
M	controlplane/kubeadm/internal/controllers/scale.go
M	controlplane/kubeadm/internal/controllers/scale_test.go
M	docs/proposals/20191017-kubeadm-based-control-plane.md
M	util/collections/machine_filters.go
Falling back to patching base and 3-way merge...
Auto-merging util/collections/machine_filters.go
CONFLICT (content): Merge conflict in util/collections/machine_filters.go
Auto-merging docs/proposals/20191017-kubeadm-based-control-plane.md
Auto-merging controlplane/kubeadm/internal/controllers/scale_test.go
Auto-merging controlplane/kubeadm/internal/controllers/scale.go
Auto-merging controlplane/kubeadm/internal/controllers/remediation_test.go
Auto-merging controlplane/kubeadm/internal/control_plane.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Prioritize deletion of abnormal outdated CP machines when scaling down KCP
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-1.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sbueringer: #10196 failed to apply on top of branch "release-1.5":

Applying: Prioritize deletion of abnormal outdated CP machines when scaling down KCP
Using index info to reconstruct a base tree...
M	controlplane/kubeadm/internal/control_plane.go
M	controlplane/kubeadm/internal/controllers/remediation_test.go
M	controlplane/kubeadm/internal/controllers/scale.go
M	controlplane/kubeadm/internal/controllers/scale_test.go
M	docs/proposals/20191017-kubeadm-based-control-plane.md
M	util/collections/machine_filters.go
Falling back to patching base and 3-way merge...
Auto-merging util/collections/machine_filters.go
CONFLICT (content): Merge conflict in util/collections/machine_filters.go
Auto-merging docs/proposals/20191017-kubeadm-based-control-plane.md
Auto-merging controlplane/kubeadm/internal/controllers/scale_test.go
Auto-merging controlplane/kubeadm/internal/controllers/scale.go
Auto-merging controlplane/kubeadm/internal/controllers/remediation_test.go
Auto-merging controlplane/kubeadm/internal/control_plane.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Prioritize deletion of abnormal outdated CP machines when scaling down KCP
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Thanks a lot @sbueringer @fabriziopandini for your review and patience! The auto cherry-pick failed for release-1.6 and release-1.5. My team member @Levi080513 can create a new PR for release-1.6 separately if needed.

jessehu avatar Apr 11 '24 11:04 jessehu

Sounds good! Feel free to go ahead with the cherry-pick(s) (also to 1.5 if you want; if you don't need it there, it's fine to only cherry-pick into 1.6).

sbueringer avatar Apr 11 '24 12:04 sbueringer