cluster-api
A cluster with 5 control-plane nodes does not recover when losing 2 nodes at the same time
What steps did you take and what happened?
The steps below were performed manually as part of a high availability test:
- Create a cluster with 5 control-plane nodes and 0 (zero) worker nodes
- Drop two of them after they have finished reconciling and are fully running, e.g. by powering off and destroying their VMs (see the quorum sketch after this list)
- A MachineHealthCheck is configured and reacts quickly, identifying the missing nodes
- The VM operations were performed using the vSphere client
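For context, the expectation behind this test is plain etcd quorum arithmetic: a 5-member cluster has a quorum of 3, so it should tolerate 2 simultaneous member failures. A minimal sketch of that arithmetic (not cluster-api code, just the majority rule etcd documents):

```go
package main

import "fmt"

// quorum returns the majority needed for an etcd cluster of size n.
func quorum(n int) int { return n/2 + 1 }

// faultTolerance returns how many members can fail while quorum is preserved.
func faultTolerance(n int) int { return n - quorum(n) }

func main() {
	members, failed := 5, 2
	fmt.Printf("quorum(%d) = %d, tolerated failures = %d\n",
		members, quorum(members), faultTolerance(members))
	// With 5 members and 2 simultaneous failures, 3 healthy members remain,
	// which still meets the quorum of 3, so KCP should be able to remediate
	// both failed machines one after the other.
	fmt.Println("quorum kept after failures:", members-failed >= quorum(members))
}
```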
KCP seems to start the remediation of the first failed node and then gets stuck, without finishing it and without starting the remediation of the second one.
Some related logs:
I0119 12:48:50.143423 1 remediation.go:424] "etcd cluster projected after remediation of test-cluster1-srskl" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/test-cluster1" namespace="default" name="test-cluster1" reconcileID=1d51dec1-3547-42d1-bdec-a2ebad24afb2 Cluster="default/test-cluster1" healthyMembers=[test-cluster1-xcb42 (test-cluster1-xcb42) test-cluster1-vw754 (test-cluster1-vw754) test-cluster1-4rsnz (test-cluster1-4rsnz)] unhealthyMembers=[test-cluster1-cqd5t (test-cluster1-cqd5t)] targetTotalMembers=4 targetQuorum=3 targetUnhealthyMembers=1 canSafelyRemediate=true
I0119 12:49:39.391449 1 scale.go:204] "Waiting for control plane to pass preflight checks" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/test-cluster1" namespace="default" name="test-cluster1" reconcileID=7a7a1216-0554-461c-b6a1-35704e964132 Cluster="default/test-cluster1" failures="[Machine test-cluster1-cqd5t reports APIServerPodHealthy condition is false (Error, Missing node), Machine test-cluster1-cqd5t reports ControllerManagerPodHealthy condition is false (Error, Missing node), Machine test-cluster1-cqd5t reports SchedulerPodHealthy condition is false (Error, Missing node), Machine test-cluster1-cqd5t reports EtcdPodHealthy condition is false (Error, Missing node), Machine test-cluster1-cqd5t reports EtcdMemberHealthy condition is unknown (Failed to connect to the etcd pod on the test-cluster1-cqd5t node: could not establish a connection to any etcd node: unable to create etcd client: context deadline exceeded)]"
I0119 12:49:56.036901 1 remediation.go:101] "Another remediation is already in progress. Skipping remediation." controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/test-cluster1" namespace="default" name="test-cluster1" reconcileID=e60ffc9a-6c94-4f20-9451-4967235caddf Cluster="default/test-cluster1" Machine="default/test-cluster1-cqd5
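The first log line shows the numbers KCP projects before remediating: removing test-cluster1-srskl would leave targetTotalMembers=4 with targetQuorum=3 and targetUnhealthyMembers=1, and since the 3 healthy members still meet that quorum it reports canSafelyRemediate=true. Below is a rough sketch of that check reconstructed only from the logged fields, not the actual remediation.go implementation:

```go
package main

import "fmt"

// canSafelyRemediate mirrors the fields KCP logs: after projecting the removal
// of the machine being remediated, the remaining healthy members must still
// reach the quorum of the projected cluster size.
func canSafelyRemediate(healthyMembers, unhealthyMembers int) bool {
	targetTotalMembers := healthyMembers + unhealthyMembers // members left after the removal
	targetQuorum := targetTotalMembers/2 + 1                // majority of the projected cluster
	return targetTotalMembers-unhealthyMembers >= targetQuorum
}

func main() {
	// Values taken from the log line above: 3 healthy members and 1 unhealthy
	// member (test-cluster1-cqd5t) after projecting the removal of test-cluster1-srskl.
	fmt.Println(canSafelyRemediate(3, 1)) // prints true, matching canSafelyRemediate=true
}
```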
What did you expect to happen?
A cluster with 5 control-plane nodes should be able to lose two of them. I expected the reconciliation to finish successfully and the cluster to recover.
Cluster API version
v1.5.4
Kubernetes version
v1.27.7
Anything else you would like to add?
No response
Label(s) to be applied
/kind bug /area control-plane
This issue is currently awaiting triage.
CAPI contributors will take a look as soon as possible, apply one of the triage/*
labels and provide further guidance.
/priority important-soon