cluster-api
A cluster with 5 control-plane nodes does not recover when losing 2 nodes at the same time
What steps did you take and what happened?
The steps below were performed manually as part of a high availability test:
- Create a cluster with 5 control-plane nodes and 0 (zero) worker nodes
- Drop two of them after they have finished reconciling and are fully running, e.g. by powering off and destroying their VMs (see the quorum sketch after this list)
- A MachineHealthCheck is configured and reacts quickly, identifying the missing nodes
- The VM operations were performed using the vSphere client
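For context, the expectation behind this test is plain etcd quorum arithmetic: a 5-member cluster has a quorum of 3, so it should tolerate 2 simultaneous member failures. A minimal sketch of that arithmetic (not cluster-api code, just the majority rule etcd documents):

```go
package main

import "fmt"

// quorum returns the majority needed for an etcd cluster of size n.
func quorum(n int) int { return n/2 + 1 }

// faultTolerance returns how many members can fail while quorum is preserved.
func faultTolerance(n int) int { return n - quorum(n) }

func main() {
	members, failed := 5, 2
	fmt.Printf("quorum(%d) = %d, tolerated failures = %d\n",
		members, quorum(members), faultTolerance(members))
	// With 5 members and 2 simultaneous failures, 3 healthy members remain,
	// which still meets the quorum of 3, so KCP should be able to remediate
	// both failed machines one after the other.
	fmt.Println("quorum kept after failures:", members-failed >= quorum(members))
}
```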
KCP seems to start the remediation of the first failed node and then gets stuck, without finishing it and without starting the remediation of the second one.
Some related logs:
I0119 12:48:50.143423 1 remediation.go:424] "etcd cluster projected after remediation of test-cluster1-srskl" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/test-cluster1" namespace="default" name="test-cluster1" reconcileID=1d51dec1-3547-42d1-bdec-a2ebad24afb2 Cluster="default/test-cluster1" healthyMembers=[test-cluster1-xcb42 (test-cluster1-xcb42) test-cluster1-vw754 (test-cluster1-vw754) test-cluster1-4rsnz (test-cluster1-4rsnz)] unhealthyMembers=[test-cluster1-cqd5t (test-cluster1-cqd5t)] targetTotalMembers=4 targetQuorum=3 targetUnhealthyMembers=1 canSafelyRemediate=true
I0119 12:49:39.391449 1 scale.go:204] "Waiting for control plane to pass preflight checks" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/test-cluster1" namespace="default" name="test-cluster1" reconcileID=7a7a1216-0554-461c-b6a1-35704e964132 Cluster="default/test-cluster1" failures="[Machine test-cluster1-cqd5t reports APIServerPodHealthy condition is false (Error, Missing node), Machine test-cluster1-cqd5t reports ControllerManagerPodHealthy condition is false (Error, Missing node), Machine test-cluster1-cqd5t reports SchedulerPodHealthy condition is false (Error, Missing node), Machine test-cluster1-cqd5t reports EtcdPodHealthy condition is false (Error, Missing node), Machine test-cluster1-cqd5t reports EtcdMemberHealthy condition is unknown (Failed to connect to the etcd pod on the test-cluster1-cqd5t node: could not establish a connection to any etcd node: unable to create etcd client: context deadline exceeded)]"
I0119 12:49:56.036901 1 remediation.go:101] "Another remediation is already in progress. Skipping remediation." controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="default/test-cluster1" namespace="default" name="test-cluster1" reconcileID=e60ffc9a-6c94-4f20-9451-4967235caddf Cluster="default/test-cluster1" Machine="default/test-cluster1-cqd5
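The first log line shows the numbers KCP projects before remediating: removing test-cluster1-srskl would leave targetTotalMembers=4 with targetQuorum=3 and targetUnhealthyMembers=1, and since the 3 healthy members still meet that quorum it reports canSafelyRemediate=true. Below is a rough sketch of that check reconstructed only from the logged fields, not the actual remediation.go implementation:

```go
package main

import "fmt"

// canSafelyRemediate mirrors the fields KCP logs: after projecting the removal
// of the machine being remediated, the remaining healthy members must still
// reach the quorum of the projected cluster size.
func canSafelyRemediate(healthyMembers, unhealthyMembers int) bool {
	targetTotalMembers := healthyMembers + unhealthyMembers // members left after the removal
	targetQuorum := targetTotalMembers/2 + 1                // majority of the projected cluster
	return targetTotalMembers-unhealthyMembers >= targetQuorum
}

func main() {
	// Values taken from the log line above: 3 healthy members and 1 unhealthy
	// member (test-cluster1-cqd5t) after projecting the removal of test-cluster1-srskl.
	fmt.Println(canSafelyRemediate(3, 1)) // prints true, matching canSafelyRemediate=true
}
```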
What did you expect to happen?
A cluster with 5 control-plane nodes should be able to lose two of them. I expected the reconciliation to finish successfully and the cluster to recover.
Cluster API version
v1.5.4
Kubernetes version
v1.27.7
Anything else you would like to add?
No response
Label(s) to be applied
/kind bug /area control-plane
This issue is currently awaiting triage.
CAPI contributors will take a look as soon as possible, apply one of the triage/*
labels and provide further guidance.
/priority important-soon