cluster-api
[e2e test flake] Timed waiting for all control plane replicas to be updated
Which jobs are flaking?
- periodic-cluster-api-e2e-mink8s-main
- periodic-cluster-api-e2e-main
Which tests are flaking?
When testing KCP adoption Should adopt up-to-date control plane Machines without modification
Since when has it been flaking?
2024-11-05 is the first instance of this flake I can find: https://storage.googleapis.com/k8s-triage/index.html?date=2024-11-06&text=Timed%20waiting%20for%20all%20control%20plane%20replicas%20to%20be%20updated&job=.periodic-cluster-api-e2e.
Now it appears quite consistently: https://storage.googleapis.com/k8s-triage/index.html?date=2024-11-20&text=Timed%20waiting%20for%20all%20control%20plane%20replicas%20to%20be%20updated&job=.periodic-cluster-api-e2e.
Testgrid link
https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-main
Reason for failure (if possible)
TBD (have not had time to investigate thoroughly)
Anything else we need to know?
example failed job: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-mink8s-main/1858697443558821888
I have filtered the commits on main to around the time this flake popped up: https://github.com/kubernetes-sigs/cluster-api/commits/main/?since=2024-11-03&until=2024-11-05
I have not had time to investigate further yet.
Label(s) to be applied
/kind flake
/area e2e-testing
I confirm this flake. /assign
Did we fix something related to this? I don't see any failures after 23-11: https://storage.googleapis.com/k8s-triage/index.html?text=Timed%20waiting%20for%20all%20control%20plane%20replicas%20to%20be%20updated&job=.-cluster-api-.&xjob=.-provider-.%7C.-operator-.
New occurrences now.
It seems this is not gone.
@sivchari 👋 are you still working on this?
Hi @cahillsf , I didn't have time to work on this. I'll try it since next week. Or should I unassign myself ? Thanks.
If you have time, please go ahead; even if you don't have a fix, it would be helpful to add your findings. If you're not going to have time, feel free to unassign and we can find an assignee from the CI release team for this cycle.
Hi @cahillsf, I'll unassign myself for now. /unassign
/help
@chrischdi: This request has been marked as needing help from a contributor.
Guidelines
Please ensure that the issue body includes answers to the following questions:
- Why are we solving this issue?
- To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
- Does this issue have zero to low barrier of entry?
- How can the assignee reach out to you for help?
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.
In response to this:
/help
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Don't know how helpful this will be, but I've spent some time looking into the most recent failure, logs here: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-main/1878784832272601088
First, my understanding of the basic structure of the test:
- Create a cluster.
- Create a Machine in that cluster with the proper naming to become a control plane.
- Add a KubeadmControlPlane object to the cluster.
- Wait for the KubeadmControlPlane object to be ready.
- Wait for CAPI to realize that the earlier Machine would work for the cluster's KubeadmControlPlane and take ownership of it.
- Delete the cluster.
Please correct me if I'm wrong or missing important minutiae here.
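For context on the step that times out: the failure message in the issue title comes from a polling wait on the KubeadmControlPlane status. Below is a minimal sketch of what such a wait looks like, assuming a controller-runtime client and Gomega; waitForKCPReplicas is a hypothetical helper name, not the e2e framework's actual code.

```go
package e2e

import (
	"context"
	"time"

	. "github.com/onsi/gomega"
	"k8s.io/apimachinery/pkg/types"
	controlplanev1 "sigs.k8s.io/cluster-api/controlplane/kubeadm/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForKCPReplicas (hypothetical) polls the KubeadmControlPlane until
// its status reports the expected number of updated and ready replicas,
// failing the test once the timeout elapses.
func waitForKCPReplicas(ctx context.Context, c client.Client, key types.NamespacedName, want int32) {
	Eventually(func(g Gomega) {
		kcp := &controlplanev1.KubeadmControlPlane{}
		g.Expect(c.Get(ctx, key, kcp)).To(Succeed())
		g.Expect(kcp.Status.UpdatedReplicas).To(Equal(want))
		g.Expect(kcp.Status.ReadyReplicas).To(Equal(want))
	}, 15*time.Minute, 10*time.Second).Should(Succeed())
}
```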
So this test seems to be failing on the "wait for the KubeadmControlPlane object to be ready" step. Here are the problems with the KCP object according to its status fields:
- Available is False due to "Waiting for etcd to report the list of members"
- EtcdClusterHealthy is Unknown due to "Waiting for Cluster spec.controlPlaneEndpoint to be set"
- ControlPlaneComponentsHealthy is Unknown due to "Waiting for Cluster spec.controlPlaneEndpoint to be set"
- Scaling up is True due to "Scaling up from 0 to 1 replicas"
The controlPlaneEndpoint issues are confusing to me, as the controlPlaneEndpoint is set in all the objects I know it should be set in, but I'm probably missing one, or this is something local to the KCP object in a way I don't understand.
Were this a repeatable error, I'd try to get the e2e test to leave the machine around, then get onto it and inspect the cloud-init output for errors. However, as this is a flake that I can't force to occur locally, I'm at a loss as to how to proceed with debugging. Looking for ideas/help.
Digging a bit into it:
KCP logs for that cluster:
❯ cat kcp.log | grep kcp-adoption --color | grep 'controller="kubeadmcontrolplane"' --color
I0113 13:03:56.425859 1 controller.go:185] "Cluster Controller has not yet set OwnerRef" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="kcp-adoption-hwn0kw/kcp-adoption-0am4s3-control-plane" namespace="kcp-adoption-hwn0kw" name="kcp-adoption-0am4s3-control-plane" reconcileID="3a38ac56-3e0e-42aa-a775-93765317ec57"
I0113 13:03:56.674880 1 controller.go:371] "Reconcile KubeadmControlPlane" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="kcp-adoption-hwn0kw/kcp-adoption-0am4s3-control-plane" namespace="kcp-adoption-hwn0kw" name="kcp-adoption-0am4s3-control-plane" reconcileID="ba5fae1f-a1bf-43f1-a24b-6000f0dc2153" Cluster="kcp-adoption-hwn0kw/kcp-adoption-0am4s3"
I0113 13:03:56.770301 1 controller.go:421] "Cluster does not yet have a ControlPlaneEndpoint defined" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="kcp-adoption-hwn0kw/kcp-adoption-0am4s3-control-plane" namespace="kcp-adoption-hwn0kw" name="kcp-adoption-0am4s3-control-plane" reconcileID="ba5fae1f-a1bf-43f1-a24b-6000f0dc2153" Cluster="kcp-adoption-hwn0kw/kcp-adoption-0am4s3"
I0113 13:03:56.972648 1 controller.go:371] "Reconcile KubeadmControlPlane" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="kcp-adoption-hwn0kw/kcp-adoption-0am4s3-control-plane" namespace="kcp-adoption-hwn0kw" name="kcp-adoption-0am4s3-control-plane" reconcileID="2b50f01b-20b8-444d-876e-76a51ab394ec" Cluster="kcp-adoption-hwn0kw/kcp-adoption-0am4s3"
I0113 13:03:56.980947 1 controller.go:421] "Cluster does not yet have a ControlPlaneEndpoint defined" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="kcp-adoption-hwn0kw/kcp-adoption-0am4s3-control-plane" namespace="kcp-adoption-hwn0kw" name="kcp-adoption-0am4s3-control-plane" reconcileID="2b50f01b-20b8-444d-876e-76a51ab394ec" Cluster="kcp-adoption-hwn0kw/kcp-adoption-0am4s3"
I0113 13:18:57.435618 1 controller.go:597] "Reconcile KubeadmControlPlane deletion" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="kcp-adoption-hwn0kw/kcp-adoption-0am4s3-control-plane" namespace="kcp-adoption-hwn0kw" name="kcp-adoption-0am4s3-control-plane" reconcileID="e3fdf5bc-0037-4da8-9c8b-479dea8c6f57" Cluster="kcp-adoption-hwn0kw/kcp-adoption-0am4s3"
Timeline:
- ...
- 13:03:56.980947: KCP reconciles and logs "Cluster does not yet have a ControlPlaneEndpoint defined". KCP in this case returns in a way that it needs an event or resync to reconcile again. The happy path is to get an event for the Cluster object because the ControlPlaneEndpoint was added.
- 13:18:57.435618: Next KCP reconcile, with the deletion.
So, taking a look at when .spec.controlPlaneEndpoint was added to the Cluster object:
- According to the Cluster object's managedFields, this was set at 2025-01-13T13:03:57Z, so after the above reconcile.
This should emit an event to KCP, leading it to reconcile the KCP object and get back into the flow. But we don't get a Reconcile KubeadmControlPlane log line, and the status does not get any update either.
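As background on why the event should arrive: the KCP controller watches Cluster objects and maps each Cluster event to the KCP referenced by spec.controlPlaneRef (wired via handler.EnqueueRequestsFromMapFunc). A paraphrased sketch of such a mapper, not the exact upstream wiring; clusterToKCP is a hypothetical name:

```go
package controllers

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// clusterToKCP maps a Cluster event to a reconcile request for the
// KubeadmControlPlane it references, so a Cluster update (like setting
// spec.controlPlaneEndpoint) should re-enqueue the KCP.
func clusterToKCP(_ context.Context, o client.Object) []reconcile.Request {
	cluster, ok := o.(*clusterv1.Cluster)
	if !ok || cluster.Spec.ControlPlaneRef == nil {
		return nil
	}
	return []reconcile.Request{{NamespacedName: types.NamespacedName{
		Namespace: cluster.Spec.ControlPlaneRef.Namespace,
		Name:      cluster.Spec.ControlPlaneRef.Name,
	}}}
}
```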
So I think KCP returns before even running this defer, which would otherwise lead to status updates too: https://github.com/kubernetes-sigs/cluster-api/blob/main/controlplane/kubeadm/internal/controllers/controller.go#L212
The only relevant return above should be this one, where we return because we wait for the adoption of Machines to happen (which is not the responsibility of the KCP controller): https://github.com/kubernetes-sigs/cluster-api/blob/main/controlplane/kubeadm/internal/controllers/controller.go#L209
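The Go semantics behind that suspicion, as a small self-contained illustration (not the upstream code): a defer only runs if execution reaches the defer statement, so a return taken while waiting for adoption skips a status patch that is registered further down.

```go
package main

import "fmt"

// reconcile illustrates the suspected flow: if we return while waiting
// for Machine adoption, the deferred status patch below is never
// registered, so no status update (and no Reconcile log line) happens.
func reconcile(waitingForAdoption bool) {
	if waitingForAdoption {
		return // suspected early return, taken before the defer is set up
	}
	// Only reached on the non-adoption path; stands in for the deferred
	// patch at controller.go#L212.
	defer fmt.Println("patch KCP status")
	fmt.Println("continue reconciling")
}

func main() {
	reconcile(true)  // prints nothing: the status patch is skipped
	reconcile(false) // prints "continue reconciling", then "patch KCP status"
}
```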
To verify that, let's take a look at the Machine, and especially its ownerReferences:
ownerReferences:
- apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
name: kcp-adoption-0am4s3
uid: 46f2c0d3-626d-4dc5-a595-e6a81677825c
We only see a Cluster owner reference. Here we wait for adoptable Machines, which (here) means Machines that do not have a controller owner reference, and that is the case for this Machine. For comparison, here is a controller-owned KCP Machine from a test run that was green (xref):
ownerReferences:
- apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
name: kcp-adoption-doaliz
uid: df759ccf-2113-4eb5-9613-020aead2bf98
- apiVersion: controlplane.cluster.x-k8s.io/v1beta1
blockOwnerDeletion: true
controller: true
kind: KubeadmControlPlane
name: kcp-adoption-doaliz-control-plane
uid: 63b7645b-167e-46ca-a28c-a02562c422fb
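For illustration, "adoptable" can be checked the same way in code: apimachinery's metav1.GetControllerOf returns the owner reference marked controller: true, or nil if there is none. A runnable sketch using metadata mimicking the failing Machine above (the actual controller code may differ):

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
	// Metadata mimicking the failing run's Machine: only a Cluster owner
	// reference, with no entry carrying controller: true.
	machine := &metav1.PartialObjectMetadata{
		ObjectMeta: metav1.ObjectMeta{
			OwnerReferences: []metav1.OwnerReference{{
				APIVersion: "cluster.x-k8s.io/v1beta1",
				Kind:       "Cluster",
				Name:       "kcp-adoption-0am4s3",
			}},
		},
	}

	// GetControllerOf returns the owner reference with controller: true,
	// or nil if none exists; nil is what makes the Machine adoptable.
	if metav1.GetControllerOf(machine) == nil {
		fmt.Println("no controller owner reference: Machine is adoptable")
	}
}
```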
So the next step: why is that owner reference missing in the failing example? There could be different reasons:
- We somehow missed the update event on the Cluster which set the .spec.controlPlaneEndpoint
- Or it reconciled, but returned in some way such that nothing happened (e.g. while trying to adopt?)
One idea to try to test it: create a PR that:
- runs only this test, multiple times in a row until it fails (and no other tests); see the sketch below
- once we have that:
- we could add some more logging on that PR to get a step further
- or dump resources more often, at different stages
- or increase log levels in the CAPI or KCP controllers to see what happened (or what did not happen)
Definitely not an easy one to figure out.
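For the repro-PR idea above, a sketch of one way to run only this spec repeatedly: Ginkgo v2's FIt focuses a single spec, and running the suite with ginkgo --until-it-fails reruns it until a failure. The names below are placeholders, not the actual suite:

```go
package e2e_test

import (
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

func TestFlakeRepro(t *testing.T) {
	RegisterFailHandler(Fail)
	RunSpecs(t, "kcp adoption flake repro")
}

// FIt focuses Ginkgo on this one spec; running the suite with
// `ginkgo --until-it-fails` then repeats only this test until it flakes,
// at which point the extra logging or resource dumps can be inspected.
var _ = Describe("KCP adoption", func() {
	FIt("Should adopt up-to-date control plane Machines without modification", func() {
		// ... body of the existing e2e spec would go here ...
	})
})
```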
The only relevant return above should be this one where we return because we wait for the adoption of Machines to happen (which is not the responsibility of the KCP controller)
@chrischdi, are you sure that it's not KCP's responsibility?
I think it is, plus the name of the test is:
Should adopt up-to-date control plane Machines without modification
Generic undated search link for this error: https://storage.googleapis.com/k8s-triage/index.html?text=Resource%20versions%20didn't%20stay%20stable&job=.-cluster-api-.e2e.&xjob=.-provider-.%7C.-operator-
This is still occurring frequently.
Hi, I see this flake in release-1.9. Earlier it was happening on main; is it fixed on main?
https://storage.googleapis.com/k8s-triage/index.html?text=Timed%20waiting%20for%20all%20control%20plane%20replicas%20to%20be%20updated&job=.-cluster-api-.&xjob=.-provider-.%7C.*-operator-
@arshadd-b I see some (fairly) recent changes to this specific test ("Should adopt up-to-date control plane Machines without modification") in more recent releases that are not in the release-1.9 branch:
- https://github.com/kubernetes-sigs/cluster-api/commit/cabafc38007bcb2991dd0038fd048c2366a0d2cd
- https://github.com/kubernetes-sigs/cluster-api/commit/ed4e6b28e87e45cfd8072b857529290f295366cf
- https://github.com/kubernetes-sigs/cluster-api/commit/671164eef28f901eb7cff7ceaca2d51aed377ea2
- https://github.com/kubernetes-sigs/cluster-api/commit/e7ca5f4c17a7273320101e43c3b5fd100c533515
Assuming that this is a test-code issue, the only thing that I can see is this change: https://github.com/kubernetes-sigs/cluster-api/commit/cabafc38007bcb2991dd0038fd048c2366a0d2cd
Does not look like a test code issue :)
Yeah, it doesn't look like it's related to the test. What I am thinking is: we check the logs from the artifacts to see if we can find something there, or, as mentioned in the comment https://github.com/kubernetes-sigs/cluster-api/issues/11454#issuecomment-2615910311, raise a dummy PR to try to reproduce it and add more logs.