
[e2e test flake] Timed waiting for all control plane replicas to be updated

Open cahillsf opened this issue 1 year ago • 14 comments

Which jobs are flaking?

  • periodic-cluster-api-e2e-mink8s-main
  • periodic-cluster-api-e2e-main

Which tests are flaking?

When testing KCP adoption Should adopt up-to-date control plane Machines without modification

Since when has it been flaking?

2024-11-05 is the first instance of this flake I can find: https://storage.googleapis.com/k8s-triage/index.html?date=2024-11-06&text=Timed%20waiting%20for%20all%20control%20plane%20replicas%20to%20be%20updated&job=.periodic-cluster-api-e2e.

now it appears quite consistently: https://storage.googleapis.com/k8s-triage/index.html?date=2024-11-20&text=Timed%20waiting%20for%20all%20control%20plane%20replicas%20to%20be%20updated&job=.periodic-cluster-api-e2e.

Testgrid link

https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-main

Reason for failure (if possible)

TBD (have not had time to investigate thoroughly)

Anything else we need to know?

example failed job: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-mink8s-main/1858697443558821888

have filtered commits on main to around the time this flake popped up: https://github.com/kubernetes-sigs/cluster-api/commits/main/?since=2024-11-03&until=2024-11-05

have not had time to investigate further yet

Label(s) to be applied

/kind flake

cahillsf avatar Nov 20 '24 18:11 cahillsf

/area e2e-testing

cahillsf avatar Nov 20 '24 18:11 cahillsf

I confirm this flake. /assign

sivchari avatar Nov 21 '24 04:11 sivchari

did we fix something related to this? I don't see any failures after Nov 23: https://storage.googleapis.com/k8s-triage/index.html?text=Timed%20waiting%20for%20all%20control%20plane%20replicas%20to%20be%20updated&job=.-cluster-api-.&xjob=.-provider-.%7C.-operator-.

adilGhaffarDev avatar Nov 26 '24 08:11 adilGhaffarDev

New occurrences now.

Seems it's not gone.

chrischdi avatar Nov 27 '24 13:11 chrischdi

@sivchari 👋 are you still working on this?

cahillsf avatar Jan 09 '25 20:01 cahillsf

Hi @cahillsf, I didn't have time to work on this. I'll try to get to it next week. Or should I unassign myself? Thanks.

sivchari avatar Jan 10 '25 04:01 sivchari

Hi @cahillsf, I didn't have time to work on this. I'll try to get to it next week. Or should I unassign myself? Thanks.

If you have time, please go ahead; even if you don't have a fix, it would be helpful to add your findings. If you're not going to have time, feel free to unassign and we can find an assignee from the CI release team for this cycle.

cahillsf avatar Jan 13 '25 18:01 cahillsf

Hi @cahillsf, I'll unassign myself for now. /unassign

sivchari avatar Jan 15 '25 01:01 sivchari

/help

chrischdi avatar Jan 22 '25 14:01 chrischdi

@chrischdi: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jan 22 '25 14:01 k8s-ci-robot

Don't know how helpful this will be, but I've spent some time looking into the most recent failure, logs here: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-main/1878784832272601088

First my understanding of the basic structure of the test:

  • create a cluster
  • create a machine in that cluster with the proper naming to become a control plane
  • then add a kubeadmcontrolplane object to the cluster
  • Wait for the kubeadmcontrolplane object to be ready
  • Wait for CAPI to realize that the earlier machine would work for the cluster's kubeadmcontrolplane and take ownership of it
  • delete the cluster.

Please correct me if I'm wrong or missing important minutiae here.

So this test seems to be failing on the "wait for the kubeadmcontrolplane object to be ready" step. Here are the problems with the kcp object according to its status fields:

  • Available is False due to "Waiting for etcd to report the list of members"
  • EtcdClusterHealthy is Unknown due to "Waiting for Cluster spec.controlPlaneEndpoint to be set"
  • ControlPlaneComponentsHealthy is Unknown due to "Waiting for Cluster spec.controlPlaneEndpoint to be set"
  • Scaling up is True due to "Scaling up from 0 to 1 replicas"

The controlPlaneEndpoint issues are confusing to me as the controlPlaneEndpoint is set in all the objects that I know it should be set in, but I'm probably missing one or this is something local to the kcp object in a way I don't understand.

Were this a repeatable error, I'd try to get the e2e test to leave the machine around, then get onto it and inspect the cloud-init output for errors. However, as this is a flake that I can't force to occur locally, I'm at a loss as to how to proceed with debugging. Looking for ideas/help.

cprivitere avatar Jan 24 '25 21:01 cprivitere

Digging a bit into it:

KCP logs for that cluster:

❯ cat kcp.log | grep kcp-adoption --color | grep 'controller="kubeadmcontrolplane"' --color
I0113 13:03:56.425859       1 controller.go:185] "Cluster Controller has not yet set OwnerRef" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="kcp-adoption-hwn0kw/kcp-adoption-0am4s3-control-plane" namespace="kcp-adoption-hwn0kw" name="kcp-adoption-0am4s3-control-plane" reconcileID="3a38ac56-3e0e-42aa-a775-93765317ec57"
I0113 13:03:56.674880       1 controller.go:371] "Reconcile KubeadmControlPlane" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="kcp-adoption-hwn0kw/kcp-adoption-0am4s3-control-plane" namespace="kcp-adoption-hwn0kw" name="kcp-adoption-0am4s3-control-plane" reconcileID="ba5fae1f-a1bf-43f1-a24b-6000f0dc2153" Cluster="kcp-adoption-hwn0kw/kcp-adoption-0am4s3"
I0113 13:03:56.770301       1 controller.go:421] "Cluster does not yet have a ControlPlaneEndpoint defined" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="kcp-adoption-hwn0kw/kcp-adoption-0am4s3-control-plane" namespace="kcp-adoption-hwn0kw" name="kcp-adoption-0am4s3-control-plane" reconcileID="ba5fae1f-a1bf-43f1-a24b-6000f0dc2153" Cluster="kcp-adoption-hwn0kw/kcp-adoption-0am4s3"
I0113 13:03:56.972648       1 controller.go:371] "Reconcile KubeadmControlPlane" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="kcp-adoption-hwn0kw/kcp-adoption-0am4s3-control-plane" namespace="kcp-adoption-hwn0kw" name="kcp-adoption-0am4s3-control-plane" reconcileID="2b50f01b-20b8-444d-876e-76a51ab394ec" Cluster="kcp-adoption-hwn0kw/kcp-adoption-0am4s3"
I0113 13:03:56.980947       1 controller.go:421] "Cluster does not yet have a ControlPlaneEndpoint defined" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="kcp-adoption-hwn0kw/kcp-adoption-0am4s3-control-plane" namespace="kcp-adoption-hwn0kw" name="kcp-adoption-0am4s3-control-plane" reconcileID="2b50f01b-20b8-444d-876e-76a51ab394ec" Cluster="kcp-adoption-hwn0kw/kcp-adoption-0am4s3"
I0113 13:18:57.435618       1 controller.go:597] "Reconcile KubeadmControlPlane deletion" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="kcp-adoption-hwn0kw/kcp-adoption-0am4s3-control-plane" namespace="kcp-adoption-hwn0kw" name="kcp-adoption-0am4s3-control-plane" reconcileID="e3fdf5bc-0037-4da8-9c8b-479dea8c6f57" Cluster="kcp-adoption-hwn0kw/kcp-adoption-0am4s3"

Timeline:

  • ...
  • 13:03:56.980947 : KCP Reconciles and logs Cluster does not yet have a ControlPlaneEndpoint defined
    • KCP in this case returns in a way that requires an event or resync to reconcile again. The happy path is to get an event for the Cluster object once the ControlPlaneEndpoint is added.
  • 13:18:57.435618 : Next KCP reconcile with the deletion

So taking a look when .spec.controlPlaneEndpoint was added to the Cluster object:

  • According to the Cluster object's managedFields, this was set at 2025-01-13T13:03:57Z, so after the above reconcile

This should emit an event that triggers a reconcile of the KCP object and gets it back into the flow. But we don't get a Reconcile KubeadmControlPlane log line, and the status does not get any update either.
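The expected mechanism is, roughly, a Cluster→KCP event mapping: an update on the Cluster (e.g. the ControlPlaneEndpoint being set) should be mapped to a reconcile request for the referenced control plane. A minimal, stdlib-only sketch of that idea follows; the real code uses controller-runtime watches and CAPI's actual API types, so the structs and field names here are illustrative only:

```go
// Simplified model of mapping a Cluster event to a reconcile request for
// its control plane object. Not CAPI's real types: Cluster, Endpoint, and
// Request below are hypothetical stand-ins for illustration.
package main

import "fmt"

type Endpoint struct {
	Host string
	Port int
}

type Cluster struct {
	Namespace, Name      string
	ControlPlaneRefName  string // name of the referenced KubeadmControlPlane
	ControlPlaneEndpoint Endpoint
}

type Request struct{ Namespace, Name string }

// mapClusterToKCP mimics a controller-runtime MapFunc: given a Cluster
// event, return the reconcile request(s) for its control plane, if any.
func mapClusterToKCP(c Cluster) []Request {
	if c.ControlPlaneRefName == "" {
		return nil
	}
	return []Request{{Namespace: c.Namespace, Name: c.ControlPlaneRefName}}
}

func main() {
	c := Cluster{
		Namespace:            "kcp-adoption-hwn0kw",
		Name:                 "kcp-adoption-0am4s3",
		ControlPlaneRefName:  "kcp-adoption-0am4s3-control-plane",
		ControlPlaneEndpoint: Endpoint{Host: "172.18.0.3", Port: 6443},
	}
	for _, r := range mapClusterToKCP(c) {
		// The enqueued request is what should drive the missing reconcile.
		fmt.Printf("%s/%s\n", r.Namespace, r.Name)
	}
}
```

If this mapping fires but no "Reconcile KubeadmControlPlane" log line appears, the event was either never delivered or dropped before reaching the reconciler, which matches the two hypotheses below.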

So I think KCP returns before even running this defer which would lead to status updates too: https://github.com/kubernetes-sigs/cluster-api/blob/main/controlplane/kubeadm/internal/controllers/controller.go#L212

The only relevant return above should be this one where we return because we wait for the adoption of Machines to happen (which is not the responsibility of the KCP controller): https://github.com/kubernetes-sigs/cluster-api/blob/main/controlplane/kubeadm/internal/controllers/controller.go#L209

To verify that: let's take a look at the Machine and especially the ownerReferences part:

  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1beta1
    kind: Cluster
    name: kcp-adoption-0am4s3
    uid: 46f2c0d3-626d-4dc5-a595-e6a81677825c

We only see a Cluster owner reference, but here we wait for adoptable machines, which (here) means Machines that do not have a controller owner reference. That is the case here. For comparison, here is a controller-owned KCP Machine from a test run which was green (xref):

  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1beta1
    kind: Cluster
    name: kcp-adoption-doaliz
    uid: df759ccf-2113-4eb5-9613-020aead2bf98
  - apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: KubeadmControlPlane
    name: kcp-adoption-doaliz-control-plane
    uid: 63b7645b-167e-46ca-a28c-a02562c422fb
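The adoptability check being discussed boils down to "does any owner reference have controller: true". A stdlib-only sketch of that check follows; it mirrors what apimachinery's `metav1.GetControllerOf` does, but with illustrative local types rather than the real API types:

```go
// Sketch of the "adoptable Machine" check: a Machine is adoptable when
// none of its ownerReferences is marked as the controller.
// OwnerReference/Machine here are simplified stand-ins, not the k8s types.
package main

import "fmt"

type OwnerReference struct {
	APIVersion string
	Kind       string
	Name       string
	Controller *bool // only set (to true) on the controlling owner
}

type Machine struct {
	Name            string
	OwnerReferences []OwnerReference
}

// hasControllerOwner scans the owner references for one with
// Controller set to true, like metav1.GetControllerOf.
func hasControllerOwner(m Machine) bool {
	for _, ref := range m.OwnerReferences {
		if ref.Controller != nil && *ref.Controller {
			return true
		}
	}
	return false
}

func main() {
	yes := true
	// Machine from the flaky run: only a Cluster owner reference.
	flaky := Machine{Name: "flaky", OwnerReferences: []OwnerReference{
		{APIVersion: "cluster.x-k8s.io/v1beta1", Kind: "Cluster", Name: "kcp-adoption-0am4s3"},
	}}
	// Machine from the green run: additionally owned by the KCP.
	green := Machine{Name: "green", OwnerReferences: []OwnerReference{
		{APIVersion: "cluster.x-k8s.io/v1beta1", Kind: "Cluster", Name: "kcp-adoption-doaliz"},
		{APIVersion: "controlplane.cluster.x-k8s.io/v1beta1", Kind: "KubeadmControlPlane",
			Name: "kcp-adoption-doaliz-control-plane", Controller: &yes},
	}}
	fmt.Println("flaky adoptable:", !hasControllerOwner(flaky)) // true: KCP never adopted it
	fmt.Println("green adoptable:", !hasControllerOwner(green)) // false: already adopted
}
```

This is consistent with the failure mode above: the flaky run's Machine still counts as adoptable, so the KCP reconcile keeps returning early while waiting for adoption.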

So next step: why is that owner reference missing in the above example? There could be different reasons:

  • We somehow missed the update event on the Cluster which did set the .spec.controlPlaneEndpoint stuff
  • Or it reconciled but did return in some way that nothing happened (e.g. while trying to adopt?)

One idea to try to test it: Create a PR that:

  • tries to only run this test multiple times in a row until it fails (and no other tests)
  • once we have that:
    • we could add some more logging or stuff on that PR to get a step further
    • or dump resources more in different stages
    • or increase log levels in CAPI or KCP controller to see what happened (or what did not happen)

Definitely not an easy one to figure out.

chrischdi avatar Jan 27 '25 14:01 chrischdi

The only relevant return above should be this one where we return because we wait for the adoption of Machines to happen (which is not the responsibility of the KCP controller

@chrischdi are you sure it's not KCP's responsibility?

I think it is + the name of the test is

Should adopt up-to-date control plane Machines without modification

sbueringer avatar Jan 28 '25 07:01 sbueringer

Generic, undated search link for this error: https://storage.googleapis.com/k8s-triage/index.html?text=Resource%20versions%20didn't%20stay%20stable&job=.-cluster-api-.e2e.&xjob=.-provider-.%7C.-operator-

This is still occurring frequently.

cprivitere avatar Jun 03 '25 15:06 cprivitere

Hi, I see this flake in release 1.9. Earlier it was happening on main; is it fixed on main?

https://storage.googleapis.com/k8s-triage/index.html?text=Timed%20waiting%20for%20all%20control%20plane%20replicas%20to%20be%20updated&job=.-cluster-api-.&xjob=.-provider-.%7C.*-operator-

arshadd-b avatar Sep 11 '25 08:09 arshadd-b

@arshadd-b

I see some fairly recent changes to this specific test ("Should adopt up-to-date control plane Machines without modification") in more recent releases that are not in the 1.9 branch:

  • https://github.com/kubernetes-sigs/cluster-api/commit/cabafc38007bcb2991dd0038fd048c2366a0d2cd
  • https://github.com/kubernetes-sigs/cluster-api/commit/ed4e6b28e87e45cfd8072b857529290f295366cf
  • https://github.com/kubernetes-sigs/cluster-api/commit/671164eef28f901eb7cff7ceaca2d51aed377ea2
  • https://github.com/kubernetes-sigs/cluster-api/commit/e7ca5f4c17a7273320101e43c3b5fd100c533515

s3rj1k avatar Sep 11 '25 14:09 s3rj1k

Assuming that this is a test-code issue, the only thing that I can see is this:

(screenshot of the test diff)

This change basically https://github.com/kubernetes-sigs/cluster-api/commit/cabafc38007bcb2991dd0038fd048c2366a0d2cd

Does not look like a test code issue :)

s3rj1k avatar Sep 11 '25 14:09 s3rj1k

Yeah, it doesn't look related to the test code. What I'm thinking: we check the logs from the artifacts to see if we can find something there, or, as mentioned in https://github.com/kubernetes-sigs/cluster-api/issues/11454#issuecomment-2615910311, raise a dummy PR to try to reproduce it with more logging.

arshadd-b avatar Sep 11 '25 15:09 arshadd-b