
[e2e test flake] Timed waiting for all control plane replicas to be updated

Open cahillsf opened this issue 1 year ago • 14 comments

Which jobs are flaking?

  • periodic-cluster-api-e2e-mink8s-main
  • periodic-cluster-api-e2e-main

Which tests are flaking?

When testing KCP adoption Should adopt up-to-date control plane Machines without modification

Since when has it been flaking?

2024-11-05 is the first instance of this flake I can find: https://storage.googleapis.com/k8s-triage/index.html?date=2024-11-06&text=Timed%20waiting%20for%20all%20control%20plane%20replicas%20to%20be%20updated&job=.periodic-cluster-api-e2e.

now it appears quite consistently: https://storage.googleapis.com/k8s-triage/index.html?date=2024-11-20&text=Timed%20waiting%20for%20all%20control%20plane%20replicas%20to%20be%20updated&job=.periodic-cluster-api-e2e.

Testgrid link

https://testgrid.k8s.io/sig-cluster-lifecycle-cluster-api#capi-e2e-main

Reason for failure (if possible)

TBD (have not had time to investigate thoroughly)

Anything else we need to know?

example failed job: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-mink8s-main/1858697443558821888

have filtered commits on main to around the time this flake popped up: https://github.com/kubernetes-sigs/cluster-api/commits/main/?since=2024-11-03&until=2024-11-05

have not had time to investigate further yet

Label(s) to be applied

/kind flake

cahillsf avatar Nov 20 '24 18:11 cahillsf

/area e2e-testing

cahillsf avatar Nov 20 '24 18:11 cahillsf

I confirm this flake. /assign

sivchari avatar Nov 21 '24 04:11 sivchari

did we fix something related to this? I don't see any failures after Nov 23: https://storage.googleapis.com/k8s-triage/index.html?text=Timed%20waiting%20for%20all%20control%20plane%20replicas%20to%20be%20updated&job=.-cluster-api-.&xjob=.-provider-.%7C.-operator-.

adilGhaffarDev avatar Nov 26 '24 08:11 adilGhaffarDev

New occurrences now.

Seems it's not gone.

chrischdi avatar Nov 27 '24 13:11 chrischdi

@sivchari 👋 are you still working on this?

cahillsf avatar Jan 09 '25 20:01 cahillsf

Hi @cahillsf, I didn't have time to work on this. I'll try to get to it next week. Or should I unassign myself? Thanks.

sivchari avatar Jan 10 '25 04:01 sivchari

Hi @cahillsf, I didn't have time to work on this. I'll try to get to it next week. Or should I unassign myself? Thanks.

If you have time, please go ahead; even if you don't have a fix, it would be helpful to add your findings. If you're not going to have time, feel free to unassign and we can find an assignee from the CI release team for this cycle.

cahillsf avatar Jan 13 '25 18:01 cahillsf

Hi @cahillsf, I'll unassign myself for now. /unassign

sivchari avatar Jan 15 '25 01:01 sivchari

/help

chrischdi avatar Jan 22 '25 14:01 chrischdi

@chrischdi: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jan 22 '25 14:01 k8s-ci-robot

Don't know how helpful this will be, but I've spent some time looking into the most recent failure, logs here: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-main/1878784832272601088

First my understanding of the basic structure of the test:

  • create a cluster
  • create a machine in that cluster with the proper naming to become a control plane
  • then add a kubeadmcontrolplane object to the cluster
  • Wait for the kubeadmcontrolplane object to be ready
  • Wait for CAPI to realize that the earlier machine would work for the cluster's kubeadmcontrolplane and take ownership of it
  • delete the cluster.

Please correct me if I'm wrong or missing important minutiae here.

So this test seems to be failing on the "wait for the kubeadmcontrolplane object to be ready" step. Here are the problems with the kcp object according to its status fields:

  • Available is False due to "Waiting for etcd to report the list of members"
  • EtcdClusterHealthy is Unknown due to "Waiting for Cluster spec.controlPlaneEndpoint to be set"
  • ControlPlaneComponentsHealthy is Unknown due to "Waiting for Cluster spec.controlPlaneEndpoint to be set"
  • Scaling up is True due to "Scaling up from 0 to 1 replicas"

The controlPlaneEndpoint issues are confusing to me as the controlPlaneEndpoint is set in all the objects that I know it should be set in, but I'm probably missing one or this is something local to the kcp object in a way I don't understand.

Were this a repeatable error, I'd try to get the e2e test to leave the machine around, then get onto it and inspect the cloud-init output for errors. However, as this is a flake that I can't force to occur locally, I'm at a loss as to how to proceed with debugging. Looking for ideas/help.

cprivitere avatar Jan 24 '25 21:01 cprivitere

Digging a bit into it:

KCP logs for that cluster:

❯ cat kcp.log | grep kcp-adoption --color | grep 'controller="kubeadmcontrolplane"' --color
I0113 13:03:56.425859       1 controller.go:185] "Cluster Controller has not yet set OwnerRef" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="kcp-adoption-hwn0kw/kcp-adoption-0am4s3-control-plane" namespace="kcp-adoption-hwn0kw" name="kcp-adoption-0am4s3-control-plane" reconcileID="3a38ac56-3e0e-42aa-a775-93765317ec57"
I0113 13:03:56.674880       1 controller.go:371] "Reconcile KubeadmControlPlane" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="kcp-adoption-hwn0kw/kcp-adoption-0am4s3-control-plane" namespace="kcp-adoption-hwn0kw" name="kcp-adoption-0am4s3-control-plane" reconcileID="ba5fae1f-a1bf-43f1-a24b-6000f0dc2153" Cluster="kcp-adoption-hwn0kw/kcp-adoption-0am4s3"
I0113 13:03:56.770301       1 controller.go:421] "Cluster does not yet have a ControlPlaneEndpoint defined" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="kcp-adoption-hwn0kw/kcp-adoption-0am4s3-control-plane" namespace="kcp-adoption-hwn0kw" name="kcp-adoption-0am4s3-control-plane" reconcileID="ba5fae1f-a1bf-43f1-a24b-6000f0dc2153" Cluster="kcp-adoption-hwn0kw/kcp-adoption-0am4s3"
I0113 13:03:56.972648       1 controller.go:371] "Reconcile KubeadmControlPlane" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="kcp-adoption-hwn0kw/kcp-adoption-0am4s3-control-plane" namespace="kcp-adoption-hwn0kw" name="kcp-adoption-0am4s3-control-plane" reconcileID="2b50f01b-20b8-444d-876e-76a51ab394ec" Cluster="kcp-adoption-hwn0kw/kcp-adoption-0am4s3"
I0113 13:03:56.980947       1 controller.go:421] "Cluster does not yet have a ControlPlaneEndpoint defined" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="kcp-adoption-hwn0kw/kcp-adoption-0am4s3-control-plane" namespace="kcp-adoption-hwn0kw" name="kcp-adoption-0am4s3-control-plane" reconcileID="2b50f01b-20b8-444d-876e-76a51ab394ec" Cluster="kcp-adoption-hwn0kw/kcp-adoption-0am4s3"
I0113 13:18:57.435618       1 controller.go:597] "Reconcile KubeadmControlPlane deletion" controller="kubeadmcontrolplane" controllerGroup="controlplane.cluster.x-k8s.io" controllerKind="KubeadmControlPlane" KubeadmControlPlane="kcp-adoption-hwn0kw/kcp-adoption-0am4s3-control-plane" namespace="kcp-adoption-hwn0kw" name="kcp-adoption-0am4s3-control-plane" reconcileID="e3fdf5bc-0037-4da8-9c8b-479dea8c6f57" Cluster="kcp-adoption-hwn0kw/kcp-adoption-0am4s3"

Timeline:

  • ...
  • 13:03:56.980947 : KCP Reconciles and logs Cluster does not yet have a ControlPlaneEndpoint defined
    • KCP in this case returns in a way that requires an event or resync to reconcile again. The happy path is to get an event for the Cluster object once the ControlPlaneEndpoint is added.
  • 13:18:57.435618 : Next KCP reconcile with the deletion

So taking a look when .spec.controlPlaneEndpoint was added to the Cluster object:

  • According to the Cluster object's managedFields, this was set at 2025-01-13T13:03:57Z, so after the above reconcile

This should emit an event that triggers a reconcile of the KCP object and gets it back into the flow. But we don't get a Reconcile KubeadmControlPlane log line, and the status does not get any update either.
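The expected mechanism is, roughly, a Cluster→KCP event mapping: an update on the Cluster (e.g. the ControlPlaneEndpoint being set) should be mapped to a reconcile request for the referenced control plane. A minimal, stdlib-only sketch of that idea follows; the real code uses controller-runtime watches and CAPI's actual API types, so the structs and field names here are illustrative only:

```go
// Simplified model of mapping a Cluster event to a reconcile request for
// its control plane object. Not CAPI's real types: Cluster, Endpoint, and
// Request below are hypothetical stand-ins for illustration.
package main

import "fmt"

type Endpoint struct {
	Host string
	Port int
}

type Cluster struct {
	Namespace, Name      string
	ControlPlaneRefName  string // name of the referenced KubeadmControlPlane
	ControlPlaneEndpoint Endpoint
}

type Request struct{ Namespace, Name string }

// mapClusterToKCP mimics a controller-runtime MapFunc: given a Cluster
// event, return the reconcile request(s) for its control plane, if any.
func mapClusterToKCP(c Cluster) []Request {
	if c.ControlPlaneRefName == "" {
		return nil
	}
	return []Request{{Namespace: c.Namespace, Name: c.ControlPlaneRefName}}
}

func main() {
	c := Cluster{
		Namespace:            "kcp-adoption-hwn0kw",
		Name:                 "kcp-adoption-0am4s3",
		ControlPlaneRefName:  "kcp-adoption-0am4s3-control-plane",
		ControlPlaneEndpoint: Endpoint{Host: "172.18.0.3", Port: 6443},
	}
	for _, r := range mapClusterToKCP(c) {
		// The enqueued request is what should drive the missing reconcile.
		fmt.Printf("%s/%s\n", r.Namespace, r.Name)
	}
}
```

If this mapping fires but no "Reconcile KubeadmControlPlane" log line appears, the event was either never delivered or dropped before reaching the reconciler, which matches the two hypotheses below.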

So I think KCP returns before even running this defer which would lead to status updates too: https://github.com/kubernetes-sigs/cluster-api/blob/main/controlplane/kubeadm/internal/controllers/controller.go#L212

The only relevant return above should be this one where we return because we wait for the adoption of Machines to happen (which is not the responsibility of the KCP controller): https://github.com/kubernetes-sigs/cluster-api/blob/main/controlplane/kubeadm/internal/controllers/controller.go#L209

To verify that: let's take a look at the Machine and especially the ownerReferences part:

  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1beta1
    kind: Cluster
    name: kcp-adoption-0am4s3
    uid: 46f2c0d3-626d-4dc5-a595-e6a81677825c

We only see a Cluster owner reference, but here we wait for adoptable machines, which (here) means Machines that do not have a controller owner reference. That is the case here. For comparison, here is a controller-owned KCP Machine from a test run which was green (xref):

  ownerReferences:
  - apiVersion: cluster.x-k8s.io/v1beta1
    kind: Cluster
    name: kcp-adoption-doaliz
    uid: df759ccf-2113-4eb5-9613-020aead2bf98
  - apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: KubeadmControlPlane
    name: kcp-adoption-doaliz-control-plane
    uid: 63b7645b-167e-46ca-a28c-a02562c422fb
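The adoptability check being discussed boils down to "does any owner reference have controller: true". A stdlib-only sketch of that check follows; it mirrors what apimachinery's `metav1.GetControllerOf` does, but with illustrative local types rather than the real API types:

```go
// Sketch of the "adoptable Machine" check: a Machine is adoptable when
// none of its ownerReferences is marked as the controller.
// OwnerReference/Machine here are simplified stand-ins, not the k8s types.
package main

import "fmt"

type OwnerReference struct {
	APIVersion string
	Kind       string
	Name       string
	Controller *bool // only set (to true) on the controlling owner
}

type Machine struct {
	Name            string
	OwnerReferences []OwnerReference
}

// hasControllerOwner scans the owner references for one with
// Controller set to true, like metav1.GetControllerOf.
func hasControllerOwner(m Machine) bool {
	for _, ref := range m.OwnerReferences {
		if ref.Controller != nil && *ref.Controller {
			return true
		}
	}
	return false
}

func main() {
	yes := true
	// Machine from the flaky run: only a Cluster owner reference.
	flaky := Machine{Name: "flaky", OwnerReferences: []OwnerReference{
		{APIVersion: "cluster.x-k8s.io/v1beta1", Kind: "Cluster", Name: "kcp-adoption-0am4s3"},
	}}
	// Machine from the green run: additionally owned by the KCP.
	green := Machine{Name: "green", OwnerReferences: []OwnerReference{
		{APIVersion: "cluster.x-k8s.io/v1beta1", Kind: "Cluster", Name: "kcp-adoption-doaliz"},
		{APIVersion: "controlplane.cluster.x-k8s.io/v1beta1", Kind: "KubeadmControlPlane",
			Name: "kcp-adoption-doaliz-control-plane", Controller: &yes},
	}}
	fmt.Println("flaky adoptable:", !hasControllerOwner(flaky)) // true: KCP never adopted it
	fmt.Println("green adoptable:", !hasControllerOwner(green)) // false: already adopted
}
```

This is consistent with the failure mode above: the flaky run's Machine still counts as adoptable, so the KCP reconcile keeps returning early while waiting for adoption.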

So next step: why is that owner reference missing in the above example? There could be different reasons:

  • We somehow missed the update event on the Cluster which did set the .spec.controlPlaneEndpoint stuff
  • Or it reconciled but did return in some way that nothing happened (e.g. while trying to adopt?)

One idea to try to test it: Create a PR that:

  • tries to only run this test multiple times in a row until it fails (and no other tests)
  • once we have that:
    • we could add some more logging or stuff on that PR to get a step further
    • or dump resources more in different stages
    • or increase log levels in CAPI or KCP controller to see what happened (or what did not happen)

Definitely not an easy one to figure out.

chrischdi avatar Jan 27 '25 14:01 chrischdi

The only relevant return above should be this one where we return because we wait for the adoption of Machines to happen (which is not the responsibility of the KCP controller

@chrischdi are you sure it's not KCP's responsibility?

I think it is + the name of the test is

Should adopt up-to-date control plane Machines without modification

sbueringer avatar Jan 28 '25 07:01 sbueringer

Generic, undated search link for this error: https://storage.googleapis.com/k8s-triage/index.html?text=Resource%20versions%20didn't%20stay%20stable&job=.-cluster-api-.e2e.&xjob=.-provider-.%7C.-operator-

This is still occurring frequently.

cprivitere avatar Jun 03 '25 15:06 cprivitere

Hi, I see this flake in release 1.9. Earlier it was happening on main; is it fixed on main?

https://storage.googleapis.com/k8s-triage/index.html?text=Timed%20waiting%20for%20all%20control%20plane%20replicas%20to%20be%20updated&job=.-cluster-api-.&xjob=.-provider-.%7C.*-operator-

arshadd-b avatar Sep 11 '25 08:09 arshadd-b

@arshadd-b

I see some fairly recent changes to this specific test ("Should adopt up-to-date control plane Machines without modification") in more recent releases that are not in the 1.9 branch:

  • https://github.com/kubernetes-sigs/cluster-api/commit/cabafc38007bcb2991dd0038fd048c2366a0d2cd
  • https://github.com/kubernetes-sigs/cluster-api/commit/ed4e6b28e87e45cfd8072b857529290f295366cf
  • https://github.com/kubernetes-sigs/cluster-api/commit/671164eef28f901eb7cff7ceaca2d51aed377ea2
  • https://github.com/kubernetes-sigs/cluster-api/commit/e7ca5f4c17a7273320101e43c3b5fd100c533515

s3rj1k avatar Sep 11 '25 14:09 s3rj1k

Assuming that this is a test-code issue, the only thing that I can see is this:

(screenshot of the test diff)

This change basically https://github.com/kubernetes-sigs/cluster-api/commit/cabafc38007bcb2991dd0038fd048c2366a0d2cd

Does not look like a test code issue :)

s3rj1k avatar Sep 11 '25 14:09 s3rj1k

Yeah, it doesn't look related to the test code. What I'm thinking: we check the logs from the artifacts to see if we can find something there, or, as mentioned in https://github.com/kubernetes-sigs/cluster-api/issues/11454#issuecomment-2615910311, raise a dummy PR to try to reproduce it with more logging.

arshadd-b avatar Sep 11 '25 15:09 arshadd-b