
Upgrading a workload cluster using ClusterClass with RuntimeSDK test is flaky with error: Resource versions didn't stay stable

Sunnatillo opened this issue 1 year ago • 21 comments

Which jobs are flaking?

capi-e2e-main

Which tests are flaking?

When upgrading a workload cluster using ClusterClass with RuntimeSDK [ClusterClass] [It] Should create, upgrade and delete a workload cluster /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/cluster_upgrade_runtimesdk.go:155

Testgrid link

Edited: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-release-1-7/1809819550426861568


Reason for failure (if possible)

No response

Anything else we need to know?

No response

Label(s) to be applied

/kind flake
/area ci

  [FAILED] Failed after 63.517s.
  Resource versions didn't stay stable
  The function passed to Consistently failed at /home/prow/go/src/sigs.k8s.io/cluster-api/test/framework/resourceversion_helpers.go:53 with:
  Expected object to be comparable, diff:   map[string]string{
    	... // 11 identical entries
    	"DockerMachine/k8s-upgrade-with-runtimesdk-05ptjc/worker-r5yi9k":                                              "38350",
    	"DockerMachine/k8s-upgrade-with-runtimesdk-05ptjc/worker-vlj8b9":                                              "38404",
  - 	"DockerMachinePool/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-mp-0-vtdfd":          "39165",
  + 	"DockerMachinePool/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-mp-0-vtdfd":          "38721",
    	"DockerMachinePoolTemplate/k8s-upgrade-with-runtimesdk-05ptjc/quick-start-default-worker-machinepooltemplate": "29519",
    	"DockerMachineTemplate/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-md-0-pgr5r":      "30876",
    	... // 16 identical entries
    	"Machine/k8s-upgrade-with-runtimesdk-05ptjc/worker-vlj8b9":                                              "38573",
    	"MachineDeployment/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-md-md-0-qjkf9": "38854",
  - 	"MachinePool/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-mp-mp-0-b8r79":       "39168",
  + 	"MachinePool/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-mp-mp-0-b8r79":       "38728",
    	"MachineSet/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-md-md-0-qjkf9-h754k":  "38853",
    	"MachineSet/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-md-md-0-qjkf9-tg2vx":  "38777",
    	... // 9 identical entries
    }
  In [It] at: /home/prow/go/src/sigs.k8s.io/cluster-api/test/framework/resourceversion_helpers.go:54 @ 06/27/24 04:19:26.795
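For context, the failing assertion compares two snapshots of per-object `resourceVersion`s and fails if any controller wrote to an object during the quiescent window. A minimal stdlib-only sketch of that idea follows; the function name here is hypothetical, and the real helper (in `test/framework/resourceversion_helpers.go`) wraps this kind of comparison in Gomega's `Consistently`:

```go
package main

import "fmt"

// compareResourceVersions is a simplified, hypothetical sketch of the
// stability check: given a "before" and "after" snapshot of object
// resourceVersions (keyed by "Kind/namespace/name"), it returns every
// entry whose resourceVersion changed. An empty result means no
// controller wrote to any tracked object between the two snapshots.
func compareResourceVersions(before, after map[string]string) []string {
	var drifted []string
	for key, rv := range before {
		if after[key] != rv {
			drifted = append(drifted, fmt.Sprintf("%s: %s -> %s", key, rv, after[key]))
		}
	}
	return drifted
}

func main() {
	// Values loosely modeled on the diff above; keys are illustrative.
	before := map[string]string{
		"DockerMachinePool/ns/mp-0": "39165",
		"MachineSet/ns/ms-0":        "38853",
	}
	after := map[string]string{
		"DockerMachinePool/ns/mp-0": "38721", // a controller reconciled again: flake
		"MachineSet/ns/ms-0":        "38853",
	}
	for _, d := range compareResourceVersions(before, after) {
		fmt.Println("changed:", d)
	}
}
```

In the e2e test this comparison is evaluated repeatedly over a time window, so a single late reconcile by any controller (here, the DockerMachinePool controller) is enough to fail the run.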

Sunnatillo avatar Jul 08 '24 08:07 Sunnatillo

@Sunnatillo the link is pointing to a different failure.

adilGhaffarDev avatar Jul 09 '24 07:07 adilGhaffarDev

I updated it with the correct link: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-release-1-7/1809819550426861568

Sunnatillo avatar Jul 09 '24 09:07 Sunnatillo

/help

fabriziopandini avatar Jul 17 '24 12:07 fabriziopandini

@fabriziopandini: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jul 17 '24 12:07 k8s-ci-robot

/assign

willie-yao avatar Jul 24 '24 20:07 willie-yao

I noticed that nodeVolumeDetachTimeout and minReadySeconds weren't added to the MachineDeployment spec for runtimesdk in #9393, so I'm going to update that and see if the flake still happens.
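For reference, a hedged sketch of where those two fields sit on a MachineDeployment (field placement per the Cluster API v1beta1 types; the name and values here are illustrative, not taken from the test):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: md-0          # illustrative name
spec:
  minReadySeconds: 5  # seconds a Machine must be ready before counting as available
  template:
    spec:
      # timeout before forcibly detaching volumes from a deleting Node
      nodeVolumeDetachTimeout: 5m
```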

willie-yao avatar Jul 24 '24 20:07 willie-yao

Fine to add, I don't think it will affect the results though

sbueringer avatar Jul 25 '24 07:07 sbueringer

Query to find the latest failures

chrischdi avatar Jul 31 '24 17:07 chrischdi

Improvement to make CAPD DockerMachinePools more deterministic: https://github.com/kubernetes-sigs/cluster-api/pull/10998

(I wouldn't expect it to solve the whole flake though)

sbueringer avatar Aug 02 '24 13:08 sbueringer

The CAPD flake seems to be gone now.

We only have a relatively rare flake with KCP left: https://storage.googleapis.com/k8s-triage/index.html?text=Detected%20objects%20with%20changed%20resourceVersion&job=.*cluster-api.e2e.main&xjob=.-provider-.

Example: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-main/1822127420073840640

sbueringer avatar Aug 13 '24 13:08 sbueringer

The CAPD flake seems to be gone now.

Will unassign myself for now but if this flake is persistent, I can take another look when I have time.

/unassign

willie-yao avatar Sep 11 '24 18:09 willie-yao

The MachinePool flake (https://github.com/kubernetes-sigs/cluster-api/issues/11162) is a lot more frequent/problematic

sbueringer avatar Sep 12 '24 05:09 sbueringer

I'll investigate it. /assign

sivchari avatar Sep 12 '24 06:09 sivchari

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged. Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-triage-robot avatar Dec 11 '24 09:12 k8s-triage-robot

/triage accepted

sivchari avatar Dec 11 '24 10:12 sivchari

Just a note that the last occurrence of this seems to have been on 11/14/2024: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-latestk8s-main/1857118214752833536

cprivitere avatar Jan 14 '25 23:01 cprivitere

Currently I don't have time to work on this. /unassign

sivchari avatar Jan 15 '25 01:01 sivchari

I think this still seems to happen (although the message changed):

https://storage.googleapis.com/k8s-triage/index.html?text=Resource%20versions%20didn%27t%20stay%20stable&job=.-cluster-api-.&test=When%20upgrading%20a%20workload%20cluster%20using%20ClusterClass%20with%20RuntimeSDK&xjob=.-provider-.%7C.-cluster-api-operator-.

chrischdi avatar Jan 15 '25 08:01 chrischdi

Alright, checking in on this and 1/15 seems to be the last occurrence of the newer error message.

cprivitere avatar Feb 17 '25 14:02 cprivitere

Checking in again and it does seem to still be occurring. Last occurrence 5/27.

cprivitere avatar Jun 03 '25 15:06 cprivitere

This is a different test spec but seems to fail with the same error: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-mink8s-release-1-9/1932381270260584448

mboersma avatar Jun 10 '25 16:06 mboersma

What I have noticed is that this issue is similar to https://github.com/kubernetes-sigs/cluster-api/issues/12334; just different RuntimeSDK-related test cases were failing. I haven't observed this issue in the last week, after the fix https://github.com/kubernetes-sigs/cluster-api/pull/12902.

I think we can close this as well. cc @chrischdi

arshadd-b avatar Nov 03 '25 08:11 arshadd-b

Sounds good 🎉

xref:

https://storage.googleapis.com/k8s-triage/index.html?text=Resource%20version&job=.*-cluster-api-.*main.*&test=.*RuntimeSDK.*&xjob=.*-provider-.*%7C.*-cluster-api-operator-.*

RuntimeSDK test still has some other flakes though. But maybe they are already tracked differently.

chrischdi avatar Nov 03 '25 08:11 chrischdi

Yes, that is actually a different issue with new flakes. Closing this one as it is fixed. Thanks @chrischdi for helping and fixing it. /close

arshadd-b avatar Nov 03 '25 09:11 arshadd-b

@arshadd-b: Closing this issue.

In response to this:

Yes, that is actually a different issue with new flakes. Closing this one as it is fixed. Thanks @chrischdi for helping and fixing it. /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Nov 03 '25 09:11 k8s-ci-robot