
Upgrading a workload cluster using ClusterClass with RuntimeSDK test is flaky with error: Resource versions didn't stay stable

Sunnatillo opened this issue 1 year ago • 21 comments

Which jobs are flaking?

capi-e2e-main

Which tests are flaking?

When upgrading a workload cluster using ClusterClass with RuntimeSDK [ClusterClass] [It] Should create, upgrade and delete a workload cluster /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/cluster_upgrade_runtimesdk.go:155

Testgrid link

Edited: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-release-1-7/1809819550426861568


Reason for failure (if possible)

No response

Anything else we need to know?

No response

Label(s) to be applied

/kind flake
/area ci

  [FAILED] Failed after 63.517s.
  Resource versions didn't stay stable
  The function passed to Consistently failed at /home/prow/go/src/sigs.k8s.io/cluster-api/test/framework/resourceversion_helpers.go:53 with:
  Expected object to be comparable, diff:   map[string]string{
    	... // 11 identical entries
    	"DockerMachine/k8s-upgrade-with-runtimesdk-05ptjc/worker-r5yi9k":                                              "38350",
    	"DockerMachine/k8s-upgrade-with-runtimesdk-05ptjc/worker-vlj8b9":                                              "38404",
  - 	"DockerMachinePool/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-mp-0-vtdfd":          "39165",
  + 	"DockerMachinePool/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-mp-0-vtdfd":          "38721",
    	"DockerMachinePoolTemplate/k8s-upgrade-with-runtimesdk-05ptjc/quick-start-default-worker-machinepooltemplate": "29519",
    	"DockerMachineTemplate/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-md-0-pgr5r":      "30876",
    	... // 16 identical entries
    	"Machine/k8s-upgrade-with-runtimesdk-05ptjc/worker-vlj8b9":                                              "38573",
    	"MachineDeployment/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-md-md-0-qjkf9": "38854",
  - 	"MachinePool/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-mp-mp-0-b8r79":       "39168",
  + 	"MachinePool/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-mp-mp-0-b8r79":       "38728",
    	"MachineSet/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-md-md-0-qjkf9-h754k":  "38853",
    	"MachineSet/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-md-md-0-qjkf9-tg2vx":  "38777",
    	... // 9 identical entries
    }
  In [It] at: /home/prow/go/src/sigs.k8s.io/cluster-api/test/framework/resourceversion_helpers.go:54 @ 06/27/24 04:19:26.795
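For context, the failing assertion compares two snapshots of per-object `resourceVersion`s and fails if any controller wrote to an object during the quiescent window. A minimal stdlib-only sketch of that idea follows; the function name here is hypothetical, and the real helper (in `test/framework/resourceversion_helpers.go`) wraps this kind of comparison in Gomega's `Consistently`:

```go
package main

import "fmt"

// compareResourceVersions is a simplified, hypothetical sketch of the
// stability check: given a "before" and "after" snapshot of object
// resourceVersions (keyed by "Kind/namespace/name"), it returns every
// entry whose resourceVersion changed. An empty result means no
// controller wrote to any tracked object between the two snapshots.
func compareResourceVersions(before, after map[string]string) []string {
	var drifted []string
	for key, rv := range before {
		if after[key] != rv {
			drifted = append(drifted, fmt.Sprintf("%s: %s -> %s", key, rv, after[key]))
		}
	}
	return drifted
}

func main() {
	// Values loosely modeled on the diff above; keys are illustrative.
	before := map[string]string{
		"DockerMachinePool/ns/mp-0": "39165",
		"MachineSet/ns/ms-0":        "38853",
	}
	after := map[string]string{
		"DockerMachinePool/ns/mp-0": "38721", // a controller reconciled again: flake
		"MachineSet/ns/ms-0":        "38853",
	}
	for _, d := range compareResourceVersions(before, after) {
		fmt.Println("changed:", d)
	}
}
```

In the e2e test this comparison is evaluated repeatedly over a time window, so a single late reconcile by any controller (here, the DockerMachinePool controller) is enough to fail the run.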

Sunnatillo avatar Jul 08 '24 08:07 Sunnatillo

@Sunnatillo the link is pointing to a different failure.

adilGhaffarDev avatar Jul 09 '24 07:07 adilGhaffarDev

I updated it with the correct link: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-release-1-7/1809819550426861568

Sunnatillo avatar Jul 09 '24 09:07 Sunnatillo

/help

fabriziopandini avatar Jul 17 '24 12:07 fabriziopandini

@fabriziopandini: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Jul 17 '24 12:07 k8s-ci-robot

/assign

willie-yao avatar Jul 24 '24 20:07 willie-yao

I noticed that nodeVolumeDetachTimeout and minReadySeconds weren't added to the MachineDeployment spec for runtimesdk in #9393, so I'm going to update that and see if the flake still happens.
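For reference, a hedged sketch of where those two fields sit on a MachineDeployment (field placement per the Cluster API v1beta1 types; the name and values here are illustrative, not taken from the test):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: md-0          # illustrative name
spec:
  minReadySeconds: 5  # seconds a Machine must be ready before counting as available
  template:
    spec:
      # timeout before forcibly detaching volumes from a deleting Node
      nodeVolumeDetachTimeout: 5m
```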

willie-yao avatar Jul 24 '24 20:07 willie-yao

Fine to add, I don't think it will affect the results though

sbueringer avatar Jul 25 '24 07:07 sbueringer

Query to find the latest failures

chrischdi avatar Jul 31 '24 17:07 chrischdi

Improvement to make CAPD DockerMachinePools more deterministic: https://github.com/kubernetes-sigs/cluster-api/pull/10998

(I wouldn't expect it to solve the whole flake though)

sbueringer avatar Aug 02 '24 13:08 sbueringer

The CAPD flake seems to be gone now.

We only have a relatively rare flake with KCP left: https://storage.googleapis.com/k8s-triage/index.html?text=Detected%20objects%20with%20changed%20resourceVersion&job=.*cluster-api.e2e.main&xjob=.-provider-.

Example: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-main/1822127420073840640

sbueringer avatar Aug 13 '24 13:08 sbueringer

The CAPD flake seems to be gone now.

Will unassign myself for now but if this flake is persistent, I can take another look when I have time.

/unassign

willie-yao avatar Sep 11 '24 18:09 willie-yao

The MachinePool flake (https://github.com/kubernetes-sigs/cluster-api/issues/11162) is a lot more frequent/problematic

sbueringer avatar Sep 12 '24 05:09 sbueringer

I'll investigate it. /assign

sivchari avatar Sep 12 '24 06:09 sivchari

This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged. Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.

You can:

  • Confirm that this issue is still relevant with /triage accepted (org members only)
  • Deprioritize it with /priority important-longterm or /priority backlog
  • Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted

k8s-triage-robot avatar Dec 11 '24 09:12 k8s-triage-robot

/triage accepted

sivchari avatar Dec 11 '24 10:12 sivchari

Just a note that the last occurrence of this seems to have been on 11/14/2024: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-latestk8s-main/1857118214752833536

cprivitere avatar Jan 14 '25 23:01 cprivitere

Currently I don't have time to work on this. /unassign

sivchari avatar Jan 15 '25 01:01 sivchari

I think this still seems to happen (although the message changed):

https://storage.googleapis.com/k8s-triage/index.html?text=Resource%20versions%20didn%27t%20stay%20stable&job=.-cluster-api-.&test=When%20upgrading%20a%20workload%20cluster%20using%20ClusterClass%20with%20RuntimeSDK&xjob=.-provider-.%7C.-cluster-api-operator-.

chrischdi avatar Jan 15 '25 08:01 chrischdi

Alright, checking in on this and 1/15 seems to be the last occurrence of the newer error message.

cprivitere avatar Feb 17 '25 14:02 cprivitere

Checking in again and it does seem to still be occurring. Last occurrence 5/27.

cprivitere avatar Jun 03 '25 15:06 cprivitere

This is a different test spec but seems to fail with the same error: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-mink8s-release-1-9/1932381270260584448

mboersma avatar Jun 10 '25 16:06 mboersma

What I have noticed is that this issue is similar to https://github.com/kubernetes-sigs/cluster-api/issues/12334; just different RuntimeSDK-related test cases were failing. I haven't observed this issue in the last week, after the fix https://github.com/kubernetes-sigs/cluster-api/pull/12902.

I think we can close this as well. cc @chrischdi

arshadd-b avatar Nov 03 '25 08:11 arshadd-b

Sounds good 🎉

xref:

https://storage.googleapis.com/k8s-triage/index.html?text=Resource%20version&job=.*-cluster-api-.*main.*&test=.*RuntimeSDK.*&xjob=.*-provider-.*%7C.*-cluster-api-operator-.*

RuntimeSDK test still has some other flakes though. But maybe they are already tracked differently.

chrischdi avatar Nov 03 '25 08:11 chrischdi

Yes, that is actually a different issue with new flakes. Closing this one as it is fixed. Thanks @chrischdi for helping and fixing it. /close

arshadd-b avatar Nov 03 '25 09:11 arshadd-b

@arshadd-b: Closing this issue.

In response to this:

Yes, that is actually a different issue with new flakes. Closing this one as it is fixed. Thanks @chrischdi for helping and fixing it. /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot avatar Nov 03 '25 09:11 k8s-ci-robot