cluster-api
Upgrading a workload cluster using ClusterClass with RuntimeSDK test is flaky with error: Resource versions didn't stay stable
Which jobs are flaking?
capi-e2e-main
Which tests are flaking?
When upgrading a workload cluster using ClusterClass with RuntimeSDK [ClusterClass] [It] Should create, upgrade and delete a workload cluster /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/cluster_upgrade_runtimesdk.go:155
Testgrid link
Edited: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-release-1-7/1809819550426861568
No response
Reason for failure (if possible)
No response
Anything else we need to know?
No response
Label(s) to be applied
/kind flake /area ci
[FAILED] Failed after 63.517s.
Resource versions didn't stay stable
The function passed to Consistently failed at /home/prow/go/src/sigs.k8s.io/cluster-api/test/framework/resourceversion_helpers.go:53 with:
Expected object to be comparable, diff: map[string]string{
... // 11 identical entries
"DockerMachine/k8s-upgrade-with-runtimesdk-05ptjc/worker-r5yi9k": "38350",
"DockerMachine/k8s-upgrade-with-runtimesdk-05ptjc/worker-vlj8b9": "38404",
- "DockerMachinePool/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-mp-0-vtdfd": "39165",
+ "DockerMachinePool/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-mp-0-vtdfd": "38721",
"DockerMachinePoolTemplate/k8s-upgrade-with-runtimesdk-05ptjc/quick-start-default-worker-machinepooltemplate": "29519",
"DockerMachineTemplate/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-md-0-pgr5r": "30876",
... // 16 identical entries
"Machine/k8s-upgrade-with-runtimesdk-05ptjc/worker-vlj8b9": "38573",
"MachineDeployment/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-md-md-0-qjkf9": "38854",
- "MachinePool/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-mp-mp-0-b8r79": "39168",
+ "MachinePool/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-mp-mp-0-b8r79": "38728",
"MachineSet/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-md-md-0-qjkf9-h754k": "38853",
"MachineSet/k8s-upgrade-with-runtimesdk-05ptjc/k8s-upgrade-with-runtimesdk-z1t5eg-md-md-0-qjkf9-tg2vx": "38777",
... // 9 identical entries
}
In [It] at: /home/prow/go/src/sigs.k8s.io/cluster-api/test/framework/resourceversion_helpers.go:54 @ 06/27/24 04:19:26.795
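The failing check asserts that no object's resourceVersion changes while the test framework watches the cluster (the real code runs inside a Gomega Consistently block in resourceversion_helpers.go). The idea can be pictured as diffing two snapshots of object-to-resourceVersion maps; below is a minimal, self-contained Go sketch, not the actual framework code — the function name `changedResourceVersions` and the sample data are made up for illustration:

```go
package main

import "fmt"

// changedResourceVersions compares two snapshots mapping "Kind/namespace/name"
// to resourceVersion and returns the objects whose resourceVersion changed.
// This is a simplified stand-in for the stability check the e2e framework
// performs; the real check repeatedly re-lists objects from the live cluster.
func changedResourceVersions(before, after map[string]string) []string {
	var changed []string
	for obj, rv := range before {
		if newRV, ok := after[obj]; ok && newRV != rv {
			changed = append(changed, obj)
		}
	}
	return changed
}

func main() {
	before := map[string]string{
		"DockerMachinePool/ns/mp-0": "39165",
		"MachineDeployment/ns/md-0": "38854",
	}
	after := map[string]string{
		"DockerMachinePool/ns/mp-0": "38721", // resourceVersion moved: flake
		"MachineDeployment/ns/md-0": "38854", // unchanged: stable
	}
	fmt.Println(changedResourceVersions(before, after))
}
```

In the failure above, the DockerMachinePool and MachinePool objects are exactly the entries that would show up in such a diff, which pointed the investigation at the MachinePool controllers.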
@Sunnatillo the link is pointing to a different failure.
I updated it with the correct link: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-release-1-7/1809819550426861568
/help
@fabriziopandini: This request has been marked as needing help from a contributor.
Guidelines
Please ensure that the issue body includes answers to the following questions:
- Why are we solving this issue?
- To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
- Does this issue have zero to low barrier of entry?
- How can the assignee reach out to you for help?
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.
In response to this:
/help
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/assign
I noticed that nodeVolumeDetachTimeout and minReadySeconds weren't added to the MachineDeployment spec for the RuntimeSDK test in #9393, so I'm going to update that and see if the flake still happens.
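For context, these fields are set per worker topology on the Cluster object when using ClusterClass. A hedged sketch of where they would go (field names from the Cluster API v1beta1 API; the class/name/values here are examples only, not the actual test fixture):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
spec:
  topology:
    workers:
      machineDeployments:
      - class: default-worker        # example class name
        name: md-0                   # example topology name
        replicas: 1
        minReadySeconds: 5           # delay before a Machine counts as ready
        nodeVolumeDetachTimeout: 5m  # bound volume-detach wait on deletion
```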
Fine to add, I don't think it will affect the results though
Query to find the latest failures
Improvement to make CAPD DockerMachinePools more deterministic: https://github.com/kubernetes-sigs/cluster-api/pull/10998
(I wouldn't expect it to solve the whole flake though)
The CAPD flake seems to be gone now.
We only have a relatively rare flake with KCP left: https://storage.googleapis.com/k8s-triage/index.html?text=Detected%20objects%20with%20changed%20resourceVersion&job=.*cluster-api.e2e.main&xjob=.-provider-.
Example: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-main/1822127420073840640
The CAPD flake seems to be gone now.
Will unassign myself for now but if this flake is persistent, I can take another look when I have time.
/unassign
The MachinePool flake (https://github.com/kubernetes-sigs/cluster-api/issues/11162) is a lot more frequent/problematic
I'll investigate it. /assign
This issue is labeled with priority/important-soon but has not been updated in over 90 days, and should be re-triaged.
Important-soon issues must be staffed and worked on either currently, or very soon, ideally in time for the next release.
You can:
- Confirm that this issue is still relevant with /triage accepted (org members only)
- Deprioritize it with /priority important-longterm or /priority backlog
- Close this issue with /close
For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/
/remove-triage accepted
/triage accepted
Just a note that the last occurrence of this seems to have been on 11/14/2024: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-latestk8s-main/1857118214752833536
I don't have time to work on this at the moment. /unassign
I think this still seems to happen (although the message changed):
https://storage.googleapis.com/k8s-triage/index.html?text=Resource%20versions%20didn%27t%20stay%20stable&job=.-cluster-api-.&test=When%20upgrading%20a%20workload%20cluster%20using%20ClusterClass%20with%20RuntimeSDK&xjob=.-provider-.%7C.-cluster-api-operator-.
Alright, checking in on this and 1/15 seems to be the last occurrence of the newer error message.
Checking in again and it does seem to still be occurring. Last occurrence 5/27.
This is a different test spec but seems to fail with the same error: https://prow.k8s.io/view/gs/kubernetes-ci-logs/logs/periodic-cluster-api-e2e-mink8s-release-1-9/1932381270260584448
What I have noticed is that this issue is similar to https://github.com/kubernetes-sigs/cluster-api/issues/12334, just with different RuntimeSDK-related test cases failing. I haven't observed this issue in the last week, since the fix https://github.com/kubernetes-sigs/cluster-api/pull/12902.
I think we can close this as well. cc: @chrischdi
Sounds good 🎉
xref:
https://storage.googleapis.com/k8s-triage/index.html?text=Resource%20version&job=.*-cluster-api-.*main.*&test=.*RuntimeSDK.*&xjob=.*-provider-.*%7C.*-cluster-api-operator-.*
RuntimeSDK test still has some other flakes though. But maybe they are already tracked differently.
Yes, that is actually a different issue with new flakes. Closing this one as it is fixed. Thanks @chrischdi for helping and fixing it. /close
@arshadd-b: Closing this issue.
In response to this:
Yes, that is actually a different issue with new flakes. Closing this one as it is fixed. Thanks @chrischdi for helping and fixing it. /close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.