capz-periodic-e2e-full-v1beta1-minus-1 CI job fails to delete AKS cluster
/kind bug
What steps did you take and what happened:
This new CI job has never passed. Digging into the logs, we see that the AKS test cluster fails to delete:
```
INFO: Waiting for the Cluster capz-e2e-293uvx/capz-e2e-293uvx-aks to be deleted
STEP: Waiting for cluster capz-e2e-293uvx-aks to be deleted
STEP: Redacting sensitive information from logs
• Failure in Spec Teardown (AfterEach) [3806.120 seconds]
Workload cluster creation
/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/azure_test.go:43
Creating an AKS cluster [EXPERIMENTAL] [AfterEach]
/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/azure_test.go:501
with a single control plane node and 1 node
/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/azure_test.go:502

Timed out after 1800.003s.
Expected
    <bool>: false
to be true
/home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/[email protected]/framework/cluster_helpers.go:175

Full Stack Trace
sigs.k8s.io/cluster-api/test/framework.WaitForClusterDeleted({0x3a73458?, 0xc0001b4008}, {{0x7f15e41e6a90?, 0xc00071ecb0?}, 0xc0019fd100?}, {0xc000e732e0, 0x2, 0x2})
    /home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/[email protected]/framework/cluster_helpers.go:175 +0x183
sigs.k8s.io/cluster-api/test/framework.DeleteAllClustersAndWait({0x3a73458?, 0xc0001b4008}, {{0x3a800c0?, 0xc00071ecb0?}, {0xc000138a00?, 0xf?}}, {0xc000e732e0, 0x2, 0x2})
    /home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/[email protected]/framework/cluster_helpers.go:271 +0x412
sigs.k8s.io/cluster-api-provider-azure/test/e2e.dumpSpecResourcesAndCleanup({0x3a73458, 0xc0001b4008}, {{0x3678dca, 0x17}, {0x3a81a78, 0xc000b0ab00}, {0xc0006ae240, 0xf}, 0xc0004438c0, 0xc00179a340, ...})
    /home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/common.go:169 +0x531
sigs.k8s.io/cluster-api-provider-azure/test/e2e.glob..func1.2()
    /home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/azure_test.go:132 +0x285
github.com/onsi/ginkgo/internal/leafnodes.(*runner).runSync(0x0?)
    /home/prow/go/pkg/mod/github.com/onsi/[email protected]/internal/leafnodes/runner.go:113 +0xb1
github.com/onsi/ginkgo/internal/leafnodes.(*runner).run(0x0?)
    /home/prow/go/pkg/mod/github.com/onsi/[email protected]/internal/leafnodes/runner.go:64 +0x125
github.com/onsi/ginkgo/internal/leafnodes.(*SetupNode).Run(0x15db0b9?)
    /home/prow/go/pkg/mod/github.com/onsi/[email protected]/internal/leafnodes/setup_nodes.go:15 +0x7b
github.com/onsi/ginkgo/internal/spec.(*Spec).runSample.func1()
    /home/prow/go/pkg/mod/github.com/onsi/[email protected]/internal/spec/spec.go:180 +0x209
github.com/onsi/ginkgo/internal/spec.(*Spec).runSample(0xc0001a2780, 0xc003aad9b0?, {0x3a50900, 0xc0001f48c0})
    /home/prow/go/pkg/mod/github.com/onsi/[email protected]/internal/spec/spec.go:218 +0x366
github.com/onsi/ginkgo/internal/spec.(*Spec).Run(0xc0001a2780, {0x3a50900, 0xc0001f48c0})
    /home/prow/go/pkg/mod/github.com/onsi/[email protected]/internal/spec/spec.go:138 +0xe7
github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).runSpec(0xc0003f6160, 0xc0001a2780)
    /home/prow/go/pkg/mod/github.com/onsi/[email protected]/internal/specrunner/spec_runner.go:200 +0xe8
github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).runSpecs(0xc0003f6160)
    /home/prow/go/pkg/mod/github.com/onsi/[email protected]/internal/specrunner/spec_runner.go:170 +0x1a5
github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).Run(0xc0003f6160)
    /home/prow/go/pkg/mod/github.com/onsi/[email protected]/internal/specrunner/spec_runner.go:66 +0xc5
github.com/onsi/ginkgo/internal/suite.(*Suite).Run(0xc00021acb0, {0x7f15e4354f68, 0xc0000cf1e0}, {0x3655d31, 0x8}, {0xc00077e820, 0x2, 0x2}, {0x3a76148, 0xc0001f48c0}, ...)
    /home/prow/go/pkg/mod/github.com/onsi/[email protected]/internal/suite/suite.go:79 +0x4d2
github.com/onsi/ginkgo.runSpecsWithCustomReporters({0x3a542a0?, 0xc0000cf1e0}, {0x3655d31, 0x8}, {0xc00077e800, 0x2, 0x367c5a9?})
    /home/prow/go/pkg/mod/github.com/onsi/[email protected]/ginkgo_dsl.go:245 +0x189
github.com/onsi/ginkgo.RunSpecsWithDefaultAndCustomReporters({0x3a542a0, 0xc0000cf1e0}, {0x3655d31, 0x8}, {0xc00009ef20, 0x1, 0x1})
    /home/prow/go/pkg/mod/github.com/onsi/[email protected]/ginkgo_dsl.go:228 +0x1b6
sigs.k8s.io/cluster-api-provider-azure/test/e2e.TestE2E(0x48?)
    /home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/e2e_suite_test.go:56 +0x19a
testing.tRunner(0xc0000cf1e0, 0x379c388)
    /usr/local/go/src/testing/testing.go:1439 +0x102
created by testing.(*T).Run
    /usr/local/go/src/testing/testing.go:1486 +0x35f
------------------------------
```
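For context on what that assertion is: the `Timed out after 1800.003s ... Expected <bool>: false to be true` failure comes from `framework.WaitForClusterDeleted` (cluster_helpers.go:175 in the stack trace), which repeatedly polls an "is the Cluster object gone yet?" boolean until it returns true or the interval expires. A minimal sketch of that pattern, with hypothetical names and assumed 30m/10s intervals rather than the real framework code:

```go
package e2e_sketch

import (
	"testing"
	"time"

	. "github.com/onsi/gomega"
)

// clusterDeleted is a hypothetical stand-in for the real check, which does a
// client.Get on the Cluster object and returns true once the API server
// reports NotFound.
func clusterDeleted() bool {
	return false // simulates the AKS cluster never finishing deletion
}

func TestWaitForClusterDeletedSketch(t *testing.T) {
	g := NewWithT(t)
	// The 30m/10s values are assumptions standing in for the
	// wait-delete-cluster interval from the e2e config; the boolean poll is
	// what produces "Expected <bool>: false to be true" on timeout.
	g.Eventually(clusterDeleted, 30*time.Minute, 10*time.Second).Should(BeTrue())
}
```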
I gather this is a known issue we were not able to pinpoint, but which is fixed in the main branch.
What did you expect to happen:
Anything else you would like to add:
Environment:
- cluster-api-provider-azure version:
- Kubernetes version: (use `kubectl version`):
- OS (e.g. from `/etc/os-release`):
@jackfrancis this is the same failure we've observed before. It currently seems to affect only release-1.4; release-1.5 and the main branch are not hitting the same issue.
This now affects all branches.
It appears that there is some correlation between large ginkgo test runs and the invocation of sigs.k8s.io/cluster-api/test/framework.DeleteAllClustersAndWait in capz's dumpSpecResourcesAndCleanup convenience func. That convenience func is called from a number of places, including from the ginkgo AfterEach nodes (see the sketch below). Perhaps there is some goroutine pathology somewhere.
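To make that call path concrete, here is a rough sketch of how every spec's teardown funnels through that one blocking helper. All names are simplified stand-ins except `AfterEach` and the `dumpSpecResourcesAndCleanup` role the helper plays; this is not the actual capz test code.

```go
package e2e_sketch

import (
	"context"

	. "github.com/onsi/ginkgo"
)

// cleanupSpecResources is a hypothetical stand-in for capz's
// dumpSpecResourcesAndCleanup (test/e2e/common.go): dump the spec's
// resources, call framework.DeleteAllClustersAndWait for the spec namespace,
// and block until every Cluster there is gone or the interval expires.
func cleanupSpecResources(ctx context.Context, namespace string) {
	_ = ctx
	_ = namespace
}

var _ = Describe("Workload cluster creation", func() {
	var (
		ctx       = context.Background()
		namespace string
	)

	// Every spec's teardown goes through the same blocking helper, so in a
	// large run many cluster deletions share this code path.
	AfterEach(func() {
		cleanupSpecResources(ctx, namespace)
	})

	It("Creating an AKS cluster", func() {
		namespace = "capz-e2e-xxxxxx" // placeholder for the per-spec namespace
		// spec body elided
	})
})
```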
I ran both the full test suite and just the AKS test locally and grabbed the full capz-controller-manager logs from each run (using main at 0b057a692a2fb421be7cedb413ba724559d753fb). Deleting the AKS cluster in a run by itself took ~15 minutes; deleting it alongside all the other tests took ~29 minutes, right up against the 30m timeout.
Breaking that down:
| service | AKS alone | all tests |
|---|---|---|
| managedclusters | 14m | 14m |
| subnets | 3s | 3s |
| virtualnetworks | 3s | 9.5m |
| group | 90s | 5m |
When running all the tests, "context deadline exceeded" errors start appearing once the vnet delete begins and keep showing up for the next ~3 minutes. After the last of those, it takes another 6 minutes before the next reconcile is invoked.
I'm not exactly sure how to interpret all this, but those were my main takeaways after a first pass.
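One pattern that would be consistent with what I'm seeing, though I haven't confirmed it's what capz is actually doing here, is a per-reconcile context deadline being hit by the long-running vnet delete, with the failed request then requeued with backoff. A toy sketch of that shape follows; all names, durations, and the backoff behavior are assumptions, scaled down to seconds so it runs quickly.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// deleteVirtualNetwork is a hypothetical stand-in for the Azure SDK's
// long-running vnet delete; like the real poller, it gives up when the
// context's deadline passes.
func deleteVirtualNetwork(ctx context.Context) error {
	select {
	case <-time.After(5 * time.Second): // pretend the delete takes this long
		return nil
	case <-ctx.Done():
		return ctx.Err() // surfaces as "context deadline exceeded"
	}
}

// reconcileDelete mimics one reconcile pass that runs under a deadline
// shorter than the delete itself.
func reconcileDelete(parent context.Context) error {
	ctx, cancel := context.WithTimeout(parent, 2*time.Second)
	defer cancel()
	if err := deleteVirtualNetwork(ctx); err != nil {
		return fmt.Errorf("deleting vnet: %w", err)
	}
	return nil
}

func main() {
	// A controller would requeue the failed request with backoff, so repeated
	// deadline errors plus growing backoff could account for multi-minute
	// gaps between reconciles during a slow deletion.
	if err := reconcileDelete(context.Background()); err != nil {
		fmt.Println(err)
	}
}
```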
Here's the full log from each run: https://gist.github.com/nojnhuh/c4769422d88501eeb052ef66fd0c1000
I suspect that these issues, observed in many tests over the past several months, are due to: https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/3075