capz-periodic-e2e-full-v1beta1-minus-1 CI job fails to delete AKS cluster
/kind bug
What steps did you take and what happened:
This new CI job has never passed. Digging into the logs, we see that the AKS test cluster fails to delete:
```
INFO: Waiting for the Cluster capz-e2e-293uvx/capz-e2e-293uvx-aks to be deleted
STEP: Waiting for cluster capz-e2e-293uvx-aks to be deleted
STEP: Redacting sensitive information from logs
• Failure in Spec Teardown (AfterEach) [3806.120 seconds]
Workload cluster creation
/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/azure_test.go:43
Creating an AKS cluster [EXPERIMENTAL] [AfterEach]
/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/azure_test.go:501
with a single control plane node and 1 node
/home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/azure_test.go:502

Timed out after 1800.003s.
Expected
    <bool>: false
to be true
/home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/[email protected]/framework/cluster_helpers.go:175

Full Stack Trace
sigs.k8s.io/cluster-api/test/framework.WaitForClusterDeleted({0x3a73458?, 0xc0001b4008}, {{0x7f15e41e6a90?, 0xc00071ecb0?}, 0xc0019fd100?}, {0xc000e732e0, 0x2, 0x2})
    /home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/[email protected]/framework/cluster_helpers.go:175 +0x183
sigs.k8s.io/cluster-api/test/framework.DeleteAllClustersAndWait({0x3a73458?, 0xc0001b4008}, {{0x3a800c0?, 0xc00071ecb0?}, {0xc000138a00?, 0xf?}}, {0xc000e732e0, 0x2, 0x2})
    /home/prow/go/pkg/mod/sigs.k8s.io/cluster-api/[email protected]/framework/cluster_helpers.go:271 +0x412
sigs.k8s.io/cluster-api-provider-azure/test/e2e.dumpSpecResourcesAndCleanup({0x3a73458, 0xc0001b4008}, {{0x3678dca, 0x17}, {0x3a81a78, 0xc000b0ab00}, {0xc0006ae240, 0xf}, 0xc0004438c0, 0xc00179a340, ...})
    /home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/common.go:169 +0x531
sigs.k8s.io/cluster-api-provider-azure/test/e2e.glob..func1.2()
    /home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/azure_test.go:132 +0x285
github.com/onsi/ginkgo/internal/leafnodes.(*runner).runSync(0x0?)
    /home/prow/go/pkg/mod/github.com/onsi/[email protected]/internal/leafnodes/runner.go:113 +0xb1
github.com/onsi/ginkgo/internal/leafnodes.(*runner).run(0x0?)
    /home/prow/go/pkg/mod/github.com/onsi/[email protected]/internal/leafnodes/runner.go:64 +0x125
github.com/onsi/ginkgo/internal/leafnodes.(*SetupNode).Run(0x15db0b9?)
    /home/prow/go/pkg/mod/github.com/onsi/[email protected]/internal/leafnodes/setup_nodes.go:15 +0x7b
github.com/onsi/ginkgo/internal/spec.(*Spec).runSample.func1()
    /home/prow/go/pkg/mod/github.com/onsi/[email protected]/internal/spec/spec.go:180 +0x209
github.com/onsi/ginkgo/internal/spec.(*Spec).runSample(0xc0001a2780, 0xc003aad9b0?, {0x3a50900, 0xc0001f48c0})
    /home/prow/go/pkg/mod/github.com/onsi/[email protected]/internal/spec/spec.go:218 +0x366
github.com/onsi/ginkgo/internal/spec.(*Spec).Run(0xc0001a2780, {0x3a50900, 0xc0001f48c0})
    /home/prow/go/pkg/mod/github.com/onsi/[email protected]/internal/spec/spec.go:138 +0xe7
github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).runSpec(0xc0003f6160, 0xc0001a2780)
    /home/prow/go/pkg/mod/github.com/onsi/[email protected]/internal/specrunner/spec_runner.go:200 +0xe8
github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).runSpecs(0xc0003f6160)
    /home/prow/go/pkg/mod/github.com/onsi/[email protected]/internal/specrunner/spec_runner.go:170 +0x1a5
github.com/onsi/ginkgo/internal/specrunner.(*SpecRunner).Run(0xc0003f6160)
    /home/prow/go/pkg/mod/github.com/onsi/[email protected]/internal/specrunner/spec_runner.go:66 +0xc5
github.com/onsi/ginkgo/internal/suite.(*Suite).Run(0xc00021acb0, {0x7f15e4354f68, 0xc0000cf1e0}, {0x3655d31, 0x8}, {0xc00077e820, 0x2, 0x2}, {0x3a76148, 0xc0001f48c0}, ...)
    /home/prow/go/pkg/mod/github.com/onsi/[email protected]/internal/suite/suite.go:79 +0x4d2
github.com/onsi/ginkgo.runSpecsWithCustomReporters({0x3a542a0?, 0xc0000cf1e0}, {0x3655d31, 0x8}, {0xc00077e800, 0x2, 0x367c5a9?})
    /home/prow/go/pkg/mod/github.com/onsi/[email protected]/ginkgo_dsl.go:245 +0x189
github.com/onsi/ginkgo.RunSpecsWithDefaultAndCustomReporters({0x3a542a0, 0xc0000cf1e0}, {0x3655d31, 0x8}, {0xc00009ef20, 0x1, 0x1})
    /home/prow/go/pkg/mod/github.com/onsi/[email protected]/ginkgo_dsl.go:228 +0x1b6
sigs.k8s.io/cluster-api-provider-azure/test/e2e.TestE2E(0x48?)
    /home/prow/go/src/sigs.k8s.io/cluster-api-provider-azure/test/e2e/e2e_suite_test.go:56 +0x19a
testing.tRunner(0xc0000cf1e0, 0x379c388)
    /usr/local/go/src/testing/testing.go:1439 +0x102
created by testing.(*T).Run
    /usr/local/go/src/testing/testing.go:1486 +0x35f
------------------------------
```
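For context on what that assertion is: the `Timed out after 1800.003s ... Expected <bool>: false to be true` failure comes from `framework.WaitForClusterDeleted` (cluster_helpers.go:175 in the stack trace), which repeatedly polls an "is the Cluster object gone yet?" boolean until it returns true or the interval expires. A minimal sketch of that pattern, with hypothetical names and assumed 30m/10s intervals rather than the real framework code:

```go
package e2e_sketch

import (
	"testing"
	"time"

	. "github.com/onsi/gomega"
)

// clusterDeleted is a hypothetical stand-in for the real check, which does a
// client.Get on the Cluster object and returns true once the API server
// reports NotFound.
func clusterDeleted() bool {
	return false // simulates the AKS cluster never finishing deletion
}

func TestWaitForClusterDeletedSketch(t *testing.T) {
	g := NewWithT(t)
	// The 30m/10s values are assumptions standing in for the
	// wait-delete-cluster interval from the e2e config; the boolean poll is
	// what produces "Expected <bool>: false to be true" on timeout.
	g.Eventually(clusterDeleted, 30*time.Minute, 10*time.Second).Should(BeTrue())
}
```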
I gather this is a known issue we were not able to pinpoint, but which is fixed in the main branch.
What did you expect to happen:
Anything else you would like to add:
Environment:
- cluster-api-provider-azure version:
- Kubernetes version: (use `kubectl version`):
- OS (e.g. from `/etc/os-release`):
@jackfrancis this is the same failure we've observed before. It currently seems to affect only release-1.4; release-1.5 and the main branch are not hitting the same issue.
This now affects all branches.
It appears that there is some correlation between large ginkgo test runs and the invocation of sigs.k8s.io/cluster-api/test/framework.DeleteAllClustersAndWait in capz's dumpSpecResourcesAndCleanup convenience func. That convenience func is called from a number of places, including from the ginkgo AfterEach nodes (see the sketch below). Perhaps there is some goroutine pathology somewhere.
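To make that call path concrete, here is a rough sketch of how every spec's teardown funnels through that one blocking helper. All names are simplified stand-ins except `AfterEach` and the `dumpSpecResourcesAndCleanup` role the helper plays; this is not the actual capz test code.

```go
package e2e_sketch

import (
	"context"

	. "github.com/onsi/ginkgo"
)

// cleanupSpecResources is a hypothetical stand-in for capz's
// dumpSpecResourcesAndCleanup (test/e2e/common.go): dump the spec's
// resources, call framework.DeleteAllClustersAndWait for the spec namespace,
// and block until every Cluster there is gone or the interval expires.
func cleanupSpecResources(ctx context.Context, namespace string) {
	_ = ctx
	_ = namespace
}

var _ = Describe("Workload cluster creation", func() {
	var (
		ctx       = context.Background()
		namespace string
	)

	// Every spec's teardown goes through the same blocking helper, so in a
	// large run many cluster deletions share this code path.
	AfterEach(func() {
		cleanupSpecResources(ctx, namespace)
	})

	It("Creating an AKS cluster", func() {
		namespace = "capz-e2e-xxxxxx" // placeholder for the per-spec namespace
		// spec body elided
	})
})
```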
I ran both the full test suite and just the AKS test locally and grabbed the full capz-controller-manager logs from each run (using main at 0b057a692a2fb421be7cedb413ba724559d753fb). Deleting the AKS cluster in a run by itself took ~15 minutes; deleting it alongside all the other tests took ~29 minutes, right up against the 30m timeout.
Breaking that down:
| service | AKS alone | all tests |
|---|---|---|
| managedclusters | 14m | 14m |
| subnets | 3s | 3s |
| virtualnetworks | 3s | 9.5m |
| group | 90s | 5m |
When running all the tests, "context deadline exceeded" errors start appearing once the vnet delete begins and keep showing up for the next ~3 minutes. After the last of those, it takes another 6 minutes before the next reconcile is invoked.
I'm not exactly sure how to interpret all this, but those were my main takeaways after a first pass.
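One pattern that would be consistent with what I'm seeing, though I haven't confirmed it's what capz is actually doing here, is a per-reconcile context deadline being hit by the long-running vnet delete, with the failed request then requeued with backoff. A toy sketch of that shape follows; all names, durations, and the backoff behavior are assumptions, scaled down to seconds so it runs quickly.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// deleteVirtualNetwork is a hypothetical stand-in for the Azure SDK's
// long-running vnet delete; like the real poller, it gives up when the
// context's deadline passes.
func deleteVirtualNetwork(ctx context.Context) error {
	select {
	case <-time.After(5 * time.Second): // pretend the delete takes this long
		return nil
	case <-ctx.Done():
		return ctx.Err() // surfaces as "context deadline exceeded"
	}
}

// reconcileDelete mimics one reconcile pass that runs under a deadline
// shorter than the delete itself.
func reconcileDelete(parent context.Context) error {
	ctx, cancel := context.WithTimeout(parent, 2*time.Second)
	defer cancel()
	if err := deleteVirtualNetwork(ctx); err != nil {
		return fmt.Errorf("deleting vnet: %w", err)
	}
	return nil
}

func main() {
	// A controller would requeue the failed request with backoff, so repeated
	// deadline errors plus growing backoff could account for multi-minute
	// gaps between reconciles during a slow deletion.
	if err := reconcileDelete(context.Background()); err != nil {
		fmt.Println(err)
	}
}
```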
Here's the full log from each run: https://gist.github.com/nojnhuh/c4769422d88501eeb052ef66fd0c1000
I suspect that these issues, observed in many tests over the past several months, are due to: https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/3075