roachtest: acceptance/gossip/locality-address failed

Open cockroach-teamcity opened this issue 1 year ago • 4 comments

roachtest.acceptance/gossip/locality-address failed with artifacts on master @ 59b261a579cbe2c032a5dd3e182ff67aeee900b9:

(test_runner.go:1237).runTest: test timed out (10m0s)
test artifacts and logs in: /artifacts/acceptance/gossip/locality-address/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-38636

cockroach-teamcity • May 11 '24 06:05

goroutine 11113 [sync.Mutex.Lock, 1 minutes]:
sync.runtime_SemacquireMutex(0x747b95e?, 0x6?, 0x7f0b029297f8?)
	GOROOT/src/runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc001c92150)
	GOROOT/src/sync/mutex.go:171 +0x15d
sync.(*Mutex).Lock(0x10?)
	GOROOT/src/sync/mutex.go:90 +0x32
github.com/cockroachdb/cockroach/pkg/roachprod/vm/gce.Init.NewDNSProvider.func1(0xc003466dc0)
	github.com/cockroachdb/cockroach/pkg/roachprod/vm/gce/dns.go:59 +0x49
github.com/cockroachdb/cockroach/pkg/roachprod/vm/gce.(*dnsProvider).CreateRecords(0xc0035f6048, {0x9078530, 0xc001e90b40}, {0xc0014e61c0, 0x2, 0xc0014e61c0?})
	github.com/cockroachdb/cockroach/pkg/roachprod/vm/gce/dns.go:116 +0x73e
github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).RegisterServices.func1({0x7f0b00df4c50, 0xda89220})
	github.com/cockroachdb/cockroach/pkg/roachprod/install/services.go:324 +0x354
github.com/cockroachdb/cockroach/pkg/roachprod/vm.ForDNSProvider({0xc0049f5fc0, 0x3}, 0xc002dd8da8)
	github.com/cockroachdb/cockroach/pkg/roachprod/vm/dns.go:116 +0x122
github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).RegisterServices(0xc0031bc360, {0x9078530, 0xc001e90b40}, {0xc0027146c0, 0x2, 0x7714e01?})
	github.com/cockroachdb/cockroach/pkg/roachprod/install/services.go:308 +0x348
github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).maybeRegisterServices(0xc0031bc360, {0x9078530, 0xc001e90b40}, 0xc004f19bc0, {0x0, {0xc0020847c0, 0x1, 0x1}, 0x1, {0x7714e01, ...}, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/roachprod/install/cockroach.go:292 +0x2a5
github.com/cockroachdb/cockroach/pkg/roachprod/install.(*SyncedCluster).Start(0xc0031bc360, {0x9078530, 0xc001e90b40}, 0xc004f19bc0, {0x0, {0xc0020847c0, 0x1, 0x1}, 0x1, {0x7714e01, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/roachprod/install/cockroach.go:408 +0x376
github.com/cockroachdb/cockroach/pkg/roachprod.Start({0x9078530, 0xc001e90b40}, 0xc004f19bc0, {0xc0054da360?, 0xc002180008?}, {0x0, {0xc0020847c0, 0x1, 0x1}, 0x1, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/roachprod/roachprod.go:751 +0xba
main.(*clusterImpl).StartE(_, {_, _}, _, {{0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}, ...}, ...)
	main/pkg/cmd/roachtest/cluster.go:2076 +0x46e
main.(*clusterImpl).Start(_, {_, _}, _, {{0x0, 0x0, 0x0}, {0x0, 0x0, 0x0}, ...}, ...)
	main/pkg/cmd/roachtest/cluster.go:2236 +0xbe
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.runCheckLocalityIPAddress({0x9078530, 0xc001e90b40}, {0x911f5a0, 0xc0018b9760}, {0x916bf30, 0xc002844488})
	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/gossip.go:516 +0x283
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerAcceptance.func1({0x9078530?, 0xc001e90b40?}, {0x911f5a0?, 0xc0018b9760?}, {0x916bf30?, 0xc002844488?})
	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/acceptance.go:152 +0x3a
main.(*testRunner).runTest.func2()
	main/pkg/cmd/roachtest/test_runner.go:1208 +0xf2
created by main.(*testRunner).runTest in goroutine 74
	main/pkg/cmd/roachtest/test_runner.go:1192 +0x927

Node startup failed. Based on the stacks in __stacks.log, this looks like some kind of infra failure. I'll move this to test-eng, who may know more.

nvb • May 13 '24 12:05

cc @cockroachdb/test-eng

blathers-crl[bot] • May 13 '24 12:05

@herkolategan This is the mutex in NewDNSProviderWithExec,

return NewDNSProviderWithExec(func(cmd *exec.Cmd) ([]byte, error) {
	// Limit to one gcloud command at a time. At this time we are unsure if it's
	// safe to make concurrent calls to the `gcloud` CLI to mutate DNS records
	// in the same zone. We don't mutate the same record in parallel, but we do
	// mutate different records in the same zone. See: #122180 for more details.
	gcloudMu.Lock()
	defer gcloudMu.Unlock()
	return cmd.CombinedOutput()
})

which contends with the DNS cleanup in WipeForReuse (goroutine 68 below) and causes the test to time out. Did we hear back from GCE support on whether the global lock is required?

goroutine 68 [semacquire, 4 minutes]:
sync.runtime_Semacquire(0x0?)
        GOROOT/src/runtime/sema.go:62 +0x25
sync.(*WaitGroup).Wait(0xc002a1ad80?)
        GOROOT/src/sync/waitgroup.go:116 +0x48
golang.org/x/sync/errgroup.(*Group).Wait(0xc001efe440)
        golang.org/x/sync/errgroup/external/org_golang_x_sync/errgroup/errgroup.go:56 +0x25
github.com/cockroachdb/cockroach/pkg/roachprod/vm.FanOutDNS({0xc005848808, 0x4, 0x26?}, 0xc001cb88a0)
        github.com/cockroachdb/cockroach/pkg/roachprod/vm/dns.go:99 +0x33c
github.com/cockroachdb/cockroach/pkg/roachprod.DestroyDNS({0x9078568, 0xc001ac5770}, 0x0?, {0xc0024c0090?, 0x4?})
        github.com/cockroachdb/cockroach/pkg/roachprod/roachprod.go:2291 +0xaa
main.(*clusterImpl).DestroyDNS(...)
        main/pkg/cmd/roachtest/cluster.go:2955
main.(*clusterImpl).WipeForReuse(0xc0014e3688, {_, _}, _, {{0x0, 0x0}, 0x4, 0x4, 0x0, 0x0, ...})
        main/pkg/cmd/roachtest/cluster.go:2943 +0x46e
main.(*testRunner).runWorker(0xc002714360, {0x9078568?, 0xc001771140?}, {0xc001c939ac, _}, _, _, _, _, {0x1, ...}, ...)
        main/pkg/cmd/roachtest/test_runner.go:631 +0x5d8
main.(*testRunner).Run.func1({0x9078568, 0xc001771140})
        main/pkg/cmd/roachtest/test_runner.go:366 +0x252
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
        github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:485 +0x13a
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx in goroutine 1
        github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:476 +0x3fe

srosenberg • May 15 '24 01:05

@srosenberg Thanks for the extra info. Renato and I were looking at this earlier. I'll create a support ticket; I had planned to create one only if there were issues around the mutex, and unfortunately it doesn't seem to scale.

herkolategan • May 15 '24 16:05

We have a support ticket open with Google: https://console.cloud.google.com/support/cases/detail/v2/51236386?project=cockroach-shared

herkolategan • Jun 05 '24 15:06