cockroach icon indicating copy to clipboard operation
cockroach copied to clipboard

ccl/backupccl: TestBackupRestoreExecLocality failed

Open cockroach-teamcity opened this issue 1 year ago • 1 comments
trafficstars

ccl/backupccl.TestBackupRestoreExecLocality failed on master @ d13253527955eaa2da09394b8a2729627ab25c48:

Fatal error:

panic: Counters should not decrease [recovered]
	panic: Counters should not decrease

Stack:

goroutine 282 [running]:
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).recover(0xc008eea290?, {0xe06fb98, 0xc0013938f0})
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:231 +0x74
panic({0x95c61c0?, 0xe002c30?})
	GOROOT/src/runtime/panic.go:770 +0x132
github.com/cockroachdb/cockroach/pkg/util/metric.(*Counter).Update(0xc006e3bd40, 0x13bdb91)
	github.com/cockroachdb/cockroach/pkg/util/metric/pkg/util/metric/metric.go:755 +0x6d
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*StoreMetrics).updateEngineMetrics(_, {0xc00edfb608, {0x5bae427, 0x59e182d, 0x14e2be4, 0x788d, 0xd2a4, 0x77c1, 0x1d451}, {0x33f3, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics.go:3793 +0x1a2e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).computeMetrics(_, {_, _})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:3478 +0x1d8
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).ComputeMetricsPeriodically(_, {_, _}, _, _)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:3503 +0xb0
github.com/cockroachdb/cockroach/pkg/server.(*Node).computeMetricsPeriodically.func1(0xc0066a3408)
	github.com/cockroachdb/cockroach/pkg/server/node.go:1035 +0xe6
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).VisitStores.func1(0x97c62c0?, 0xc0066a3408)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:150 +0x58
github.com/cockroachdb/cockroach/pkg/util/syncutil.(*IntMap).Range(0xc001409d78, 0xc008eebc00)
	github.com/cockroachdb/cockroach/pkg/util/syncutil/int_map.go:385 +0x19d
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).VisitStores(0xc001409d40, 0xc017f93c48)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:149 +0x8e
github.com/cockroachdb/cockroach/pkg/server.(*Node).computeMetricsPeriodically(0xc002f4b808, {0xe06fb98, 0xc0013938f0}, 0xc017f93e40, 0x6)
	github.com/cockroachdb/cockroach/pkg/server/node.go:1034 +0xcf
github.com/cockroachdb/cockroach/pkg/server.(*Node).startComputePeriodicMetrics.func2({0xe06fb98, 0xc0013938f0})
	github.com/cockroachdb/cockroach/pkg/server/node.go:1007 +0x273
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2()
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:485 +0x263
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx in goroutine 9
	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:476 +0x69d
Log preceding fatal error

=== RUN   TestBackupRestoreExecLocality
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestBackupRestoreExecLocality3188356672
    test_log_scope.go:81: use -show-logs to present logs inline
*
* ERROR: a panic has occurred!
* panic: Counters should not decrease
* (1) attached stack trace
*   -- stack trace:
*   | runtime.gopanic
*   | 	GOROOT/src/runtime/panic.go:770
*   | github.com/cockroachdb/cockroach/pkg/util/metric.(*Counter).Update
*   | 	github.com/cockroachdb/cockroach/pkg/util/metric/pkg/util/metric/metric.go:755
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*StoreMetrics).updateEngineMetrics
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics.go:3793
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).computeMetrics
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:3478
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).ComputeMetricsPeriodically
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:3503
*   | github.com/cockroachdb/cockroach/pkg/server.(*Node).computeMetricsPeriodically.func1
*   | 	github.com/cockroachdb/cockroach/pkg/server/node.go:1035
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).VisitStores.func1
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:150
*   | github.com/cockroachdb/cockroach/pkg/util/syncutil.(*IntMap).Range
*   | 	github.com/cockroachdb/cockroach/pkg/util/syncutil/int_map.go:385
*   | github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).VisitStores
*   | 	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:149
*   | github.com/cockroachdb/cockroach/pkg/server.(*Node).computeMetricsPeriodically
*   | 	github.com/cockroachdb/cockroach/pkg/server/node.go:1034
*   | github.com/cockroachdb/cockroach/pkg/server.(*Node).startComputePeriodicMetrics.func2
*   | 	github.com/cockroachdb/cockroach/pkg/server/node.go:1007
*   | github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2
*   | 	github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:485
*   | runtime.goexit
*   | 	src/runtime/asm_amd64.s:1695
* Wraps: (2) panic: Counters should not decrease
* Error types: (1) *withstack.withStack (2) *errutil.leafError
*

Parameters:

  • attempt=1
  • race=true
  • run=2
  • shard=22
Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-39937

cockroach-teamcity avatar Jun 30 '24 12:06 cockroach-teamcity

@aadityasondhi did you recently fix a similar issue?

dt avatar Jun 30 '24 18:06 dt

This one fails on sm.WALBytesIn.

aadityasondhi avatar Jul 02 '24 20:07 aadityasondhi

This is a little confusing to me. In pebble, this metric is an atomic that we only increment using a uint.

Here is where it is set. https://github.com/cockroachdb/pebble/blob/200f9cf1e217afc4e9052891126be661d757007a/db.go#L1974

Here is where it is incremented. https://github.com/cockroachdb/pebble/blob/200f9cf1e217afc4e9052891126be661d757007a/db.go#L973

aadityasondhi avatar Jul 02 '24 20:07 aadityasondhi

Is it perhaps possible that CockroachDB collects and updated metrics in parallel? It's possible that two goroutines get the metrics at around the same time but then they update the counters in the reverse order.

Indeed, I see another goroutine inside ComputeMetrics in the test failure:

github.com/cockroachdb/cockroach/pkg/kv/kvserver/replicastats.(*ReplicaStats).AverageRatePerSecond(0xc0056aa6c0, {0x17ddc3a65714f7c7?, 0xc008afa4f0?, 0x0?})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/replicastats/replica_stats.go:357 +0x22b
github.com/cockroachdb/cockroach/pkg/kv/kvserver/load.(*ReplicaLoad).getLocked(0xc008afa4e0, 0x3)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/load/replica_load.go:184 +0x98
github.com/cockroachdb/cockroach/pkg/kv/kvserver/load.(*ReplicaLoad).Stats(0xc008afa4e0)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/load/replica_load.go:199 +0x12a
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).updateReplicationGauges.func1(0xc00a8b1608)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:3282 +0x872
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*storeReplicaVisitor).Visit(0xc005230b40, 0xc013cf8eb8)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:556 +0x326
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).updateReplicationGauges(0xc005a60c08, {0xe06fb98, 0xc006a0db00})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:3231 +0x92f
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).computeMetrics(_, {_, _})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:3471 +0xe6
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).ComputeMetrics(...)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:3590
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).WaitForFullReplication.func2(0xc005a60c08)
	github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:1443 +0x16e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).VisitStores.func1(0x97c62c0?, 0xc005a60c08)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:150 +0x58
github.com/cockroachdb/cockroach/pkg/util/syncutil.(*IntMap).Range(0xc003ff4db8, 0xc013cf9710)
	github.com/cockroachdb/cockroach/pkg/util/syncutil/int_map.go:385 +0x19d
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).VisitStores(0xc003ff4d80, 0xc013cf9928)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:149 +0x8e
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).WaitForFullReplication(0xc004227808)
	github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:1432 +0x5ae
github.com/cockroachdb/cockroach/pkg/ccl/backupccl/backuptestutils.StartBackupRestoreTestCluster({0xe0facf0, 0xc0022ba680}, 0x4, {0xc002f58c30, 0x3, 0x4b14d0?})
	github.com/cockroachdb/cockroach/pkg/ccl/backupccl/backuptestutils/testutils.go:162 +0x967
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.backupRestoreTestSetupWithParams({_, _}, _, _, _, {{{{0x0, 0x0}, {0x0, 0x0}, {0x0, ...}, ...}, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/ccl/backupccl/utils_test.go:78 +0x1bf
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.TestBackupRestoreExecLocality(0xc0022ba680)
	github.com/cockroachdb/cockroach/pkg/ccl/backupccl/backup_test.go:477 +0x8da

I think we should add some synchronization around the entire ComputeMetrics operation.

RaduBerinde avatar Jul 03 '24 04:07 RaduBerinde

ccl/backupccl.TestBackupRestoreExecLocality failed on master @ c557fb59f6aec659d364e9002fc083c59c6392b6:

Fatal error:

panic: Counters should not decrease [recovered]
	panic: Counters should not decrease [recovered]
	panic: Counters should not decrease [recovered]
	panic: Counters should not decrease

Stack:

goroutine 52 [running]:
testing.tRunner.func1.2({0x95e58a0, 0xe02eb10})
	GOROOT/src/testing/testing.go:1631 +0x3f7
testing.tRunner.func1()
	GOROOT/src/testing/testing.go:1634 +0x6b6
panic({0x95e58a0?, 0xe02eb10?})
	GOROOT/src/runtime/panic.go:770 +0x132
github.com/cockroachdb/cockroach/pkg/util/leaktest.AfterTest.func2()
	github.com/cockroachdb/cockroach/pkg/util/leaktest/leaktest.go:133 +0x4fa
panic({0x95e58a0?, 0xe02eb10?})
	GOROOT/src/runtime/panic.go:770 +0x132
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start.func1()
	github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:393 +0x99
panic({0x95e58a0?, 0xe02eb10?})
	GOROOT/src/runtime/panic.go:770 +0x132
github.com/cockroachdb/cockroach/pkg/util/metric.(*Counter).Update(0xc006905920, 0x998c0ba)
	github.com/cockroachdb/cockroach/pkg/util/metric/pkg/util/metric/metric.go:755 +0x6d
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*StoreMetrics).updateEngineMetrics(_, {0xc00e986408, {0x998c0ba, 0x2fe7c34, 0x55320878, 0x6a35, 0xc897, 0x6969, 0x140d1}, {0x2d16, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/metrics.go:3790 +0xfd7
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).computeMetrics(_, {_, _})
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:3487 +0x1d8
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).ComputeMetrics(...)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store.go:3599
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).WaitForFullReplication.func2(0xc00692ac08)
	github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:1443 +0x16e
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).VisitStores.func1(0x97e6560?, 0xc00692ac08)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:150 +0x58
github.com/cockroachdb/cockroach/pkg/util/syncutil.(*IntMap).Range(0xc004067988, 0xc01245cee0)
	github.com/cockroachdb/cockroach/pkg/util/syncutil/int_map.go:385 +0x19d
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).VisitStores(0xc004067950, 0xc016db10f8)
	github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/stores.go:149 +0x8e
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).WaitForFullReplication(0xc003c78308)
	github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:1432 +0x5ae
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.(*TestCluster).Start(0xc003c78308, {0xe0db2d0, 0xc003c45a00})
	github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:456 +0xa38
github.com/cockroachdb/cockroach/pkg/testutils/testcluster.StartTestCluster({_, _}, _, {{{{0x0, 0x0}, {0x0, 0x0}, {0x0, 0x0}, {0x0, ...}, ...}, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/testutils/testcluster/testcluster.go:238 +0x92
github.com/cockroachdb/cockroach/pkg/ccl/backupccl/backuptestutils.StartBackupRestoreTestCluster({0xe126fb0, 0xc003c45a00}, 0x4, {0xc00418ec30, 0x3, 0x4b14d0?})
	github.com/cockroachdb/cockroach/pkg/ccl/backupccl/backuptestutils/testutils.go:130 +0x313
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.backupRestoreTestSetupWithParams({_, _}, _, _, _, {{{{0x0, 0x0}, {0x0, 0x0}, {0x0, ...}, ...}, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/ccl/backupccl/utils_test.go:78 +0x1bf
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.TestBackupRestoreExecLocality(0xc003c45a00)
	github.com/cockroachdb/cockroach/pkg/ccl/backupccl/backup_test.go:477 +0x8da
testing.tRunner(0xc003c45a00, 0xa6f5fd8)
	GOROOT/src/testing/testing.go:1689 +0x21f
created by testing.(*T).Run in goroutine 1
	GOROOT/src/testing/testing.go:1742 +0x826
Log preceding fatal error

=== RUN   TestBackupRestoreExecLocality
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestBackupRestoreExecLocality1780957092
    test_log_scope.go:81: use -show-logs to present logs inline
    testcluster.go:393: -- test log scope end --

ERROR: a panic has occurred!
Details cannot be printed yet because we are still unwinding.
Hopefully the test harness prints the panic below, otherwise check the test logs.

test logs left over in: outputs.zip/logTestBackupRestoreExecLocality1780957092
    testcluster.go:393: panic: Counters should not decrease
--- FAIL: TestBackupRestoreExecLocality (102.19s)

Parameters:

  • attempt=1
  • race=true
  • run=3
  • shard=22
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

cockroach-teamcity avatar Jul 06 '24 12:07 cockroach-teamcity