roachtest: tpcc/mixed-headroom/n5cpu16 failed
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 74333311616b937fea6a995462215a1cb5962686:
(test_runner.go:1313).runTest: test timed out (7h0m0s)
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1
Parameters:
ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_coverageBuild=false, ROACHTEST_cpu=16, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_runtimeAssertionsBuild=false, ROACHTEST_ssd=0
Jira issue: CRDB-42654
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ ec2573dc6aaeefc226440bb2c5a7c94a63989868:
(mixedversion.go:737).Run: preparing to run step 10: failed to get cluster version for node 1 (mixed-version-tenant-uryif): pq: query execution canceled
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/cpu_arch=arm64/run_1
Parameters:
ROACHTEST_arch=arm64, ROACHTEST_cloud=gce, ROACHTEST_coverageBuild=false, ROACHTEST_cpu=16, ROACHTEST_encrypted=false, ROACHTEST_fs=zfs, ROACHTEST_localSSD=false, ROACHTEST_runtimeAssertionsBuild=false, ROACHTEST_ssd=0
Removing release blocker. I think I know what's causing this failure; I'll update once I confirm (test update).
An update on this issue: when I sent the message above, I thought that the issue was that the tenant was being rate limited and if we fixed that, the IMPORT would run fine (at a pace comparable to other deployment modes).
However, my attempts to stop rate limiting/throttling have so far been unsuccessful (see #131952, based on an internal thread). Even with those changes, IMPORT is still infinitely slow on the tenant, and there are several log entries like the one below:
W241001 12:53:21.151082 2076 kv/kvclient/kvcoord/dist_sender.go:2193 ⋮ [T3,Vmixed-version-tenant-xhfu8,nsql1,f‹f2b40a2a›,job=1008325319403143169,distsql.gateway=1,distsql.appname=‹$ internal-resume-job-1008325319403143169›] 287 slow range RPC: have been waiting 62.56s (1 attempts) for RPC AddSSTable [/Tenant/3/Table/112/1/‹705›/‹"ƍ>\xea2RH\x00\x80\x00\x00\x00\x01B\xba\xce"›/‹0›,/Tenant/3/Table/112/1/‹711›/‹"\xc8RG0N\x01@\x00\x80\x00\x00\x00\x01E\x9b,"›/‹0›/‹NULL›) to r554:‹/Tenant/3/Table/112/1/7{05/"ƍ>\xea2RH\x00\x80\x00\x00\x00\x01B\xba\xce"-16/"\xc9\xd3\xde_K{H\x00\x80\x00\x00\x00\x01H\r\xeb"}› [(n4,s4):1, (n3,s3):5, (n1,s1):3, next=6, gen=79, sticky=1727790472.278327789,0]; resp: ‹(err: <nil>), *kvpb.AddSSTableResponse›
There's either some other setting to be toggled to fully lift restrictions on the tenant, or something is actually wrong (I find that unlikely, since presumably people are running IMPORTs just fine out there).
TLDR: this one will need more investigation. Once we get to the bottom of this issue, we'll likely be able to enable more tests in separate-process deployments too (#130968), as I suspect that this issue is the same as the one that causes tests to time out running simple queries on the tenant.
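As a rough aid for triaging warnings like the one quoted above, the wait time and RPC method can be pulled out of each "slow range RPC" line. This is a minimal sketch that assumes the message format shown in that single log entry; `parseSlowRPC` is a hypothetical helper, not part of CockroachDB.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// slowRPC matches the "slow range RPC" warning text as it appears in the
// log entry quoted in this thread. Other message variants are not handled.
var slowRPC = regexp.MustCompile(
	`slow range RPC: have been waiting ([0-9.]+)s \((\d+) attempts\) for RPC (\w+)`)

// parseSlowRPC returns the wait in seconds and the RPC method name.
// ok is false when the line is not a slow range RPC warning.
func parseSlowRPC(line string) (secs float64, method string, ok bool) {
	m := slowRPC.FindStringSubmatch(line)
	if m == nil {
		return 0, "", false
	}
	v, err := strconv.ParseFloat(m[1], 64)
	if err != nil {
		return 0, "", false
	}
	return v, m[3], true
}

func main() {
	// Shortened copy of the warning from the thread.
	line := "287 slow range RPC: have been waiting 62.56s (1 attempts) for RPC AddSSTable [/Tenant/3/Table/112/1"
	if secs, method, ok := parseSlowRPC(line); ok {
		fmt.Printf("%s waited %.2fs\n", method, secs)
	}
}
```

Running this over the tenant's logs and grouping by method would show whether the slowness is specific to AddSSTable or affects other requests too.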
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ dcce4cafa234525fc859d32745c11ed87890dc7b:
(mixedversion.go:732).Run: preparing to run step 11: failed to get cluster version for node 1 (mixed-version-tenant-ybexf): pq: query execution canceled
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1
Parameters:
ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_coverageBuild=false, ROACHTEST_cpu=16, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_runtimeAssertionsBuild=false, ROACHTEST_ssd=0
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 58c475d67e32b75284b4fe293bff82807c3d129d:
(test_runner.go:1308).runTest: test timed out (7h0m0s)
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1
Parameters:
ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_coverageBuild=false, ROACHTEST_cpu=16, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_runtimeAssertionsBuild=false, ROACHTEST_ssd=0
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ fd4b1464dbd6e385c6e51af26fe294fd2023a259:
(mixedversion.go:732).Run: mixed-version test failure while running step 10 (run "load bank dataset"): full command output in run_123824.313069581_n2_cockroach-workload-f.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1
Parameters:
ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_coverageBuild=false, ROACHTEST_cpu=16, ROACHTEST_encrypted=true, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_runtimeAssertionsBuild=false, ROACHTEST_ssd=0
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 30dbb173d0f083b35cf9eb8093832a5dd764c5af:
(test_runner.go:1308).runTest: test timed out (7h0m0s)
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1
Parameters:
ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_coverageBuild=false, ROACHTEST_cpu=16, ROACHTEST_encrypted=true, ROACHTEST_fs=zfs, ROACHTEST_localSSD=true, ROACHTEST_runtimeAssertionsBuild=false, ROACHTEST_ssd=0
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 30dbb173d0f083b35cf9eb8093832a5dd764c5af:
(mixedversion.go:732).Run: preparing to run step 11: failed to get cluster version for node 2 (mixed-version-tenant-tyasl): pq: query execution canceled
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1
Parameters:
ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_coverageBuild=false, ROACHTEST_cpu=16, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_runtimeAssertionsBuild=true, ROACHTEST_ssd=0
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 5be5b0b52ff79b98689b2282a8b25cf9eb50ec40:
(test_runner.go:1308).runTest: test timed out (7h0m0s)
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1
Parameters:
ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_coverageBuild=false, ROACHTEST_cpu=16, ROACHTEST_encrypted=true, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_runtimeAssertionsBuild=false, ROACHTEST_ssd=0
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 472ea07a5232c98536293d13bb46cca59f9f2cd0:
(test_runner.go:1310).runTest: test timed out (7h0m0s)
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/cpu_arch=arm64/run_1
Parameters:
ROACHTEST_arch=arm64, ROACHTEST_cloud=gce, ROACHTEST_coverageBuild=false, ROACHTEST_cpu=16, ROACHTEST_encrypted=true, ROACHTEST_fs=ext4, ROACHTEST_localSSD=false, ROACHTEST_runtimeAssertionsBuild=false, ROACHTEST_ssd=0
Same failure on other branches
- #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3 release-blocker]
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 472ea07a5232c98536293d13bb46cca59f9f2cd0:
(test_runner.go:1310).runTest: test timed out (7h0m0s)
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1
Parameters:
ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_coverageBuild=false, ROACHTEST_cpu=16, ROACHTEST_encrypted=true, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_runtimeAssertionsBuild=true, ROACHTEST_ssd=0
Same failure on other branches
- #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3 release-blocker]
An update on this issue: when I sent the message above, I thought that the issue was that the tenant was being rate limited and if we fixed that, the IMPORT would run fine (at a pace comparable to other deployment modes).
However, my attempts to stop rate limiting/throttling have so far been unsuccessful (see #131952, based on an internal thread). Even with those changes, IMPORT is still infinitely slow on the tenant, and there are several log entries like the one below:
It does appear that rate limiting is the issue. During a manual run, a goroutine dump revealed a large number of quotapool.Acquire calls with waiting times ranging from 2 minutes to 38 minutes:
goroutine 253 [select, 10 minutes]:
github.com/cockroachdb/cockroach/pkg/util/quotapool.(*AbstractPool).Acquire(, , , , )
	github.com/cockroachdb/cockroach/pkg/util/quotapool/quotapool.go:281
github.com/cockroachdb/cockroach/pkg/ccl/multitenantccl/tenantcostclient.(*limiter).Wait(, , , )
	github.com/cockroachdb/cockroach/pkg/ccl/multitenantccl/tenantcostclient/limiter.go:125
github.com/cockroachdb/cockroach/pkg/ccl/multitenantccl/tenantcostclient.(*tenantSideCostController).OnRequestWait(, , )
	github.com/cockroachdb/cockroach/pkg/ccl/multitenantccl/tenantcostclient/tenant_side.go:776
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendToReplicas(, , , , , , , , , )
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:2328
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch(, , , , , , , , , , ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1920
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges(, , , , , , , , , , ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1488
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send(, , , )
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1104
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnLockGatekeeper).SendLocked(, , , )
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_lock_gatekeeper.go:82
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnMetricRecorder).SendLocked(, , , )
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_metric_recorder.go:47
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).sendLockedWithRefreshAttempts(, , , , )
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).SendLocked(, , , )
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:150
...
Histogram of waiting times (goroutine count, wait):
16  2 minutes
 1  4 minutes
 1  6 minutes
 5  8 minutes
 8 10 minutes
 1 34 minutes
 3 38 minutes
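A tally like the one above can be reproduced from a goroutine dump with a short program. The sketch below assumes the standard Go dump header format (`goroutine N [state, M minutes]:`) and counts only goroutines whose stack mentions a given frame; `waitHistogram` is a hypothetical helper, and goroutines blocked for less than a minute (which carry no duration in the header) are skipped.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"sort"
	"strconv"
	"strings"
)

// header matches goroutine headers that carry a wait time, e.g.
// "goroutine 253 [select, 10 minutes]:".
var header = regexp.MustCompile(`^goroutine \d+ \[[^,\]]+, (\d+) minutes`)

// waitHistogram maps wait time in minutes to the number of goroutines
// whose stack contains the frame substring.
func waitHistogram(dump, frame string) map[int]int {
	hist := map[int]int{}
	wait, matched := -1, false
	// flush records the goroutine parsed so far, then resets the state.
	flush := func() {
		if wait >= 0 && matched {
			hist[wait]++
		}
		wait, matched = -1, false
	}
	sc := bufio.NewScanner(strings.NewReader(dump))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "goroutine ") {
			flush()
			if m := header.FindStringSubmatch(line); m != nil {
				wait, _ = strconv.Atoi(m[1])
			}
			continue
		}
		if strings.Contains(line, frame) {
			matched = true
		}
	}
	flush()
	return hist
}

func main() {
	// Tiny made-up dump for illustration only.
	dump := strings.Join([]string{
		"goroutine 1 [select, 10 minutes]:",
		"quotapool.(*AbstractPool).Acquire()",
		"",
		"goroutine 2 [select, 2 minutes]:",
		"quotapool.(*AbstractPool).Acquire()",
	}, "\n")
	hist := waitHistogram(dump, "quotapool.(*AbstractPool).Acquire")
	keys := make([]int, 0, len(hist))
	for k := range hist {
		keys = append(keys, k)
	}
	sort.Ints(keys) // numeric order, unlike a plain text sort
	for _, k := range keys {
		fmt.Printf("%3d  %d minutes\n", hist[k], k)
	}
}
```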
Also, attempts to retrieve an auth cookie via
./cockroach auth-session login --url postgres://roachprod:[email protected]:29000 --certs-dir=certs --only-cookie --expire-after 24h roachprod
timed out:
W241022 21:05:28.507909 97988 sql/user.go:195 ⋮ [T2,Vmixed-version-tenant-kwej1,nsql1,client=10.142.0.123:43302,hostssl,user=‹roachprod›] 501 user membership lookup for ‹"roachprod"› failed: operation ‹"get-user-session"› timed out after 10.001s (given timeout 10s): internal error while retrieving user account: get default settings error: interrupted during singleflight ‹load-value:defaultsettings-roachprod-1-1›: context deadline exceeded
W241022 21:05:28.507958 97988 sql/pgwire/auth.go:159 ⋮ [T2,Vmixed-version-tenant-kwej1,nsql1,client=10.142.0.123:43302,hostssl,user=‹roachprod›] 502 user retrieval failed for user=‹"roachprod"›: internal error while retrieving user account memberships: operation ‹"get-user-session"› timed out after 10.001s (given timeout 10s): internal error while retrieving user account: get default settings error: interrupted during singleflight ‹load-value:defaultsettings-roachprod-1-1›: context deadline exceeded
I'll see if the issue still reproduces without the tenantcostclient limiter.
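One plausible reading of the two symptoms above is that a caller with a deadline blocks on an exhausted quota pool and gives up with "context deadline exceeded". The following is a toy illustration of that failure shape only: `toyPool` is a made-up stand-in, not CockroachDB's quotapool or tenantcostclient code, and the thread has not confirmed that the auth lookup was blocked on the limiter.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// toyPool is a hypothetical token pool with no refill.
type toyPool struct {
	tokens chan struct{}
}

// newToyPool returns a pool preloaded with n tokens.
func newToyPool(n int) *toyPool {
	p := &toyPool{tokens: make(chan struct{}, n)}
	for i := 0; i < n; i++ {
		p.tokens <- struct{}{}
	}
	return p
}

// Acquire blocks until a token is available or ctx is done.
func (p *toyPool) Acquire(ctx context.Context) error {
	select {
	case <-p.tokens:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// timesOut reports whether acquiring a token within ms milliseconds
// fails with context.DeadlineExceeded.
func timesOut(p *toyPool, ms int) bool {
	ctx, cancel := context.WithTimeout(
		context.Background(), time.Duration(ms)*time.Millisecond)
	defer cancel()
	return errors.Is(p.Acquire(ctx), context.DeadlineExceeded)
}

func main() {
	fmt.Println(timesOut(newToyPool(0), 50)) // empty pool: true
	fmt.Println(timesOut(newToyPool(1), 50)) // one token: false
}
```

A caller without a deadline would instead block indefinitely in `Acquire`, which is consistent with the multi-minute waits seen in the goroutine dump.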
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 1b5c41939197efb0fa50ded9795e2e83f5c1dd34:
(mixedversion.go:732).Run: preparing to run step 9: failed to get cluster version for node 2 (mixed-version-tenant-dh3ej): pq: query execution canceled
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1
Parameters:
ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_coverageBuild=false, ROACHTEST_cpu=16, ROACHTEST_encrypted=true, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_runtimeAssertionsBuild=false, ROACHTEST_ssd=0
Same failure on other branches
- #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-23.1 release-blocker]
- #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ becbd0fcdfa2e37a6ff23b33af70f2f91eca0790:
(mixedversion.go:732).Run: preparing to run step 8: failed to get cluster version for node 2 (mixed-version-tenant-rngvu): pq: query execution canceled
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1
Parameters:
ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_coverageBuild=false, ROACHTEST_cpu=16, ROACHTEST_encrypted=true, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_runtimeAssertionsBuild=false, ROACHTEST_ssd=0
Same failure on other branches
- #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-23.1 release-blocker]
- #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 8f5366d09e6cf2144ca43f9cdda7e1128a13fbf8:
(test_runner.go:1339).runTest: test timed out (7h0m0s)
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1
Parameters:
ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_coverageBuild=false, ROACHTEST_cpu=16, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=true, ROACHTEST_runtimeAssertionsBuild=false, ROACHTEST_ssd=0
Same failure on other branches
- #134294 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-23.1.29-rc release-blocker]
- #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-23.1 release-blocker]
- #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 6610d705724a21c836f3521f75972e65d9e9e2d4:
(mixedversion.go:759).Run: preparing to run step 10: failed to get binary version for node 2 (mixed-version-tenant-5j6au): pq: internal error while retrieving user account memberships: operation "get-user-session" timed out after 10.001s (given timeout 10s): internal error while retrieving user account: get auth info error: interrupted during singleflight load-value:authinfo-roachprod-2-2: context deadline exceeded
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1
Parameters:
arch=amd64, cloud=gce, coverageBuild=false, cpu=16, encrypted=true, fs=zfs, localSSD=true, mvtDeploymentMode=separate-process, mvtVersions=v24.1.5 → v24.2.0 → master, runtimeAssertionsBuild=true, ssd=0
Same failure on other branches
- #134294 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-23.1.29-rc release-blocker]
- #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-23.1 release-blocker]
- #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 9927a9a1f0827daa734d5eb718017cf260dfe676:
(mixedversion.go:759).Run: mixed-version test failure while running step 7 (run "load TPCC dataset"): full command output in run_102215.965977430_n3_cockroach-workload-f.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1
Parameters:
arch=amd64, cloud=azure, coverageBuild=false, cpu=16, encrypted=true, fs=ext4, localSSD=true, mvtDeploymentMode=shared-process, mvtVersions=v24.1.6 → v24.2.2 → master, runtimeAssertionsBuild=false, ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
Grafana is not yet available for azure clusters
Same failure on other branches
- #134294 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-23.1.29-rc release-blocker]
- #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-23.1 release-blocker]
- #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 8eeb7f2ae3b2cede564b46ca47e2353fd147c061:
(mixedversion.go:759).Run: mixed-version test failure while running step 5 (run "load TPCC dataset"): full command output in run_102246.608531177_n2_cockroach-workload-f.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1
Parameters:
arch=amd64, cloud=azure, coverageBuild=false, cpu=16, encrypted=true, fs=ext4, localSSD=true, mvtDeploymentMode=system-only, mvtVersions=v23.2.15 → v24.1.5 → v24.2.4 → master, runtimeAssertionsBuild=false, ssd=0
Same failure on other branches
- #134294 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-23.1.29-rc release-blocker]
- #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-23.1 release-blocker]
- #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 8eeb7f2ae3b2cede564b46ca47e2353fd147c061:
(mixedversion.go:759).Run: mixed-version test failure while running step 8 (run "load TPCC dataset"): full command output in run_124909.170759333_n3_cockroach-workload-f.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1
Parameters:
arch=amd64, cloud=gce, coverageBuild=false, cpu=16, encrypted=false, fs=ext4, localSSD=true, mvtDeploymentMode=shared-process, mvtVersions=v24.2.0 → master, runtimeAssertionsBuild=true, ssd=0
Same failure on other branches
- #134294 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-23.1.29-rc release-blocker]
- #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-23.1 release-blocker]
- #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ eb2d2e19eb29d2747d9e267bd0612a69d066adad:
(mixedversion.go:759).Run: mixed-version test failure while running step 7 (run "load TPCC dataset"): full command output in run_101958.568000973_n4_cockroach-workload-f.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/cpu_arch=arm64/run_1
Parameters:
arch=arm64, cloud=azure, coverageBuild=false, cpu=16, encrypted=true, fs=ext4, localSSD=true, mvtDeploymentMode=shared-process, mvtVersions=v24.1.4 → v24.2.4 → master, runtimeAssertionsBuild=false, ssd=0
Same failure on other branches
- #134294 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng release-blocker]
- #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-23.1 release-blocker]
- #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ eb2d2e19eb29d2747d9e267bd0612a69d066adad:
(mixedversion.go:759).Run: mixed-version test failure while running step 5 (run "load TPCC dataset"): full command output in run_135711.628952727_n1_cockroach-workload-f.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1
Parameters:
arch=amd64, cloud=gce, coverageBuild=false, cpu=16, encrypted=false, fs=ext4, localSSD=true, mvtDeploymentMode=system-only, mvtVersions=v24.1.3 → v24.2.0 → master, runtimeAssertionsBuild=true, ssd=0
Same failure on other branches
- #134294 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng release-blocker]
- #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-23.1 release-blocker]
- #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ f717f6bd218121bb5e3376af658545f6bff30c22. A Side-Eye cluster snapshot was captured on timeout: https://app.side-eye.io/#/snapshots/422.
(test_runner.go:1363).runTest: test timed out (7h0m0s)
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1
Parameters:
arch=amd64, cloud=gce, coverageBuild=false, cpu=16, encrypted=true, fs=ext4, localSSD=true, mvtDeploymentMode=separate-process, mvtVersions=v24.1.3 → v24.2.0 → master, runtimeAssertionsBuild=false, ssd=0
Same failure on other branches
- #134294 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng release-blocker]
- #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-23.1 release-blocker]
- #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]
Small potential lead on this. I had some time over the weekend to read a bit of the managed service code for spinning up serverless clusters, mostly trying to figure out if they did anything different that we weren't doing.
I managed to get the test to finish with the following diff on top of Renato's WIP PR:
diff --git a/pkg/cmd/roachtest/roachtestutil/mixedversion/steps.go b/pkg/cmd/roachtest/roachtestutil/mixedversion/steps.go
index eedd1ef23a5..a1a39e5157f 100644
--- a/pkg/cmd/roachtest/roachtestutil/mixedversion/steps.go
+++ b/pkg/cmd/roachtest/roachtestutil/mixedversion/steps.go
@@ -272,6 +272,42 @@ func (s disableSeparateProcessThrottlingStep) Run(
 		"ALTER TENANT %q GRANT CAPABILITY exempt_from_rate_limiting = true",
 		s.virtualClusterName,
 	)
+	if err = h.System.Exec(rng, stmt); err != nil {
+		return err
+	}
+
+	rows, err := h.System.Query(rng, "SHOW TENANTS")
+	if err != nil {
+		return err
+	}
+
+	var tenantID int64
+	for rows.Next() {
+		var name string
+		var dataState string
+		var serviceMode string
+		if err := rows.Scan(&tenantID, &name, &dataState, &serviceMode); err != nil {
+			return err
+		}
+
+		if name == s.virtualClusterName {
+			break
+		}
+	}
+
+	stmt = fmt.Sprintf(
+		"SELECT crdb_internal.update_tenant_resource_limits(%d, %s, %s, %s, now(), 0); ",
+		tenantID, "1000000000000", "1000000000000", "1000000000000",
+	)
+
 	return h.System.Exec(rng, stmt)
 }
diff --git a/pkg/cmd/roachtest/tests/tpcc.go b/pkg/cmd/roachtest/tests/tpcc.go
index a99ea3e6278..a48cf094538 100644
--- a/pkg/cmd/roachtest/tests/tpcc.go
+++ b/pkg/cmd/roachtest/tests/tpcc.go
@@ -470,7 +470,6 @@ func runTPCCMixedHeadroom(ctx context.Context, t test.Test, c cluster.Cluster) {
 	randomNode := c.Node(c.CRDBNodes().SeededRandNode(rng)[0])
 	cmd := roachtestutil.NewCommand("%s workload fixtures import bank", test.DefaultCockroachPath).
 		Arg("{pgurl%s}", randomNode).
-		Flag("payload-bytes", 10240).
 		Flag("rows", bankRows).
 		Flag("seed", 4).
 		Flag("db", "bigbank").
The first change is giving the tenant a lot of tokens in addition to setting exempt_from_rate_limiting = true. Perhaps IMPORT doesn't respect the latter for some reason?
The second change reduces the payload size to the default of 100 bytes and is needed for the bank import to finish. Without it, the bank import still hangs forever. It seems like there is a default RU/Sec limit so maybe this can also be explained by "IMPORT doesn't respect the latter".
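One detail in the first change: as written, the scan loop leaves `tenantID` set to the last row read if no tenant name matches. A more defensive version of the lookup and statement construction might look like the sketch below. `tenantRow`, `findTenantID`, and `resourceLimitsStmt` are hypothetical names introduced here for illustration; the four columns and the `crdb_internal.update_tenant_resource_limits` call shape are taken from the diff itself, not verified independently.

```go
package main

import "fmt"

// tenantRow mirrors the four columns the diff scans from SHOW TENANTS.
type tenantRow struct {
	ID          int64
	Name        string
	DataState   string
	ServiceMode string
}

// findTenantID returns the ID of the named tenant, or false when no row
// matches, so the caller can fail instead of using a wrong ID.
func findTenantID(rows []tenantRow, name string) (int64, bool) {
	for _, r := range rows {
		if r.Name == name {
			return r.ID, true
		}
	}
	return 0, false
}

// resourceLimitsStmt builds the same statement as the diff, passing one
// token value for each of the three numeric arguments.
func resourceLimitsStmt(tenantID, tokens int64) string {
	return fmt.Sprintf(
		"SELECT crdb_internal.update_tenant_resource_limits(%d, %d, %d, %d, now(), 0)",
		tenantID, tokens, tokens, tokens,
	)
}

func main() {
	// Made-up rows for illustration only.
	rows := []tenantRow{
		{ID: 1, Name: "system", DataState: "ready", ServiceMode: "shared"},
		{ID: 3, Name: "mixed-version-tenant-example", DataState: "ready", ServiceMode: "external"},
	}
	id, ok := findTenantID(rows, "mixed-version-tenant-example")
	if !ok {
		fmt.Println("tenant not found")
		return
	}
	fmt.Println(resourceLimitsStmt(id, 1000000000000))
}
```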
Presumably, larger payloads work with serverless? Not sure what else it could be... I wonder if exempt_from_rate_limiting = true is somehow not propagated, thus falling back to kv.tenant_rate_limiter.rate_limit? But even the default value should in theory still work. Not sure if there is a more effective way to debug the "hanging", other than grabbing stack dumps (kill -3) on all nodes, to see where we're being blocked.
@stevendanna @dt Any idea what (else) we might be missing?
@srosenberg for tenant rate limiting questions I'll defer to @andy-kimball.
"other than grabbing stack dumps (kill -3) on all nodes, to see where we're being blocked"
YMMV, but I find side-eye very powerful for diagnosing hangs, particularly when they cross process boundaries, e.g. RPC blocked on a stack on another node.
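For reference, a Go process can also produce the same kind of goroutine dump in-process through the standard `runtime/pprof` package, which is sometimes handy when signalling every node is inconvenient. This is a generic Go sketch, not how roachtest or Side-Eye collect stacks; `goroutineDump` and `countInState` are hypothetical helpers.

```go
package main

import (
	"bytes"
	"fmt"
	"runtime/pprof"
	"strings"
)

// goroutineDump returns the stacks of all goroutines in this process.
// debug=2 asks for the same text form Go prints for an unrecovered panic.
func goroutineDump() string {
	var buf bytes.Buffer
	_ = pprof.Lookup("goroutine").WriteTo(&buf, 2)
	return buf.String()
}

// countInState counts goroutine headers in the dump whose state starts
// with the given word, e.g. "select" or "running".
func countInState(dump, state string) int {
	n := 0
	for _, line := range strings.Split(dump, "\n") {
		if strings.HasPrefix(line, "goroutine ") &&
			strings.Contains(line, " ["+state) {
			n++
		}
	}
	return n
}

func main() {
	dump := goroutineDump()
	fmt.Printf("running goroutines: %d\n", countInState(dump, "running"))
}
```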
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ b6bfe7ba6e74af63bbb7a774fe6e3f96a13eca80:
(mixedversion.go:759).Run: preparing to run step 10: failed to get cluster version for node 2 (mixed-version-tenant-xa2y4): pq: query execution canceled
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/cpu_arch=arm64/run_1
Parameters:
arch=arm64, cloud=gce, coverageBuild=false, cpu=16, encrypted=true, fs=ext4, localSSD=false, mvtDeploymentMode=separate-process, mvtVersions=v24.1.6 → v24.2.0 → master, runtimeAssertionsBuild=false, ssd=0
Same failure on other branches
- #136240 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3.0-rc release-blocker]
- #134294 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng release-blocker]
- #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-23.1 release-blocker]
- #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 7d48198a57f014a8828194b90098699f70f0695a. A Side-Eye cluster snapshot was captured on timeout: https://app.side-eye.io/#/snapshots/432.
(test_runner.go:1363).runTest: test timed out (7h0m0s)
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1
Parameters:
arch=amd64, cloud=gce, coverageBuild=false, cpu=16, encrypted=true, fs=ext4, localSSD=true, mvtDeploymentMode=separate-process, mvtVersions=v24.3.0-rc.1 → master, runtimeAssertionsBuild=false, ssd=0
Same failure on other branches
- #136240 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3.0-rc release-blocker]
- #134294 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng release-blocker]
- #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-23.1 release-blocker]
- #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ bcc993d796d03664604bf695e38fd5644d0bc952:
(mixedversion.go:759).Run: mixed-version test failure while running step 8 (run "load TPCC dataset"): full command output in run_140058.629867095_n1_cockroach-workload-f.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1
Parameters:
arch=amd64, cloud=gce, coverageBuild=false, cpu=16, encrypted=true, fs=ext4, localSSD=true, mvtDeploymentMode=separate-process, mvtVersions=v24.2.4 → v24.3.0-rc.1 → master, runtimeAssertionsBuild=true, ssd=0
Same failure on other branches
- #136429 roachtest: tpcc/mixed-headroom/n5cpu16 failed [B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-testeng branch-release-24.3.1-rc release-blocker]
- #136240 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3.0-rc release-blocker]
- #134294 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng release-blocker]
- #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-23.1 release-blocker]
- #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ bcc993d796d03664604bf695e38fd5644d0bc952:
(mixedversion.go:759).Run: preparing to run step 7: failed to get cluster version for node 3 (mixed-version-tenant-hhnj3): pq: query execution canceled
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1
Parameters:
arch=amd64, cloud=gce, coverageBuild=false, cpu=16, encrypted=true, fs=ext4, localSSD=true, mvtDeploymentMode=separate-process, mvtVersions=v24.3.0-rc.1 → master, runtimeAssertionsBuild=false, ssd=0
Same failure on other branches
- #136429 roachtest: tpcc/mixed-headroom/n5cpu16 failed [B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-testeng branch-release-24.3.1-rc release-blocker]
- #136240 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3.0-rc release-blocker]
- #134294 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng release-blocker]
- #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-23.1 release-blocker]
- #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]
roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ b3fec61ca90095c664f6432af864f18e9946f8bb:
(mixedversion.go:759).Run: preparing to run step 8: failed to get cluster version for node 3 (mixed-version-tenant-u8ylo): pq: query execution canceled
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/cpu_arch=arm64/run_1
Parameters:
arch=arm64, cloud=gce, coverageBuild=false, cpu=16, encrypted=false, fs=ext4, localSSD=false, mvtDeploymentMode=separate-process, mvtVersions=v24.2.2 → v24.3.0-rc.1 → master, runtimeAssertionsBuild=false, ssd=0
Same failure on other branches
- #136429 roachtest: tpcc/mixed-headroom/n5cpu16 failed [B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-testeng branch-release-24.3.1-rc release-blocker]
- #136240 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3.0-rc release-blocker]
- #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]