roachtest: tpcc/mixed-headroom/n5cpu16 failed

Open cockroach-teamcity opened this issue 1 year ago • 10 comments

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 74333311616b937fea6a995462215a1cb5962686:

(test_runner.go:1313).runTest: test timed out (7h0m0s)
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/test-eng

This test on roachdash | Improve this report!

Jira issue: CRDB-42654

cockroach-teamcity avatar Oct 01 '24 19:10 cockroach-teamcity

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ ec2573dc6aaeefc226440bb2c5a7c94a63989868:

(mixedversion.go:737).Run: preparing to run step 10: failed to get cluster version for node 1 (mixed-version-tenant-uryif): pq: query execution canceled
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=zfs
  • ROACHTEST_localSSD=false
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0

cockroach-teamcity avatar Oct 02 '24 12:10 cockroach-teamcity

Removing release blocker. I think I know what's causing this failure; I'll update once I confirm (test update).

renatolabs avatar Oct 02 '24 14:10 renatolabs

An update on this issue: when I sent the message above, I thought that the issue was that the tenant was being rate limited and if we fixed that, the IMPORT would run fine (at a pace comparable to other deployment modes).

However, my attempts to stop rate limiting/throttling have so far been unsuccessful (see #131952, based on an internal thread). Even with those changes, IMPORT is still infinitely slow on the tenant, and there are several log entries like the one below:

W241001 12:53:21.151082 2076 kv/kvclient/kvcoord/dist_sender.go:2193 ⋮ [T3,Vmixed-version-tenant-xhfu8,nsql1,f‹f2b40a2a›,job=1008325319403143169,distsql.gateway=1,distsql.appname=‹$ internal-resume-job-1008325319403143169›] 287 slow range RPC: have been waiting 62.56s (1 attempts) for RPC AddSSTable [/Tenant/3/Table/112/1/‹705›/‹"ƍ>\xea2RH\x00\x80\x00\x00\x00\x01B\xba\xce"›/‹0›,/Tenant/3/Table/112/1/‹711›/‹"\xc8RG0N\x01@\x00\x80\x00\x00\x00\x01E\x9b,"›/‹0›/‹NULL›) to r554:‹/Tenant/3/Table/112/1/7{05/"ƍ>\xea2RH\x00\x80\x00\x00\x00\x01B\xba\xce"-16/"\xc9\xd3\xde_K{H\x00\x80\x00\x00\x00\x01H\r\xeb"}› [(n4,s4):1, (n3,s3):5, (n1,s1):3, next=6, gen=79, sticky=1727790472.278327789,0]; resp: ‹(err: <nil>), *kvpb.AddSSTableResponse›

There's either some other setting to be toggled to fully lift restrictions on the tenant, or something is actually wrong (I find that unlikely, since presumably people are running IMPORTs just fine out there).

TLDR: this one will need more investigation. Once we get to the bottom of this issue, we'll likely be able to enable more tests in separate-process deployments too (#130968), as I suspect that this issue is the same as the one that causes tests to time out running simple queries on the tenant.

renatolabs avatar Oct 04 '24 19:10 renatolabs

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ dcce4cafa234525fc859d32745c11ed87890dc7b:

(mixedversion.go:732).Run: preparing to run step 11: failed to get cluster version for node 1 (mixed-version-tenant-ybexf): pq: query execution canceled
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0

cockroach-teamcity avatar Oct 07 '24 13:10 cockroach-teamcity

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 58c475d67e32b75284b4fe293bff82807c3d129d:

(test_runner.go:1308).runTest: test timed out (7h0m0s)
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0

cockroach-teamcity avatar Oct 09 '24 19:10 cockroach-teamcity

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ fd4b1464dbd6e385c6e51af26fe294fd2023a259:

(mixedversion.go:732).Run: mixed-version test failure while running step 10 (run "load bank dataset"): full command output in run_123824.313069581_n2_cockroach-workload-f.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=true
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0

cockroach-teamcity avatar Oct 10 '24 12:10 cockroach-teamcity

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 30dbb173d0f083b35cf9eb8093832a5dd764c5af:

(test_runner.go:1308).runTest: test timed out (7h0m0s)
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=true
  • ROACHTEST_fs=zfs
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0

cockroach-teamcity avatar Oct 12 '24 21:10 cockroach-teamcity

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 30dbb173d0f083b35cf9eb8093832a5dd764c5af:

(mixedversion.go:732).Run: preparing to run step 11: failed to get cluster version for node 2 (mixed-version-tenant-tyasl): pq: query execution canceled
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=true
  • ROACHTEST_ssd=0

cockroach-teamcity avatar Oct 13 '24 13:10 cockroach-teamcity

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 5be5b0b52ff79b98689b2282a8b25cf9eb50ec40:

(test_runner.go:1308).runTest: test timed out (7h0m0s)
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=true
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0

cockroach-teamcity avatar Oct 16 '24 20:10 cockroach-teamcity

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 472ea07a5232c98536293d13bb46cca59f9f2cd0:

(test_runner.go:1310).runTest: test timed out (7h0m0s)
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=true
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=false
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0

Same failure on other branches

  • #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3 release-blocker]

cockroach-teamcity avatar Oct 20 '24 20:10 cockroach-teamcity

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 472ea07a5232c98536293d13bb46cca59f9f2cd0:

(test_runner.go:1310).runTest: test timed out (7h0m0s)
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=true
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=true
  • ROACHTEST_ssd=0

Same failure on other branches

  • #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3 release-blocker]

cockroach-teamcity avatar Oct 21 '24 20:10 cockroach-teamcity

> An update on this issue: when I sent the message above, I thought that the issue was that the tenant was being rate limited and if we fixed that, the IMPORT would run fine (at a pace comparable to other deployment modes).
>
> However, my attempts to stop rate limiting/throttling have so far been unsuccessful (see #131952, based on an internal thread). Even with those changes, IMPORT is still infinitely slow on the tenant, and there are several log entries like the one below:

It does appear that rate-limiting is the issue. During a manual run, a goroutine dump reveals a large number of quotapool.Acquire calls with waiting times ranging from 2 minutes to 38 minutes,

‹goroutine 253 [select, 10 minutes]:›
‹github.com/cockroachdb/cockroach/pkg/util/quotapool.(*AbstractPool).Acquire(, , , , )›
‹    github.com/cockroachdb/cockroach/pkg/util/quotapool/quotapool.go:281
‹github.com/cockroachdb/cockroach/pkg/ccl/multitenantccl/tenantcostclient.(*limiter).Wait(, , , )›
‹    github.com/cockroachdb/cockroach/pkg/ccl/multitenantccl/tenantcostclient/limiter.go:125
‹github.com/cockroachdb/cockroach/pkg/ccl/multitenantccl/tenantcostclient.(*tenantSideCostController).OnRequestWait(, , )›
‹    github.com/cockroachdb/cockroach/pkg/ccl/multitenantccl/tenantcostclient/tenant_side.go:776
‹github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendToReplicas(, , , , , , , , , )›
‹    github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:2328
‹github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch(, , , , , , , , , , ...)›
‹    github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1920
‹github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges(, , , , , , , , , , ...)›
‹    github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1488
‹github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send(, , , )›
‹    github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1104
‹github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnLockGatekeeper).SendLocked(, , , )›
‹    github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_lock_gatekeeper.go:82
‹github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnMetricRecorder).SendLocked(, , , )›
‹    github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_metric_recorder.go:47
‹github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).sendLockedWithRefreshAttempts(, , , , )›
‹    github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:222
‹github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).SendLocked(, , , )›
‹    github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:150
...

Histogram of waiting times,

count  wait time
16  2 minutes
1  4 minutes
1  6 minutes
5  8 minutes
8  10 minutes
1  34 minutes
3  38 minutes
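
For anyone repeating this analysis, a histogram like the one above can be derived from a goroutine dump with a few lines of Go. This is a minimal sketch, not the tooling used in this thread: `waitHistogram` and its `frame` argument are hypothetical names, and it assumes the dump separates goroutines with blank lines and uses the standard `goroutine N [state, M minutes]:` header.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strconv"
	"strings"
)

// headerRE captures the wait time from a goroutine header such as
// "goroutine 253 [select, 10 minutes]:". The Go runtime only prints the
// duration once a goroutine has been blocked for at least a minute.
var headerRE = regexp.MustCompile(`goroutine \d+ \[[^\],]+, (\d+) minutes`)

// waitHistogram counts goroutines by wait time in minutes, considering only
// goroutines whose stack contains the given frame substring.
func waitHistogram(dump, frame string) map[int]int {
	hist := make(map[int]int)
	// Goroutines in a dump are separated by blank lines.
	for _, stanza := range strings.Split(dump, "\n\n") {
		if !strings.Contains(stanza, frame) {
			continue
		}
		m := headerRE.FindStringSubmatch(stanza)
		if m == nil {
			continue // blocked for less than a minute, or not a goroutine stanza
		}
		minutes, err := strconv.Atoi(m[1])
		if err != nil {
			continue
		}
		hist[minutes]++
	}
	return hist
}

func main() {
	// Hypothetical three-goroutine dump; only two are blocked in Acquire.
	dump := strings.Join([]string{
		"goroutine 253 [select, 10 minutes]:",
		"github.com/cockroachdb/cockroach/pkg/util/quotapool.(*AbstractPool).Acquire(...)",
		"",
		"goroutine 254 [select, 2 minutes]:",
		"github.com/cockroachdb/cockroach/pkg/util/quotapool.(*AbstractPool).Acquire(...)",
		"",
		"goroutine 300 [IO wait, 5 minutes]:",
		"net.(*netFD).Read(...)",
	}, "\n")

	hist := waitHistogram(dump, "quotapool.(*AbstractPool).Acquire")
	keys := make([]int, 0, len(hist))
	for k := range hist {
		keys = append(keys, k)
	}
	sort.Ints(keys) // numeric order, shortest wait first
	for _, k := range keys {
		fmt.Printf("%d  %d minutes\n", hist[k], k)
	}
}
```

Filtering on the stack frame keeps unrelated long-lived goroutines (idle network readers, for example) out of the counts.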

Also, attempts to retrieve an auth cookie via

./cockroach auth-session login --url postgres://roachprod:[email protected]:29000 --certs-dir=certs --only-cookie --expire-after 24h roachprod

timed out,

W241022 21:05:28.507909 97988 sql/user.go:195 ⋮ [T2,Vmixed-version-tenant-kwej1,nsql1,client=10.142.0.123:43302,hostssl,user=‹roachprod›] 501  user membership lookup for ‹"roachprod"› failed: operation ‹"get-user-session"› timed out after 10.001s (given timeout 10s): internal error while retrieving user account: get default settings error: interrupted during singleflight ‹load-value:defaultsettings-roachprod-1-1›: context deadline exceeded
W241022 21:05:28.507958 97988 sql/pgwire/auth.go:159 ⋮ [T2,Vmixed-version-tenant-kwej1,nsql1,client=10.142.0.123:43302,hostssl,user=‹roachprod›] 502  user retrieval failed for user=‹"roachprod"›: internal error while retrieving user account memberships: operation ‹"get-user-session"› timed out after 10.001s (given timeout 10s): internal error while retrieving user account: get default settings error: interrupted during singleflight ‹load-value:defaultsettings-roachprod-1-1›: context deadline exceeded

I'll see if the issue still reproduces without the tenantcostclient limiter.

srosenberg avatar Oct 23 '24 04:10 srosenberg

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 1b5c41939197efb0fa50ded9795e2e83f5c1dd34:

(mixedversion.go:732).Run: preparing to run step 9: failed to get cluster version for node 2 (mixed-version-tenant-dh3ej): pq: query execution canceled
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=true
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0

Same failure on other branches

  • #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-23.1 release-blocker]
  • #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]

cockroach-teamcity avatar Oct 27 '24 14:10 cockroach-teamcity

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ becbd0fcdfa2e37a6ff23b33af70f2f91eca0790:

(mixedversion.go:732).Run: preparing to run step 8: failed to get cluster version for node 2 (mixed-version-tenant-rngvu): pq: query execution canceled
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=true
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0

Same failure on other branches

  • #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-23.1 release-blocker]
  • #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]

cockroach-teamcity avatar Oct 31 '24 12:10 cockroach-teamcity

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 8f5366d09e6cf2144ca43f9cdda7e1128a13fbf8:

(test_runner.go:1339).runTest: test timed out (7h0m0s)
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0

Same failure on other branches

  • #134294 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-23.1.29-rc release-blocker]
  • #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-23.1 release-blocker]
  • #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]

cockroach-teamcity avatar Nov 10 '24 21:11 cockroach-teamcity

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 6610d705724a21c836f3521f75972e65d9e9e2d4:

(mixedversion.go:759).Run: preparing to run step 10: failed to get binary version for node 2 (mixed-version-tenant-5j6au): pq: internal error while retrieving user account memberships: operation "get-user-session" timed out after 10.001s (given timeout 10s): internal error while retrieving user account: get auth info error: interrupted during singleflight load-value:authinfo-roachprod-2-2: context deadline exceeded
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • encrypted=true
  • fs=zfs
  • localSSD=true
  • mvtDeploymentMode=separate-process
  • mvtVersions=v24.1.5 → v24.2.0 → master
  • runtimeAssertionsBuild=true
  • ssd=0

Same failure on other branches

  • #134294 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-23.1.29-rc release-blocker]
  • #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-23.1 release-blocker]
  • #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]

cockroach-teamcity avatar Nov 15 '24 14:11 cockroach-teamcity

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 9927a9a1f0827daa734d5eb718017cf260dfe676:

(mixedversion.go:759).Run: mixed-version test failure while running step 7 (run "load TPCC dataset"): full command output in run_102215.965977430_n3_cockroach-workload-f.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1

Parameters:

  • arch=amd64
  • cloud=azure
  • coverageBuild=false
  • cpu=16
  • encrypted=true
  • fs=ext4
  • localSSD=true
  • mvtDeploymentMode=shared-process
  • mvtVersions=v24.1.6 → v24.2.2 → master
  • runtimeAssertionsBuild=false
  • ssd=0

Grafana is not yet available for azure clusters

Same failure on other branches

  • #134294 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-23.1.29-rc release-blocker]
  • #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-23.1 release-blocker]
  • #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]

cockroach-teamcity avatar Nov 19 '24 10:11 cockroach-teamcity

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 8eeb7f2ae3b2cede564b46ca47e2353fd147c061:

(mixedversion.go:759).Run: mixed-version test failure while running step 5 (run "load TPCC dataset"): full command output in run_102246.608531177_n2_cockroach-workload-f.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1

Parameters:

  • arch=amd64
  • cloud=azure
  • coverageBuild=false
  • cpu=16
  • encrypted=true
  • fs=ext4
  • localSSD=true
  • mvtDeploymentMode=system-only
  • mvtVersions=v23.2.15 → v24.1.5 → v24.2.4 → master
  • runtimeAssertionsBuild=false
  • ssd=0

Grafana is not yet available for azure clusters

Same failure on other branches

  • #134294 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-23.1.29-rc release-blocker]
  • #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-23.1 release-blocker]
  • #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]

cockroach-teamcity avatar Nov 20 '24 10:11 cockroach-teamcity

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 8eeb7f2ae3b2cede564b46ca47e2353fd147c061:

(mixedversion.go:759).Run: mixed-version test failure while running step 8 (run "load TPCC dataset"): full command output in run_124909.170759333_n3_cockroach-workload-f.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • mvtDeploymentMode=shared-process
  • mvtVersions=v24.2.0 → master
  • runtimeAssertionsBuild=true
  • ssd=0

Same failure on other branches

  • #134294 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-23.1.29-rc release-blocker]
  • #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot branch-release-23.1 release-blocker]
  • #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]

cockroach-teamcity avatar Nov 20 '24 12:11 cockroach-teamcity

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ eb2d2e19eb29d2747d9e267bd0612a69d066adad:

(mixedversion.go:759).Run: mixed-version test failure while running step 7 (run "load TPCC dataset"): full command output in run_101958.568000973_n4_cockroach-workload-f.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=azure
  • coverageBuild=false
  • cpu=16
  • encrypted=true
  • fs=ext4
  • localSSD=true
  • mvtDeploymentMode=shared-process
  • mvtVersions=v24.1.4 → v24.2.4 → master
  • runtimeAssertionsBuild=false
  • ssd=0

Grafana is not yet available for azure clusters

Same failure on other branches

  • #134294 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng release-blocker]
  • #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-23.1 release-blocker]
  • #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]

cockroach-teamcity avatar Nov 21 '24 10:11 cockroach-teamcity

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ eb2d2e19eb29d2747d9e267bd0612a69d066adad:

(mixedversion.go:759).Run: mixed-version test failure while running step 5 (run "load TPCC dataset"): full command output in run_135711.628952727_n1_cockroach-workload-f.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • mvtDeploymentMode=system-only
  • mvtVersions=v24.1.3 → v24.2.0 → master
  • runtimeAssertionsBuild=true
  • ssd=0

Same failure on other branches

  • #134294 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng release-blocker]
  • #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-23.1 release-blocker]
  • #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]

cockroach-teamcity avatar Nov 21 '24 13:11 cockroach-teamcity

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ f717f6bd218121bb5e3376af658545f6bff30c22. A Side-Eye cluster snapshot was captured on timeout: https://app.side-eye.io/#/snapshots/422.

(test_runner.go:1363).runTest: test timed out (7h0m0s)
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • encrypted=true
  • fs=ext4
  • localSSD=true
  • mvtDeploymentMode=separate-process
  • mvtVersions=v24.1.3 → v24.2.0 → master
  • runtimeAssertionsBuild=false
  • ssd=0

Same failure on other branches

  • #134294 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng release-blocker]
  • #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-23.1 release-blocker]
  • #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]

cockroach-teamcity avatar Nov 25 '24 20:11 cockroach-teamcity

Small potential lead on this. I had some time over the weekend to read a bit of the managed service code for spinning up serverless clusters, mostly trying to figure out if they did anything different that we weren't doing.

I managed to get the test to finish with the following diff on top of Renato's WIP PR:

diff --git a/pkg/cmd/roachtest/roachtestutil/mixedversion/steps.go b/pkg/cmd/roachtest/roachtestutil/mixedversion/steps.go
index eedd1ef23a5..a1a39e5157f 100644
--- a/pkg/cmd/roachtest/roachtestutil/mixedversion/steps.go
+++ b/pkg/cmd/roachtest/roachtestutil/mixedversion/steps.go
@@ -272,6 +272,42 @@ func (s disableSeparateProcessThrottlingStep) Run(
 		"ALTER TENANT %q GRANT CAPABILITY exempt_from_rate_limiting = true",
 		s.virtualClusterName,
 	)
+	if err = h.System.Exec(rng, stmt); err != nil {
+		return err
+	}
+
+	rows, err := h.System.Query(rng, "SHOW TENANTS")
+	if err != nil {
+		return err
+	}
+
+	var tenantID int64
+	for rows.Next() {
+		var name string
+		var dataState string
+		var serviceMode string
+		if err := rows.Scan(&tenantID, &name, &dataState, &serviceMode); err != nil {
+			return err
+		}
+
+		if name == s.virtualClusterName {
+			break
+		}
+	}
+
+	stmt = fmt.Sprintf(
+		"SELECT crdb_internal.update_tenant_resource_limits(%d, %s, %s, %s, now(), 0); ",
+		tenantID, "1000000000000", "1000000000000", "1000000000000",
+	)
+
 	return h.System.Exec(rng, stmt)
 }
 
diff --git a/pkg/cmd/roachtest/tests/tpcc.go b/pkg/cmd/roachtest/tests/tpcc.go
index a99ea3e6278..a48cf094538 100644
--- a/pkg/cmd/roachtest/tests/tpcc.go
+++ b/pkg/cmd/roachtest/tests/tpcc.go
@@ -470,7 +470,6 @@ func runTPCCMixedHeadroom(ctx context.Context, t test.Test, c cluster.Cluster) {
 		randomNode := c.Node(c.CRDBNodes().SeededRandNode(rng)[0])
 		cmd := roachtestutil.NewCommand("%s workload fixtures import bank", test.DefaultCockroachPath).
 			Arg("{pgurl%s}", randomNode).
-			Flag("payload-bytes", 10240).
 			Flag("rows", bankRows).
 			Flag("seed", 4).
 			Flag("db", "bigbank").

The first change is giving the tenant a lot of tokens in addition to setting exempt_from_rate_limiting = true. Perhaps IMPORT doesn't respect the latter for some reason?
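
One detail worth noting about the lookup loop in the diff: if no row matches `s.virtualClusterName`, `tenantID` keeps the ID of the last row scanned, so the resource-limit update would silently target a different tenant. A minimal sketch of a stricter lookup follows. It assumes the four `SHOW TENANTS` columns have already been scanned into a slice; `tenantRow` and `findTenantID` are hypothetical names, not part of the roachtest code.

```go
package main

import "fmt"

// tenantRow mirrors the four columns the diff scans from SHOW TENANTS.
// The struct itself is hypothetical and exists only for this sketch.
type tenantRow struct {
	ID          int64
	Name        string
	DataState   string
	ServiceMode string
}

// findTenantID returns the ID of the tenant with the given name, or an
// error if no such tenant exists, instead of falling through with the ID
// of whichever row happened to be scanned last.
func findTenantID(rows []tenantRow, name string) (int64, error) {
	for _, r := range rows {
		if r.Name == name {
			return r.ID, nil
		}
	}
	return 0, fmt.Errorf("tenant %q not found in SHOW TENANTS output", name)
}

func main() {
	// Hypothetical rows; the names and states are illustrative only.
	rows := []tenantRow{
		{ID: 1, Name: "system", DataState: "ready", ServiceMode: "shared"},
		{ID: 3, Name: "mixed-version-tenant-xhfu8", DataState: "ready", ServiceMode: "external"},
	}

	id, err := findTenantID(rows, "mixed-version-tenant-xhfu8")
	fmt.Println(id, err)

	_, err = findTenantID(rows, "no-such-tenant")
	fmt.Println(err)
}
```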

The second change reduces the payload size to the default of 100 bytes and is needed for the bank import to finish. Without it, the bank import still hangs forever. It seems like there is a default RU/Sec limit so maybe this can also be explained by "IMPORT doesn't respect the latter".
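
To make the RU/sec hypothesis concrete, here is a back-of-the-envelope sketch of how long a single request waits in a generic token bucket. It is not the `tenantcostclient` limiter, and the token costs and refill rates are made-up numbers; the only assumption is that a request's cost grows with the bytes it writes, which would explain why 10240-byte payloads stall where 100-byte payloads finish.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// waitFor returns how long a request costing `cost` tokens must wait in a
// token bucket that currently holds `available` tokens and refills at
// `ratePerSec` tokens per second. This is generic token-bucket arithmetic.
func waitFor(cost, available, ratePerSec float64) time.Duration {
	if cost <= available {
		return 0 // enough tokens on hand, no wait
	}
	if ratePerSec <= 0 {
		return time.Duration(math.MaxInt64) // the bucket never refills
	}
	deficit := cost - available
	return time.Duration(deficit / ratePerSec * float64(time.Second))
}

func main() {
	// Hypothetical numbers: the same refill rate, two request costs.
	const rate = 100.0 // tokens per second
	fmt.Println("small request:", waitFor(50, 100, rate))
	fmt.Println("large request:", waitFor(60000, 100, rate))
}
```

With a fixed refill rate, the wait grows linearly with the request's cost, so a modest per-second limit is enough to turn a large bulk write into a multi-minute stall.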

DarrylWong avatar Nov 26 '24 00:11 DarrylWong

> The second change reduces the payload size to the default of 100 bytes and is needed for the bank import to finish. Without it, the bank import still hangs forever. It seems like there is a default RU/Sec limit so maybe this can also be explained by "IMPORT doesn't respect the latter".

Presumably, larger payloads work with serverless? Not sure what else it could be... I wonder if exempt_from_rate_limiting = true is somehow not propagated, thus falling back to kv.tenant_rate_limiter.rate_limit? But even the default value should in theory still work. Not sure if there is a more effective way to debug the "hanging", other than grabbing stack dumps (kill -3) on all nodes, to see where we're being blocked.

@stevendanna @dt Any idea what (else) we might be missing?

srosenberg avatar Nov 26 '24 01:11 srosenberg

@srosenberg for tenant rate limiting questions I'll defer to @andy-kimball.

> other than grabbing stack dumps (kill -3) on all nodes, to see where we're being blocked

YMMV, but I find side-eye very powerful for diagnosing hangs, particularly when they cross process boundaries, e.g. RPC blocked on a stack on another node.

dt avatar Nov 26 '24 13:11 dt

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ b6bfe7ba6e74af63bbb7a774fe6e3f96a13eca80:

(mixedversion.go:759).Run: preparing to run step 10: failed to get cluster version for node 2 (mixed-version-tenant-xa2y4): pq: query execution canceled
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • encrypted=true
  • fs=ext4
  • localSSD=false
  • mvtDeploymentMode=separate-process
  • mvtVersions=v24.1.6 → v24.2.0 → master
  • runtimeAssertionsBuild=false
  • ssd=0

Same failure on other branches

  • #136240 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3.0-rc release-blocker]
  • #134294 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng release-blocker]
  • #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-23.1 release-blocker]
  • #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]

cockroach-teamcity avatar Nov 27 '24 01:11 cockroach-teamcity

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ 7d48198a57f014a8828194b90098699f70f0695a. A Side-Eye cluster snapshot was captured on timeout: https://app.side-eye.io/#/snapshots/432.

(test_runner.go:1363).runTest: test timed out (7h0m0s)
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • encrypted=true
  • fs=ext4
  • localSSD=true
  • mvtDeploymentMode=separate-process
  • mvtVersions=v24.3.0-rc.1 → master
  • runtimeAssertionsBuild=false
  • ssd=0

Same failure on other branches

  • #136240 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3.0-rc release-blocker]
  • #134294 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng release-blocker]
  • #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-23.1 release-blocker]
  • #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]

cockroach-teamcity avatar Nov 29 '24 21:11 cockroach-teamcity

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ bcc993d796d03664604bf695e38fd5644d0bc952:

(mixedversion.go:759).Run: mixed-version test failure while running step 8 (run "load TPCC dataset"): full command output in run_140058.629867095_n1_cockroach-workload-f.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • encrypted=true
  • fs=ext4
  • localSSD=true
  • mvtDeploymentMode=separate-process
  • mvtVersions=v24.2.4 → v24.3.0-rc.1 → master
  • runtimeAssertionsBuild=true
  • ssd=0

Same failure on other branches

  • #136429 roachtest: tpcc/mixed-headroom/n5cpu16 failed [B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-testeng branch-release-24.3.1-rc release-blocker]
  • #136240 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3.0-rc release-blocker]
  • #134294 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng release-blocker]
  • #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-23.1 release-blocker]
  • #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]

cockroach-teamcity avatar Nov 30 '24 14:11 cockroach-teamcity

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ bcc993d796d03664604bf695e38fd5644d0bc952:

(mixedversion.go:759).Run: preparing to run step 7: failed to get cluster version for node 3 (mixed-version-tenant-hhnj3): pq: query execution canceled
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • encrypted=true
  • fs=ext4
  • localSSD=true
  • mvtDeploymentMode=separate-process
  • mvtVersions=v24.3.0-rc.1 → master
  • runtimeAssertionsBuild=false
  • ssd=0

Same failure on other branches

  • #136429 roachtest: tpcc/mixed-headroom/n5cpu16 failed [B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-testeng branch-release-24.3.1-rc release-blocker]
  • #136240 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3.0-rc release-blocker]
  • #134294 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng release-blocker]
  • #133350 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-23.1 release-blocker]
  • #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]

cockroach-teamcity avatar Dec 01 '24 07:12 cockroach-teamcity

roachtest.tpcc/mixed-headroom/n5cpu16 failed with artifacts on master @ b3fec61ca90095c664f6432af864f18e9946f8bb:

(mixedversion.go:759).Run: preparing to run step 8: failed to get cluster version for node 3 (mixed-version-tenant-u8ylo): pq: query execution canceled
test artifacts and logs in: /artifacts/tpcc/mixed-headroom/n5cpu16/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • encrypted=false
  • fs=ext4
  • localSSD=false
  • mvtDeploymentMode=separate-process
  • mvtVersions=v24.2.2 → v24.3.0-rc.1 → master
  • runtimeAssertionsBuild=false
  • ssd=0

Same failure on other branches

  • #136429 roachtest: tpcc/mixed-headroom/n5cpu16 failed [B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-testeng branch-release-24.3.1-rc release-blocker]
  • #136240 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3.0-rc release-blocker]
  • #133007 roachtest: tpcc/mixed-headroom/n5cpu16 failed [C-test-failure O-roachtest O-robot T-testeng branch-release-24.3]

cockroach-teamcity avatar Dec 03 '24 13:12 cockroach-teamcity