roachtest: perturbation/metamorphic/addNode failed
roachtest.perturbation/metamorphic/addNode failed with artifacts on master @ 0c0af9540ed3f9d63eba523bc870eeb6c7eebe90:
(assertions.go:363).Fail:
Error Trace: github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/admission_control_latency.go:734
main/pkg/cmd/roachtest/test_runner.go:1279
src/runtime/asm_amd64.s:1695
Error: Should be true
Test: perturbation/metamorphic/addNode
Messages: FAILURE: follower-read : Increase 4.7381 > 3.0000 BASE: 6.569274ms SCORE: 31.125553ms
FAILURE: read : Increase 4.6191 > 3.0000 BASE: 6.452233ms SCORE: 29.803462ms
FAILURE: write : Increase 4.1497 > 3.0000 BASE: 25.617342ms SCORE: 106.303619ms
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/metamorphic/addNode/run_1
Parameters:
ROACHTEST_arch=amd64 ROACHTEST_cloud=gce ROACHTEST_coverageBuild=false ROACHTEST_cpu=32 ROACHTEST_encrypted=false ROACHTEST_fs=ext4 ROACHTEST_localSSD=true ROACHTEST_runtimeAssertionsBuild=false ROACHTEST_ssd=1
Jira issue: CRDB-42717
Looks legitimate given the thresholds, P99.99 SQL latency does spike extremely high at one point.
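To make the failure lines concrete, here's a minimal sketch of the check I believe admission_control_latency.go is applying: the latency score measured around the perturbation is divided by the baseline, and the run fails when the ratio exceeds the allowed increase (3.0 in this run). This is not the actual roachtest code; the function and variable names are illustrative.

```go
// Minimal sketch, assuming the pass/fail rule is a simple ratio check;
// names are illustrative, not the real identifiers.
package main

import (
	"fmt"
	"time"
)

// checkIncrease fails the run if score/base exceeds maxIncrease.
func checkIncrease(op string, base, score time.Duration, maxIncrease float64) error {
	increase := float64(score) / float64(base)
	if increase > maxIncrease {
		return fmt.Errorf("FAILURE: %s : Increase %.4f > %.4f BASE: %s SCORE: %s",
			op, increase, maxIncrease, base, score)
	}
	return nil
}

func main() {
	// Reproduces the follower-read line from the report above.
	err := checkIncrease("follower-read", 6569274*time.Nanosecond, 31125553*time.Nanosecond, 3.0)
	fmt.Println(err)
}
```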
Still, I wonder: is it just pushing these nodes' disks too hard? The latency is all from disk.
i.e., I don't think the provisioning could possibly handle this.
This is probably a case for the snapshot bandwidth integration, cc @aadityasondhi.
@andrewbaptist should we bump these thresholds again to 5x for metamorphic?
This was an interesting metamorphic configuration:
2024/10/03 14:16:17 admission_control_latency.go:623: test variations are: seed: -339070793733502206, fillDuration: 10m0s, maxBlockBytes: 4096, perturbationDuration: 10s, validationDuration: 5m0s, ratioOfMax: 0.500000, splits: 1, numNodes: 12, numWorkloadNodes: 1, partitionSite: true, vcpu: 32, disks: 1, leaseType: epoch, cloud: gce
12 nodes, 32 vCPU, 1 local disk, 4K value size, no pre-splits.
So this configuration would push the disk pretty hard. I'm reluctant to raise the pass criteria, since it's good for the test to explore new configurations.
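For easier skimming, here's a rough transcription of that "test variations" log line into a struct. The field names mirror the log keys but aren't necessarily the real identifiers in admission_control_latency.go, and the comments are just my reading of each knob.

```go
// Sketch of the metamorphic knobs from the logged configuration; field names
// follow the log keys, comments are my interpretation of them.
package main

import (
	"fmt"
	"time"
)

type testVariations struct {
	seed                 int64
	fillDuration         time.Duration // initial data-load phase
	maxBlockBytes        int           // workload value size
	perturbationDuration time.Duration // how long the perturbation (node add) phase runs
	validationDuration   time.Duration // window over which latency scores are collected
	ratioOfMax           float64       // fraction of max throughput used for the steady workload
	splits               int           // pre-splits of the workload table
	numNodes             int
	numWorkloadNodes     int
	partitionSite        bool
	vcpu                 int
	disks                int
	leaseType            string
	cloud                string
}

func main() {
	v := testVariations{
		seed:                 -339070793733502206,
		fillDuration:         10 * time.Minute,
		maxBlockBytes:        4096,
		perturbationDuration: 10 * time.Second,
		validationDuration:   5 * time.Minute,
		ratioOfMax:           0.5,
		splits:               1,
		numNodes:             12,
		numWorkloadNodes:     1,
		partitionSite:        true,
		vcpu:                 32,
		disks:                1,
		leaseType:            "epoch",
		cloud:                "gce",
	}
	fmt.Printf("%+v\n", v)
}
```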
Let's leave this open and watch this test a little longer to see whether other alternatives emerge.
Looking briefly at one of the traces, I think the slow write is due to a split:
1.094ms 0.080ms event:kv/kvserver/spanlatch/manager.go:579 [n7,s7,r192/5:/Table/106/1/-7{98210…-84942…}] waiting to acquire read latch /Local/Range/Table/106/1/-7982100710300276191/RangeDescriptor@0,0 for request Put [/Table/106/1/-7869379767752149903/0], EndTxn(commit) [/Table/106/1/-7869379767752149903/0], [txn: 5ec3635d], [can-forward-ts], held by write latch /Local/Range/Table/106/1/-7982100710300276191/RangeDescriptor@0,0 for request ConditionalPut [/Local/Range/Table/106/1/-7982100710300276191/RangeDescriptor], [txn: e45553a2]
1579.694ms 1578.600ms event:kv/kvserver/concurrency/concurrency_manager.go:334 [n7,s7,r192/5:/Table/106/1/-7{98210…-84942…}] scanning lock table for conflicting locks
1579.697ms 0.004ms === operation:executeWriteBatch _verbose:1 node:7 store:7 range:192/5:/Table/106/1/-7{98210…-84942…}
Reading the deltas, the Put waited ~1.58s to acquire its read latch on the range descriptor, which was held by the split txn's ConditionalPut. Maybe the split is still slow under an overloaded disk.
roachtest.perturbation/metamorphic/addNode failed with artifacts on master @ fd4b1464dbd6e385c6e51af26fe294fd2023a259:
(assertions.go:363).Fail:
Error Trace: github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/admission_control_latency.go:732
main/pkg/cmd/roachtest/test_runner.go:1279
src/runtime/asm_amd64.s:1695
Error: Should be true
Test: perturbation/metamorphic/addNode
Messages: FAILURE: write : Increase 4.3916 > 3.0000 BASE: 13.128903ms SCORE: 57.657366ms
FAILURE: write : Increase 6.8382 > 3.0000 BASE: 13.128903ms SCORE: 89.778156ms
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/metamorphic/addNode/run_1
Parameters:
ROACHTEST_arch=amd64 ROACHTEST_cloud=gce ROACHTEST_coverageBuild=false ROACHTEST_cpu=32 ROACHTEST_encrypted=false ROACHTEST_fs=ext4 ROACHTEST_localSSD=true ROACHTEST_runtimeAssertionsBuild=false ROACHTEST_ssd=1
roachtest.perturbation/metamorphic/addNode failed with artifacts on master @ 30dbb173d0f083b35cf9eb8093832a5dd764c5af:
(assertions.go:363).Fail:
Error Trace: github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/admission_control_latency.go:838
main/pkg/cmd/roachtest/test_runner.go:1279
src/runtime/asm_amd64.s:1700
Error: Should be true
Test: perturbation/metamorphic/addNode
Messages: FAILURE: write : Increase 3.8275 > 3.0000 BASE: 30.777004ms SCORE: 117.80036ms
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/metamorphic/addNode/run_1
Parameters:
ROACHTEST_arch=amd64 ROACHTEST_cloud=gce ROACHTEST_coverageBuild=false ROACHTEST_cpu=4 ROACHTEST_encrypted=false ROACHTEST_fs=ext4 ROACHTEST_localSSD=true ROACHTEST_runtimeAssertionsBuild=false ROACHTEST_ssd=1
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.perturbation/metamorphic/addNode failed with artifacts on master @ 472ea07a5232c98536293d13bb46cca59f9f2cd0:
(assertions.go:363).Fail:
Error Trace: github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/admission_control_latency.go:851
main/pkg/cmd/roachtest/test_runner.go:1281
src/runtime/asm_amd64.s:1695
Error: Should be true
Test: perturbation/metamorphic/addNode
Messages: FAILURE: follower-read : Increase 522204.4221 > 5.0000 BASE: 6.893852ms SCORE: 1h0m0s
FAILURE: read : Increase 530359.0015 > 5.0000 BASE: 6.787855ms SCORE: 1h0m0s
FAILURE: write : Increase 225838.0143 > 5.0000 BASE: 15.940629ms SCORE: 1h0m0s
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/metamorphic/addNode/run_1
Parameters:
ROACHTEST_arch=amd64 ROACHTEST_cloud=gce ROACHTEST_coverageBuild=false ROACHTEST_cpu=32 ROACHTEST_encrypted=false ROACHTEST_fs=ext4 ROACHTEST_localSSD=true ROACHTEST_runtimeAssertionsBuild=true ROACHTEST_ssd=1
This test had been stable after changing the pass criteria to 5.0, until the recent failure. I'm adding gctrace to the tests to determine whether it's related or rule it out. See this Slack discussion for details: https://cockroachlabs.slack.com/archives/C06UFBJ743F/p1729529769923819
This feels a little like an infra-flake, but there isn't enough to say so conclusively. I'll continue to monitor and leave this as a P-3.
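For what it's worth, a quick back-of-the-envelope check (mine, not from the artifacts): the reported increases are exactly what you get if the score pegs at the full 1h measurement window, i.e. the workload appears to have made no progress at all rather than merely slowing down, which fits the infra-flake theory.

```go
// Hypothetical re-derivation of the 522204x "increase" from the report above:
// a SCORE of exactly 1h0m0s divided by the 6.893852ms baseline gives the same
// number, which is what you'd expect if requests never completed during the
// window at all.
package main

import (
	"fmt"
	"time"
)

func main() {
	base := 6893852 * time.Nanosecond // follower-read BASE from the report
	score := time.Hour                // reported SCORE: 1h0m0s
	fmt.Printf("increase: %.4f\n", float64(score)/float64(base)) // ≈ 522204.4221
}
```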
For reference (in case logs are gone), the failure on 10/10 had the following configuration:
2024/10/10 12:55:15 admission_control_latency.go:621: test variations are: seed: 2236907502422249199, fillDuration: 10m0s, maxBlockBytes: 4096, perturbationDuration: 10s, validationDuration: 5m0s, ratioOfMax: 0.500000, splits: 10000, numNodes: 5, numWorkloadNodes: 1, partitionSite: true, vcpu: 32, disks: 1, leaseType: epoch, cloud: gce
Closing now that this has been stable for another ~10 runs.