cockroach icon indicating copy to clipboard operation
cockroach copied to clipboard

roachtest: perturbation/metamorphic/addNode failed

Open cockroach-teamcity opened this issue 1 year ago • 7 comments

roachtest.perturbation/metamorphic/addNode failed with artifacts on master @ 0c0af9540ed3f9d63eba523bc870eeb6c7eebe90:

(assertions.go:363).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/admission_control_latency.go:734
	            				main/pkg/cmd/roachtest/test_runner.go:1279
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/metamorphic/addNode
	Messages:   	FAILURE: follower-read  : Increase 4.7381 > 3.0000 BASE: 6.569274ms SCORE: 31.125553ms
	            	
	            	FAILURE: read           : Increase 4.6191 > 3.0000 BASE: 6.452233ms SCORE: 29.803462ms
	            	
	            	FAILURE: write          : Increase 4.1497 > 3.0000 BASE: 25.617342ms SCORE: 106.303619ms
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/metamorphic/addNode/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=32
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=1
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-42717

cockroach-teamcity avatar Oct 03 '24 14:10 cockroach-teamcity

image Looks legitimate given the thresholds, P99.99 SQL latency does spike extremely high at one point.

kvoli avatar Oct 03 '24 15:10 kvoli

I wonder though, is it just pushing these nodes' disk too hard? The latency is all from disk:

image

i.e., I don't think the provisioning could possibly handle this:

image

This is probably a case for the snapshot bandwidth integration cc @aadityasondhi.

@andrewbaptist should we bump these again for metamorphic to 5x?

kvoli avatar Oct 03 '24 15:10 kvoli

This was an interesting metamorphic configuration:

2024/10/03 14:16:17 admission_control_latency.go:623: test variations are: seed: -339070793733502206, fillDuration: 10m0s, maxBlockBytes: 4096, perturbationDuration: 10s, validationDuration: 5m0s, ratioOfMax: 0.500000, splits: 1, numNodes: 12, numWorkloadNodes: 1, partitionSite: true, vcpu: 32, disks: 1, leaseType: epoch, cloud: gce

12 nodes 32 vCPU 1 local disk 4K value size no pre-splits

So this configuration would push this disk pretty hard. I'm reluctant to increase the passing criteria as it is good for it to explore new configurations.

Lets leave this open and watch this test a little longer to see if there are other alternatives.

andrewbaptist avatar Oct 03 '24 16:10 andrewbaptist

Looking briefly at one of the traces, I think the slow write is due to a split:

1.094ms      0.080ms                                event:kv/kvserver/spanlatch/manager.go:579 [n7,s7,r192/5:/Table/106/1/-7{98210…-84942…}] waiting to acquire read latch /Local/Range/Table/106/1/-7982100710300276191/RangeDescriptor@0,0 for request Put [/Table/106/1/-7869379767752149903/0], EndTxn(commit) [/Table/106/1/-7869379767752149903/0], [txn: 5ec3635d], [can-forward-ts], held by write latch /Local/Range/Table/106/1/-7982100710300276191/RangeDescriptor@0,0 for request ConditionalPut [/Local/Range/Table/106/1/-7982100710300276191/RangeDescriptor], [txn: e45553a2]
1579.694ms   1578.600ms                                event:kv/kvserver/concurrency/concurrency_manager.go:334 [n7,s7,r192/5:/Table/106/1/-7{98210…-84942…}] scanning lock table for conflicting locks
1579.697ms      0.004ms                                    === operation:executeWriteBatch _verbose:1 node:7 store:7 range:192/5:/Table/106/1/-7{98210…-84942…}

Maybe this is still slow during a overloaded disk.

andrewbaptist avatar Oct 03 '24 16:10 andrewbaptist

roachtest.perturbation/metamorphic/addNode failed with artifacts on master @ fd4b1464dbd6e385c6e51af26fe294fd2023a259:

(assertions.go:363).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/admission_control_latency.go:732
	            				main/pkg/cmd/roachtest/test_runner.go:1279
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/metamorphic/addNode
	Messages:   	FAILURE: write          : Increase 4.3916 > 3.0000 BASE: 13.128903ms SCORE: 57.657366ms
	            	
	            	FAILURE: write          : Increase 6.8382 > 3.0000 BASE: 13.128903ms SCORE: 89.778156ms
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/metamorphic/addNode/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=32
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=1
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

cockroach-teamcity avatar Oct 10 '24 13:10 cockroach-teamcity

roachtest.perturbation/metamorphic/addNode failed with artifacts on master @ 30dbb173d0f083b35cf9eb8093832a5dd764c5af:

(assertions.go:363).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/admission_control_latency.go:838
	            				main/pkg/cmd/roachtest/test_runner.go:1279
	            				src/runtime/asm_amd64.s:1700
	Error:      	Should be true
	Test:       	perturbation/metamorphic/addNode
	Messages:   	FAILURE: write          : Increase 3.8275 > 3.0000 BASE: 30.777004ms SCORE: 117.80036ms
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/metamorphic/addNode/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=4
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=1
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

cockroach-teamcity avatar Oct 12 '24 14:10 cockroach-teamcity

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.perturbation/metamorphic/addNode failed with artifacts on master @ 472ea07a5232c98536293d13bb46cca59f9f2cd0:

(assertions.go:363).Fail: 
	Error Trace:	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/admission_control_latency.go:851
	            				main/pkg/cmd/roachtest/test_runner.go:1281
	            				src/runtime/asm_amd64.s:1695
	Error:      	Should be true
	Test:       	perturbation/metamorphic/addNode
	Messages:   	FAILURE: follower-read  : Increase 522204.4221 > 5.0000 BASE: 6.893852ms SCORE: 1h0m0s
	            	
	            	FAILURE: read           : Increase 530359.0015 > 5.0000 BASE: 6.787855ms SCORE: 1h0m0s
	            	
	            	FAILURE: write          : Increase 225838.0143 > 5.0000 BASE: 15.940629ms SCORE: 1h0m0s
(require.go:1950).True: FailNow called
test artifacts and logs in: /artifacts/perturbation/metamorphic/addNode/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=32
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=true
  • ROACHTEST_ssd=1
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

cockroach-teamcity avatar Oct 20 '24 16:10 cockroach-teamcity

This test has been stable after chaging the pass criteria to 5.0 until the recent failure. I'm adding gctrace to the tests to be able to determine if it is related or rule it out. See this slack discussion for details. https://cockroachlabs.slack.com/archives/C06UFBJ743F/p1729529769923819

This feels a little like an infra-flake, but there isn't enough to say so conclusively. I'll continue to monitor and leave this as a P-3.

andrewbaptist avatar Oct 21 '24 18:10 andrewbaptist

For reference (in case logs are gone), the failure on 10/10 had the following configuration:

2024/10/10 12:55:15 admission_control_latency.go:621: test variations are: seed: 2236907502422249199, fillDuration: 10m0s, maxBlockBytes: 4096, perturbationDuration: 10s, validationDuration: 5m0s, ratioOfMax: 0.500000, splits: 10000, numNodes: 5, numWorkloadNodes: 1, partitionSite: true, vcpu: 32, disks: 1, leaseType: epoch, cloud: gce

andrewbaptist avatar Oct 21 '24 18:10 andrewbaptist

Closing now that this has been stable for another ~10 runs.

andrewbaptist avatar Oct 28 '24 18:10 andrewbaptist