roachtest: c2c/shutdown/dest/coordinator failed
roachtest.c2c/shutdown/dest/coordinator failed with artifacts on release-24.1 @ e5b1d125bf8cde9b4d47f4303b8a76ec735ca082:
(latency_verifier.go:198).assertValid: max latency was more than allowed: 2m0.290093216s vs 2m0s
test artifacts and logs in: /artifacts/c2c/shutdown/dest/coordinator/run_1
Parameters:
- ROACHTEST_arch=amd64
- ROACHTEST_cloud=gce
- ROACHTEST_coverageBuild=false
- ROACHTEST_cpu=8
- ROACHTEST_encrypted=false
- ROACHTEST_fs=ext4
- ROACHTEST_localSSD=true
- ROACHTEST_runtimeAssertionsBuild=false
- ROACHTEST_ssd=0
Same failure on other branches
- #131165 roachtest: c2c/shutdown/dest/coordinator failed [C-test-failure O-roachtest O-robot P-3 T-disaster-recovery branch-release-24.2]
- #128742 roachtest: c2c/shutdown/dest/coordinator failed [C-test-failure O-roachtest O-robot P-3 T-disaster-recovery branch-release-24.1.3-rc]
Jira issue: CRDB-43245
roachtest.c2c/shutdown/dest/coordinator failed with artifacts on release-24.1 @ 39bae3f4961c14c890d6140e9268d2fbf0ca324a:
(latency_verifier.go:198).assertValid: max latency was more than allowed: 2m0.852996135s vs 2m0s
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/c2c/shutdown/dest/coordinator/run_1
Parameters:
- ROACHTEST_arch=amd64
- ROACHTEST_cloud=gce
- ROACHTEST_coverageBuild=false
- ROACHTEST_cpu=8
- ROACHTEST_encrypted=false
- ROACHTEST_fs=ext4
- ROACHTEST_localSSD=true
- ROACHTEST_runtimeAssertionsBuild=false
- ROACHTEST_ssd=0
There may actually be smoke here. The max latency from this shutdown isn't necessarily just a hair over 2 minutes; rather, the latency verifier exits as soon as it sees latency above 2 minutes, so the reported value is only the first sample past the limit.
To construct a timeline of this test:
- 6:47:47 node shutdown occurs on node 7
06:47:48 monitor.go:203: Monitor event: n7: error: no cockroach processes running
- 6:48:46 node 5 picks up the job, i.e. about a minute after node 7 went down and took the job with it. The base adoption interval is 30 seconds, so I'm not sure why a node didn't pick up the job after the first 30 seconds had passed.
5.unredacted/cockroach.log:I241016 06:48:46.980619 25353 ccl/streamingccl/streamingest/stream_ingestion_job.go:408 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=1012501106125406211›] 520 empty cutover time, no revert required
5.unredacted/cockroach.log:I241016 06:48:46.986365 25353 ccl/streamingccl/streamingest/stream_ingestion_job.go:93 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=1012501106125406211›] 521 resuming stream (producer job 1012501106433556484) from 1729061216.168387056,0
- Furthermore, by the time node 5 picks up the job, the replication lag is already at 1m47s, so the stream only has 13 seconds to complete its catchup scans before the latency verifier fails.
06:48:43 latency_verifier.go:140: test status: watching stream-ingestion: end-to-end steady latency 1m47.788s; max steady latency so far 1m47.788s; highwater 2024-10-16 06:46:56.168387056 +0000 UTC
Given that cutover succeeded, the stream was able to catch up after the node shutdown.
Next steps, I think, are:
- understand why the job wasn't adopted after one 30s adoption interval
- consider bumping the max latency in the test or changing how the latency verifier works.
Aha! The default session liveness TTL is 40 seconds, so after the node was SIGKILLed, its session had not yet expired when the first 30 second adoption loop ran, and the job could not be adopted until the second adoption loop. https://github.com/msbutler/cockroach/blob/butler-remove-deprecated-restore-checkpointing/pkg/sql/sqlliveness/slbase/slbase.go#L20
roachtest.c2c/shutdown/dest/coordinator failed with artifacts on release-24.1 @ afdd6a54182ba317c88db2725777d30b61e370a1:
(latency_verifier.go:198).assertValid: max latency was more than allowed: 2m0.3000523s vs 2m0s
test artifacts and logs in: /artifacts/c2c/shutdown/dest/coordinator/run_1
Parameters:
- arch=amd64
- cloud=gce
- coverageBuild=false
- cpu=8
- encrypted=false
- fs=ext4
- localSSD=true
- runtimeAssertionsBuild=false
- ssd=0
Fixed by https://github.com/cockroachdb/cockroach/pull/137292