cockroach icon indicating copy to clipboard operation
cockroach copied to clipboard

roachtest: c2c/shutdown/dest/coordinator failed

Open cockroach-teamcity opened this issue 1 year ago • 1 comments

roachtest.c2c/shutdown/dest/coordinator failed with artifacts on release-24.1 @ e5b1d125bf8cde9b4d47f4303b8a76ec735ca082:

(latency_verifier.go:198).assertValid: max latency was more than allowed: 2m0.290093216s vs 2m0s
test artifacts and logs in: /artifacts/c2c/shutdown/dest/coordinator/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=8
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

  • #131165 roachtest: c2c/shutdown/dest/coordinator failed [C-test-failure O-roachtest O-robot P-3 T-disaster-recovery branch-release-24.2]
  • #128742 roachtest: c2c/shutdown/dest/coordinator failed [C-test-failure O-roachtest O-robot P-3 T-disaster-recovery branch-release-24.1.3-rc]
/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-43245

cockroach-teamcity avatar Oct 16 '24 06:10 cockroach-teamcity

roachtest.c2c/shutdown/dest/coordinator failed with artifacts on release-24.1 @ 39bae3f4961c14c890d6140e9268d2fbf0ca324a:

(latency_verifier.go:198).assertValid: max latency was more than allowed: 2m0.852996135s vs 2m0s
(monitor.go:149).Wait: monitor failure: monitor user task failed: t.Fatal() was called
test artifacts and logs in: /artifacts/c2c/shutdown/dest/coordinator/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=8
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

  • #131165 roachtest: c2c/shutdown/dest/coordinator failed [C-test-failure O-roachtest O-robot P-3 T-disaster-recovery branch-release-24.2]
  • #128742 roachtest: c2c/shutdown/dest/coordinator failed [C-test-failure O-roachtest O-robot P-3 T-disaster-recovery branch-release-24.1.3-rc]

This test on roachdash | Improve this report!

cockroach-teamcity avatar Oct 20 '24 08:10 cockroach-teamcity

There may actually be smoke here. the max latency from this shutdown isn't actually a hair over 2 minutes-- rather, the latency verifier exits as soon as it sees latency above 2 minutes....

To construct a timeline of this test:

  • 6:47:47 node shutdown occurs on node 7
06:47:48 monitor.go:203: Monitor event: n7: error: no cockroach processes running
  • 6:48:46 node 5 picks up the job, i.e. a minute after node 7 shut the job down. the base adoption interval is 30 seconds, so i'm not sure why a node didn't pick up the job after the first 30 seconds had passed.
5.unredacted/cockroach.log:I241016 06:48:46.980619 25353 ccl/streamingccl/streamingest/stream_ingestion_job.go:408 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=1012501106125406211›] 520  empty cutover time, no revert required
5.unredacted/cockroach.log:I241016 06:48:46.986365 25353 ccl/streamingccl/streamingest/stream_ingestion_job.go:93 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=1012501106125406211›] 521  resuming stream (producer job 1012501106433556484) from 1729061216.168387056,0
  • Furthermore, by the time node 5 picks up the job, the replication lag is already at 1m:47s, so the stream only has 13 seconds to complete it's catchups scans, before the latency verifier fails
06:48:43 latency_verifier.go:140: test status: watching stream-ingestion: end-to-end steady latency 1m47.788s; max steady latency so far 1m47.788s; highwater 2024-10-16 06:46:56.168387056 +0000 UTC

Given that cutover succeeded, the stream was able to catch up after the node shutdown.

Next steps i think are:

  • understand why the job wasn't adopted after 1 30s interval
  • consider bumping the max latency in the test or changing how the latency verifier works.

msbutler avatar Oct 21 '24 18:10 msbutler

aha! the default session liveness is 40 seconds, so after the node was sigkilled, it's session could not be destroyed until after the second adoption loop! https://github.com/msbutler/cockroach/blob/butler-remove-deprecated-restore-checkpointing/pkg/sql/sqlliveness/slbase/slbase.go#L20

msbutler avatar Oct 21 '24 18:10 msbutler

roachtest.c2c/shutdown/dest/coordinator failed with artifacts on release-24.1 @ afdd6a54182ba317c88db2725777d30b61e370a1:

(latency_verifier.go:198).assertValid: max latency was more than allowed: 2m0.3000523s vs 2m0s
test artifacts and logs in: /artifacts/c2c/shutdown/dest/coordinator/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=8
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

Same failure on other branches

  • #131165 roachtest: c2c/shutdown/dest/coordinator failed [C-test-failure O-roachtest O-robot P-3 T-disaster-recovery branch-release-24.2]
  • #128742 roachtest: c2c/shutdown/dest/coordinator failed [C-test-failure O-roachtest O-robot P-3 T-disaster-recovery branch-release-24.1.3-rc]

This test on roachdash | Improve this report!

cockroach-teamcity avatar Dec 10 '24 07:12 cockroach-teamcity

fixed https://github.com/cockroachdb/cockroach/pull/137292

msbutler avatar Jan 15 '25 16:01 msbutler