cockroach icon indicating copy to clipboard operation
cockroach copied to clipboard

roachtest: replicate/wide failed

Open cockroach-teamcity opened this issue 9 months ago • 3 comments

roachtest.replicate/wide failed with artifacts on master @ 6300c3c3367ad46ac48bf24915cf0d73cae446a0:

(allocator.go:359).func1: dial tcp 20.102.91.133:26257: connect: connection refused
test artifacts and logs in: /artifacts/replicate/wide/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=azure
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=1
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for azure clusters

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-38768

cockroach-teamcity avatar May 15 '24 13:05 cockroach-teamcity

Node 1 reports slow heartbeats, and terminates after detecting disk stall:

F240515 12:54:52.692794 150690 1@util/log/file.go:269 â‹® [-] 265  disk stall detected: unable to sync log files within 20s

10s later, the roachtest could not connect to the node:

12:55:01 test_impl.go:414: test failure #1: full stack retained in failure_1.log: (allocator.go:359).func1: dial tcp 20.102.91.133:26257: connect: connection refused

pav-kv avatar May 15 '24 13:05 pav-kv

OTOH, the node 1 is asked to be decommissioned in this test, prior to the disk stall. I wonder if these two events correlate.

12:53:52 cluster.go:2371: running cmd `./cockroach node decommissi...` on nodes [:1]; details in run_125352.624956031_n1_cockroach-node-decom.log
12:53:53 allocator.go:403: 68 mis-replicated ranges
12:53:54 allocator.go:403: 47 mis-replicated ranges
12:53:55 allocator.go:403: 10 mis-replicated ranges
12:53:56 allocator.go:403: 0 mis-replicated ranges
12:53:57 allocator.go:374: SET CLUSTER SETTING server.time_until_store_dead = '90s'

F240515 12:54:52.692794 150690 1@util/log/file.go:269 â‹® [-] 265  disk stall detected: unable to sync log files within 20s

12:55:01 test_impl.go:414: test failure #1: full stack retained in failure_1.log: (allocator.go:359).func1: dial tcp 20.102.91.133:26257: connect: connection refused

pav-kv avatar May 15 '24 13:05 pav-kv

Noting that this is on Azure:

ROACHTEST_cloud=azure

Which has had a greater degree of instability in roachtests recently. I'd be pro chalking up to an infra flake.

kvoli avatar May 15 '24 17:05 kvoli