roachtest: disk-stalled/wal-failover/among-stores failed

Open cockroach-teamcity opened this issue 1 year ago • 8 comments

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ dafb6dd507b38fb3d6eb8b7e2493c7b8abed34d2:

(disk_stall.go:172).runDiskStalledWALFailover: unexpectedly high p99.99 latency 2.281735339s at 2024-08-30T10:25:00Z
(cluster.go:2436).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=2
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/storage

This test on roachdash | Improve this report!

Jira issue: CRDB-41774

cockroach-teamcity avatar Aug 30 '24 11:08 cockroach-teamcity

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 4142920c2d5c50c0520c124764aeeda94ba043ae:

(disk_stall.go:172).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.326598366s at 2024-09-03T11:14:00Z
(cluster.go:2444).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=true
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=2

cockroach-teamcity avatar Sep 03 '24 12:09 cockroach-teamcity

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ fa9c0528fc0d06be1b4cfc534ec0501448111fbe:

(disk_stall.go:159).runDiskStalledWALFailover: process exited unexectedly
(cluster.go:2451).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=true
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=2

cockroach-teamcity avatar Sep 07 '24 12:09 cockroach-teamcity

The second failure https://github.com/cockroachdb/cockroach/issues/129922#issuecomment-2335168898 is a test flake due to injecting too long a stall. The test attempts to inject a 30s stall, and a 60s stall would result in a fatal error in the node (COCKROACH_LOG_MAX_SYNC_DURATION is set to 60s). But we see the test injecting a longer stall from 11:17:45 to 11:19:02:

2024/09/07 11:16:50 disk_stall.go:126: test status: pausing 54.985888517s before next simulated disk stall on n1
2024/09/07 11:17:45 cluster.go:2471: running cmd sudo dmsetup suspend --nofl... on nodes [:1]
2024/09/07 11:17:45 cluster.go:2473: details in run_111745.066909574_n1_sudo-dmsetup-suspend.log
2024/09/07 11:19:02 cluster.go:2471: running cmd sudo dmsetup resume data1 on nodes [:1]

And n1 dies due to this stall:

F240907 11:18:46.126128 989637 storage/pebble.go:1530 ⋮ [n1,s1,pebble] 1727 disk stall detected: disk slowness detected: syncdata on file 008404.log has been ongoing for 60.2s

sumeerbhola avatar Sep 11 '24 18:09 sumeerbhola

In the first failure, n1 loses leases, has no disk reads, and has slot exhaustion.

failure: 2024/08/30 11:09:40 test_impl.go:423: test failure #1: full stack retained in failure_1.log: (disk_stall.go:172).runDiskStalledWALFailover: unexpectedly high p99.99 latency 2.281735339s at 2024-08-30T10:25:00Z

corresponding stall:

2024/08/30 10:24:39 cluster.go:2456: running cmd sudo dmsetup suspend --nofl... on nodes [:1]
2024/08/30 10:24:39 cluster.go:2458: details in run_102439.469185248_n1_sudo-dmsetup-suspend.log
2024/08/30 10:25:10 cluster.go:2456: running cmd sudo dmsetup resume data1 on nodes [:1]
2024/08/30 10:25:10 cluster.go:2458: details in run_102510.223616708_n1_sudo-dmsetup-resume-.log

This is similar to the failure in https://github.com/cockroachdb/cockroach/issues/124399#issuecomment-2123074288

One thing to note is that multiple stalls have a p100 of 10+s. The failure happens due to a stall where lower percentiles are also slow. That suggests that our disk read bytes (which are always 0) are not telling the whole story of what gets stuck: if nothing were getting stuck, even the p100 would consistently stay low.

Screenshot 2024-09-11 at 3 24 05 PM

sumeerbhola avatar Sep 11 '24 19:09 sumeerbhola

p99.99 for Raft logcommit is also 10+s during the stall that caused the failure, but the write_and_sync latency for the WAL writer has a p100 of ~150ms (due to WAL failover). This suggests that some code above the WAL writer (in Pebble or CockroachDB) is observing the stall, and that it isn't necessarily reads (since Raft logcommit does not do reads).

Screenshot 2024-09-13 at 8 14 43 AM

Screenshot 2024-09-13 at 8 29 48 AM

sumeerbhola avatar Sep 13 '24 12:09 sumeerbhola

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 833dadd212fa4b12b1442ae8e00e85ee80a8cdce:

(cluster.go:2336).Start: COMMAND_PROBLEM: exit status 1
(cluster.go:2449).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=true
  • ROACHTEST_ssd=2

Same failure on other branches

  • #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-robot P-3 T-storage branch-release-24.2.3-rc]

cockroach-teamcity avatar Oct 18 '24 08:10 cockroach-teamcity

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 472ea07a5232c98536293d13bb46cca59f9f2cd0:

(cluster.go:2336).Start: COMMAND_PROBLEM: exit status 1
(cluster.go:2449).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=2

Same failure on other branches

  • #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-robot P-3 T-storage branch-release-24.2.3-rc]

cockroach-teamcity avatar Oct 19 '24 09:10 cockroach-teamcity

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 472ea07a5232c98536293d13bb46cca59f9f2cd0:

(cluster.go:2336).Start: COMMAND_PROBLEM: exit status 1
(cluster.go:2449).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=2

Same failure on other branches

  • #132988 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot T-storage branch-release-24.3 release-blocker]
  • #132983 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.2.4-rc release-blocker]
  • #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-robot P-3 T-storage branch-release-24.2.3-rc]

cockroach-teamcity avatar Oct 20 '24 09:10 cockroach-teamcity

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 472ea07a5232c98536293d13bb46cca59f9f2cd0:

(cluster.go:2336).Start: COMMAND_PROBLEM: exit status 1
(cluster.go:2449).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=2

Same failure on other branches

  • #132988 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot T-storage branch-release-24.3 release-blocker]
  • #132983 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.2.4-rc release-blocker]
  • #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-robot P-3 T-storage branch-release-24.2.3-rc]

cockroach-teamcity avatar Oct 21 '24 09:10 cockroach-teamcity

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 1e5b3c212b45419c960038718c48a5dd75a111a0:

(cluster.go:2336).Start: COMMAND_PROBLEM: exit status 1
(cluster.go:2449).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=2

Same failure on other branches

  • #132988 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot T-storage branch-release-24.3]
  • #132983 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.2.4-rc release-blocker]
  • #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-robot P-3 T-storage branch-release-24.2.3-rc]

cockroach-teamcity avatar Oct 22 '24 09:10 cockroach-teamcity

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 787f2e3fe5f73b33fcd65485908cbb71e0991222:

(cluster.go:2336).Start: COMMAND_PROBLEM: exit status 1
(cluster.go:2449).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=16
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=2

Same failure on other branches

  • #132983 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.2.4-rc release-blocker]
  • #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-robot P-3 T-storage branch-release-24.2.3-rc]

cockroach-teamcity avatar Oct 23 '24 09:10 cockroach-teamcity

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 5a7850a72f941992b1bb4b23a73b5fa5e9f15a68:

(disk_stall.go:145).runDiskStalledWALFailover: process exited unexpectedly
(cluster.go:2456).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • encrypted=true
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=default
  • runtimeAssertionsBuild=false
  • ssd=2

Same failure on other branches

  • #136428 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot T-storage branch-release-24.3.0-rc release-blocker]
  • #136355 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.2 release-blocker]
  • #135983 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.3]
  • #133804 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.1]
  • #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.2.3-rc]

cockroach-teamcity avatar Dec 05 '24 01:12 cockroach-teamcity

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 9354770c7c6eb5a89437068d8c6a4accf8031b67:

(disk_stall.go:145).runDiskStalledWALFailover: process exited unexpectedly
(cluster.go:2481).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • encrypted=true
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=epoch
  • runtimeAssertionsBuild=false
  • ssd=2

Same failure on other branches

  • #136428 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot T-storage branch-release-24.3.0-rc]
  • #136355 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.2]
  • #135983 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.3]
  • #133804 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.1]
  • #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.2.3-rc]

cockroach-teamcity avatar Dec 16 '24 11:12 cockroach-teamcity

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ efacd11db5f357a69f8b8fd0b10148028d87ed36:

(disk_stall.go:158).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.023664807s at 2025-01-12T10:59:00Z
(cluster.go:2499).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • encrypted=true
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=default
  • runtimeAssertionsBuild=true
  • ssd=2

Same failure on other branches

  • #136355 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.2]
  • #135983 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.3]
  • #133804 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.1]
  • #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.2.3-rc]

cockroach-teamcity avatar Jan 12 '25 11:01 cockroach-teamcity

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 31e84cb3a57c52a779ff0982c95fb26646b54926:

(disk_stall.go:158).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.147251281s at 2025-01-13T11:42:00Z
(cluster.go:2499).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • encrypted=true
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=default
  • runtimeAssertionsBuild=true
  • ssd=2

Same failure on other branches

  • #136355 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.2]
  • #135983 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.3]
  • #133804 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.1]
  • #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.2.3-rc]

cockroach-teamcity avatar Jan 13 '25 11:01 cockroach-teamcity

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 0b4d620740733ec61cf50ca26d19814299d91f8e:

(disk_stall.go:158).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.086970059s at 2025-01-15T11:59:00Z
(disk_stall.go:158).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.076571732s at 2025-01-15T12:00:00Z
(disk_stall.go:158).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.109413442s at 2025-01-15T12:03:00Z
(cluster.go:2478).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • encrypted=true
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=default
  • runtimeAssertionsBuild=true
  • ssd=2

Same failure on other branches

  • #136355 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.2]
  • #135983 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.3]
  • #133804 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.1]
  • #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.2.3-rc]

cockroach-teamcity avatar Jan 15 '25 12:01 cockroach-teamcity

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 87f4821ccbbd683c4de29dfc06c43de806459ca4:

(disk_stall.go:158).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.364494916s at 2025-01-18T10:33:00Z
(cluster.go:2481).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=default
  • runtimeAssertionsBuild=true
  • ssd=2

Same failure on other branches

  • #139321 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-25.1 release-blocker]
  • #136355 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.2]
  • #135983 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.3]
  • #133804 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.1]
  • #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.2.3-rc]

cockroach-teamcity avatar Jan 18 '25 10:01 cockroach-teamcity

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 93fb203a469911c4a3ca7fb79f9a94adcb38689d:

(disk_stall.go:158).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.140714595s at 2025-01-22T11:13:00Z
(cluster.go:2481).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=expiration
  • runtimeAssertionsBuild=true
  • ssd=2

Same failure on other branches

  • #139321 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-25.1 release-blocker]
  • #136355 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.2]
  • #135983 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.3]
  • #133804 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.1]
  • #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.2.3-rc]

cockroach-teamcity avatar Jan 22 '25 11:01 cockroach-teamcity

The recent failures in builds with runtime assertions may be related to the expensive assertions within Pebble addressed in cockroachdb/pebble#4279 and cockroachdb/pebble#4278.

jbowens avatar Jan 23 '25 16:01 jbowens

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ a616c80e5c69c33ab1df58473eb0c3d0c522df36:

(disk_stall.go:150).runDiskStalledWALFailover: process exited unexpectedly
(cluster.go:2481).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • encrypted=true
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=default
  • runtimeAssertionsBuild=false
  • ssd=2

Same failure on other branches

  • #136355 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.2]
  • #133804 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.1]
  • #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.2.3-rc]

cockroach-teamcity avatar Jan 24 '25 10:01 cockroach-teamcity

Another instance of the test runner leaving the disk stalled for way too long.

2025/01/24 10:42:59 disk_stall.go:116: test status: Stalling disk on n1
2025/01/24 10:42:59 cluster.go:2501: running cmd `sudo dmsetup suspend --nofl...` on nodes [:1]
2025/01/24 10:42:59 cluster.go:2503: details in run_104259.630053138_n1_sudo-dmsetup-suspend.log
2025/01/24 10:43:00 disk_stall.go:119: test status: Stalled disk on n1
2025/01/24 10:43:00 disk_stall.go:132: test status: waiting for 30s to elapse before unstalling
2025/01/24 10:44:09 disk_stall.go:127: test status: Unstalling disk on n1
2025/01/24 10:44:09 cluster.go:2501: running cmd `sudo dmsetup resume data1` on nodes [:1]
2025/01/24 10:44:09 cluster.go:2503: details in run_104409.015407178_n1_sudo-dmsetup-resume-.log
2025/01/24 10:44:09 disk_stall.go:129: test status: Unstalled disk on n1

What's going on between 43:00 and 44:09? Is the ticker not ticking? Is there a minute-long Go runtime stop the world event?

jbowens avatar Jan 24 '25 15:01 jbowens

The overly long stall seems to be a pervasive issue, described in #138904.

jbowens avatar Jan 24 '25 15:01 jbowens

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 3acb4cd1a4369c93718975681e228ffd74007832:

(disk_stall.go:150).runDiskStalledWALFailover: process exited unexpectedly
(cluster.go:2481).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=expiration
  • runtimeAssertionsBuild=true
  • ssd=2

Same failure on other branches

  • #136355 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.2]
  • #133804 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.1]
  • #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.2.3-rc]

cockroach-teamcity avatar Jan 27 '25 10:01 cockroach-teamcity

Confirming that the previous two failures overlapped with cdc/tpcc-1000/sink=kafka.

srosenberg avatar Jan 30 '25 05:01 srosenberg

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 8403878059ac5f01003cf86c90fd53c54b9b8d58:

(disk_stall.go:169).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.700866365s at 2025-07-23T08:40:00Z
(disk_stall.go:183).Cleanup: failed to cleanup disk stall: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicBufferedSender=true
  • metamorphicLeases=leader
  • runtimeAssertionsBuild=true
  • ssd=2

Same failure on other branches

  • #150616 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-25.3.0-rc release-blocker]
  • #150487 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot T-storage branch-release-25.3 release-blocker]
  • #150099 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot T-storage branch-release-24.1.21-rc]
  • #133804 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage X-unactionable branch-release-24.1]

cockroach-teamcity avatar Jul 23 '25 09:07 cockroach-teamcity

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 4a0f95aa0f11360e85a3221a8563a521d3f5499b:

(disk_stall.go:169).runDiskStalledWALFailover: unexpectedly high p99.99 latency 3.223302145s at 2025-07-24T10:39:00Z
(disk_stall.go:169).runDiskStalledWALFailover: unexpectedly high p99.99 latency 3.124708555s at 2025-07-24T10:40:00Z
(disk_stall.go:183).Cleanup: failed to cleanup disk stall: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • encrypted=true
  • fs=ext4
  • localSSD=true
  • metamorphicLeases=default
  • metamorphicWriteBuffering=true
  • runtimeAssertionsBuild=true
  • ssd=2

cockroach-teamcity avatar Jul 24 '25 10:07 cockroach-teamcity

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ b8c2405930718b735bf009c802798b4918b66631:

(disk_stall.go:169).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.109629526s at 2025-07-25T08:54:00Z
(disk_stall.go:169).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.774687992s at 2025-07-25T08:55:00Z
(disk_stall.go:183).Cleanup: failed to cleanup disk stall: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • encrypted=true
  • fs=ext4
  • localSSD=true
  • metamorphicBufferedSender=true
  • metamorphicLeases=epoch
  • runtimeAssertionsBuild=true
  • ssd=2

cockroach-teamcity avatar Jul 25 '25 09:07 cockroach-teamcity

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ afef2d239966541de7611883fbb803f08c0fed92:

(disk_stall.go:169).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.193526716s at 2025-07-26T09:21:00Z
(disk_stall.go:183).Cleanup: failed to cleanup disk stall: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • encrypted=true
  • fs=ext4
  • localSSD=true
  • metamorphicBufferedSender=true
  • metamorphicLeases=expiration
  • metamorphicWriteBuffering=true
  • runtimeAssertionsBuild=false
  • ssd=2

cockroach-teamcity avatar Jul 26 '25 10:07 cockroach-teamcity

Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.

roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 8765705442d59920da3424d770a76ee50f1eee06:

(disk_stall.go:169).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.310008015s at 2025-07-31T10:54:00Z
(disk_stall.go:183).Cleanup: failed to cleanup disk stall: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1

Parameters:

  • arch=amd64
  • cloud=gce
  • coverageBuild=false
  • cpu=16
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • metamorphicBufferedSender=true
  • metamorphicLeases=epoch
  • metamorphicWriteBuffering=true
  • runtimeAssertionsBuild=true
  • ssd=2

cockroach-teamcity avatar Jul 31 '25 11:07 cockroach-teamcity

Some of the recent failures were caused by https://github.com/cockroachdb/cockroach/issues/151051. Leaving this issue open because there appears to be a separate, pre-existing problem.

rafiss avatar Jul 31 '25 15:07 rafiss