roachtest: disk-stalled/wal-failover/among-stores failed
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ dafb6dd507b38fb3d6eb8b7e2493c7b8abed34d2:
(disk_stall.go:172).runDiskStalledWALFailover: unexpectedly high p99.99 latency 2.281735339s at 2024-08-30T10:25:00Z
(cluster.go:2436).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
ROACHTEST_arch=amd64 ROACHTEST_cloud=gce ROACHTEST_coverageBuild=false ROACHTEST_cpu=16 ROACHTEST_encrypted=false ROACHTEST_fs=ext4 ROACHTEST_localSSD=true ROACHTEST_runtimeAssertionsBuild=false ROACHTEST_ssd=2
Jira issue: CRDB-41774
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 4142920c2d5c50c0520c124764aeeda94ba043ae:
(disk_stall.go:172).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.326598366s at 2024-09-03T11:14:00Z
(cluster.go:2444).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
ROACHTEST_arch=amd64 ROACHTEST_cloud=gce ROACHTEST_coverageBuild=false ROACHTEST_cpu=16 ROACHTEST_encrypted=true ROACHTEST_fs=ext4 ROACHTEST_localSSD=true ROACHTEST_runtimeAssertionsBuild=false ROACHTEST_ssd=2
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ fa9c0528fc0d06be1b4cfc534ec0501448111fbe:
(disk_stall.go:159).runDiskStalledWALFailover: process exited unexectedly
(cluster.go:2451).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
ROACHTEST_arch=amd64 ROACHTEST_cloud=gce ROACHTEST_coverageBuild=false ROACHTEST_cpu=16 ROACHTEST_encrypted=true ROACHTEST_fs=ext4 ROACHTEST_localSSD=true ROACHTEST_runtimeAssertionsBuild=false ROACHTEST_ssd=2
The second failure (https://github.com/cockroachdb/cockroach/issues/129922#issuecomment-2335168898) is a test flake caused by the test injecting too long a stall. The test intends to inject a 30s stall; a stall of 60s or more results in a fatal error on the node, because COCKROACH_LOG_MAX_SYNC_DURATION is set to 60s. In this run the stall lasted 77s, from 11:17:45 to 11:19:02:
2024/09/07 11:16:50 disk_stall.go:126: test status: pausing 54.985888517s before next simulated disk stall on n1
2024/09/07 11:17:45 cluster.go:2471: running cmd sudo dmsetup suspend --nofl... on nodes [:1]
2024/09/07 11:17:45 cluster.go:2473: details in run_111745.066909574_n1_sudo-dmsetup-suspend.log
2024/09/07 11:19:02 cluster.go:2471: running cmd sudo dmsetup resume data1 on nodes [:1]
And n1 dies due to this stall:
F240907 11:18:46.126128 989637 storage/pebble.go:1530 ⋮ [n1,s1,pebble] 1727 disk stall detected: disk slowness detected: syncdata on file 008404.log has been ongoing for 60.2s
In the first failure, n1 loses leases, shows no disk reads, and experiences slot exhaustion.
failure: 2024/08/30 11:09:40 test_impl.go:423: test failure #1: full stack retained in failure_1.log: (disk_stall.go:172).runDiskStalledWALFailover: unexpectedly high p99.99 latency 2.281735339s at 2024-08-30T10:25:00Z
corresponding stall:
2024/08/30 10:24:39 cluster.go:2456: running cmd sudo dmsetup suspend --nofl... on nodes [:1]
2024/08/30 10:24:39 cluster.go:2458: details in run_102439.469185248_n1_sudo-dmsetup-suspend.log
2024/08/30 10:25:10 cluster.go:2456: running cmd sudo dmsetup resume data1 on nodes [:1]
2024/08/30 10:25:10 cluster.go:2458: details in run_102510.223616708_n1_sudo-dmsetup-resume-.log
This is similar to the failure in https://github.com/cockroachdb/cockroach/issues/124399#issuecomment-2123074288
Note that multiple stalls have a p100 of 10+s. The failure occurs in a stall where lower percentiles are also slow. This suggests that the disk read bytes metric (which is always 0) does not tell the whole story of what gets stuck: if nothing were getting stuck, even the p100 would consistently stay low.
The p99.99 for Raft logcommit is also 10+s during the stall that caused the failure, but the write_and_sync latency for the WAL writer has a p100 of ~150ms (due to WAL failover). This suggests that some code above the WAL writer (in Pebble or CockroachDB) is observing the stall, and that it isn't necessarily reads, since Raft logcommit does not do reads.
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 833dadd212fa4b12b1442ae8e00e85ee80a8cdce:
(cluster.go:2336).Start: COMMAND_PROBLEM: exit status 1
(cluster.go:2449).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
ROACHTEST_arch=amd64 ROACHTEST_cloud=gce ROACHTEST_coverageBuild=false ROACHTEST_cpu=16 ROACHTEST_encrypted=false ROACHTEST_fs=ext4 ROACHTEST_localSSD=true ROACHTEST_runtimeAssertionsBuild=true ROACHTEST_ssd=2
Same failure on other branches
- #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-robot P-3 T-storage branch-release-24.2.3-rc]
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 472ea07a5232c98536293d13bb46cca59f9f2cd0:
(cluster.go:2336).Start: COMMAND_PROBLEM: exit status 1
(cluster.go:2449).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
ROACHTEST_arch=amd64 ROACHTEST_cloud=gce ROACHTEST_coverageBuild=false ROACHTEST_cpu=16 ROACHTEST_encrypted=false ROACHTEST_fs=ext4 ROACHTEST_localSSD=true ROACHTEST_runtimeAssertionsBuild=false ROACHTEST_ssd=2
Same failure on other branches
- #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-robot P-3 T-storage branch-release-24.2.3-rc]
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 472ea07a5232c98536293d13bb46cca59f9f2cd0:
(cluster.go:2336).Start: COMMAND_PROBLEM: exit status 1
(cluster.go:2449).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
ROACHTEST_arch=amd64 ROACHTEST_cloud=gce ROACHTEST_coverageBuild=false ROACHTEST_cpu=16 ROACHTEST_encrypted=false ROACHTEST_fs=ext4 ROACHTEST_localSSD=true ROACHTEST_runtimeAssertionsBuild=false ROACHTEST_ssd=2
Same failure on other branches
- #132988 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot T-storage branch-release-24.3 release-blocker]
- #132983 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.2.4-rc release-blocker]
- #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-robot P-3 T-storage branch-release-24.2.3-rc]
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 472ea07a5232c98536293d13bb46cca59f9f2cd0:
(cluster.go:2336).Start: COMMAND_PROBLEM: exit status 1
(cluster.go:2449).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
ROACHTEST_arch=amd64 ROACHTEST_cloud=gce ROACHTEST_coverageBuild=false ROACHTEST_cpu=16 ROACHTEST_encrypted=false ROACHTEST_fs=ext4 ROACHTEST_localSSD=true ROACHTEST_runtimeAssertionsBuild=false ROACHTEST_ssd=2
Same failure on other branches
- #132988 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot T-storage branch-release-24.3 release-blocker]
- #132983 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.2.4-rc release-blocker]
- #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-robot P-3 T-storage branch-release-24.2.3-rc]
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 1e5b3c212b45419c960038718c48a5dd75a111a0:
(cluster.go:2336).Start: COMMAND_PROBLEM: exit status 1
(cluster.go:2449).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
ROACHTEST_arch=amd64 ROACHTEST_cloud=gce ROACHTEST_coverageBuild=false ROACHTEST_cpu=16 ROACHTEST_encrypted=false ROACHTEST_fs=ext4 ROACHTEST_localSSD=true ROACHTEST_runtimeAssertionsBuild=false ROACHTEST_ssd=2
Same failure on other branches
- #132988 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot T-storage branch-release-24.3]
- #132983 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.2.4-rc release-blocker]
- #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-robot P-3 T-storage branch-release-24.2.3-rc]
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 787f2e3fe5f73b33fcd65485908cbb71e0991222:
(cluster.go:2336).Start: COMMAND_PROBLEM: exit status 1
(cluster.go:2449).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
ROACHTEST_arch=amd64 ROACHTEST_cloud=gce ROACHTEST_coverageBuild=false ROACHTEST_cpu=16 ROACHTEST_encrypted=false ROACHTEST_fs=ext4 ROACHTEST_localSSD=true ROACHTEST_runtimeAssertionsBuild=false ROACHTEST_ssd=2
Same failure on other branches
- #132983 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.2.4-rc release-blocker]
- #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-robot P-3 T-storage branch-release-24.2.3-rc]
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 5a7850a72f941992b1bb4b23a73b5fa5e9f15a68:
(disk_stall.go:145).runDiskStalledWALFailover: process exited unexpectedly
(cluster.go:2456).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
arch=amd64 cloud=gce coverageBuild=false cpu=16 encrypted=true fs=ext4 localSSD=true metamorphicLeases=default runtimeAssertionsBuild=false ssd=2
Same failure on other branches
- #136428 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot T-storage branch-release-24.3.0-rc release-blocker]
- #136355 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.2 release-blocker]
- #135983 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.3]
- #133804 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.1]
- #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.2.3-rc]
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 9354770c7c6eb5a89437068d8c6a4accf8031b67:
(disk_stall.go:145).runDiskStalledWALFailover: process exited unexpectedly
(cluster.go:2481).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
arch=amd64 cloud=gce coverageBuild=false cpu=16 encrypted=true fs=ext4 localSSD=true metamorphicLeases=epoch runtimeAssertionsBuild=false ssd=2
Same failure on other branches
- #136428 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot T-storage branch-release-24.3.0-rc]
- #136355 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.2]
- #135983 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.3]
- #133804 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.1]
- #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.2.3-rc]
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ efacd11db5f357a69f8b8fd0b10148028d87ed36:
(disk_stall.go:158).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.023664807s at 2025-01-12T10:59:00Z
(cluster.go:2499).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
arch=amd64 cloud=gce coverageBuild=false cpu=16 encrypted=true fs=ext4 localSSD=true metamorphicLeases=default runtimeAssertionsBuild=true ssd=2
Same failure on other branches
- #136355 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.2]
- #135983 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.3]
- #133804 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.1]
- #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.2.3-rc]
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 31e84cb3a57c52a779ff0982c95fb26646b54926:
(disk_stall.go:158).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.147251281s at 2025-01-13T11:42:00Z
(cluster.go:2499).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
arch=amd64 cloud=gce coverageBuild=false cpu=16 encrypted=true fs=ext4 localSSD=true metamorphicLeases=default runtimeAssertionsBuild=true ssd=2
Same failure on other branches
- #136355 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.2]
- #135983 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.3]
- #133804 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.1]
- #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.2.3-rc]
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 0b4d620740733ec61cf50ca26d19814299d91f8e:
(disk_stall.go:158).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.086970059s at 2025-01-15T11:59:00Z
(disk_stall.go:158).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.076571732s at 2025-01-15T12:00:00Z
(disk_stall.go:158).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.109413442s at 2025-01-15T12:03:00Z
(cluster.go:2478).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
arch=amd64 cloud=gce coverageBuild=false cpu=16 encrypted=true fs=ext4 localSSD=true metamorphicLeases=default runtimeAssertionsBuild=true ssd=2
Same failure on other branches
- #136355 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.2]
- #135983 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.3]
- #133804 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.1]
- #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.2.3-rc]
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 87f4821ccbbd683c4de29dfc06c43de806459ca4:
(disk_stall.go:158).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.364494916s at 2025-01-18T10:33:00Z
(cluster.go:2481).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
arch=amd64 cloud=gce coverageBuild=false cpu=16 encrypted=false fs=ext4 localSSD=true metamorphicLeases=default runtimeAssertionsBuild=true ssd=2
Same failure on other branches
- #139321 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-25.1 release-blocker]
- #136355 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.2]
- #135983 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.3]
- #133804 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.1]
- #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.2.3-rc]
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 93fb203a469911c4a3ca7fb79f9a94adcb38689d:
(disk_stall.go:158).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.140714595s at 2025-01-22T11:13:00Z
(cluster.go:2481).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
arch=amd64 cloud=gce coverageBuild=false cpu=16 encrypted=false fs=ext4 localSSD=true metamorphicLeases=expiration runtimeAssertionsBuild=true ssd=2
Same failure on other branches
- #139321 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-25.1 release-blocker]
- #136355 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.2]
- #135983 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.3]
- #133804 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.1]
- #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.2.3-rc]
The recent failures in builds with runtime assertions may be related to the expensive assertions within Pebble addressed in cockroachdb/pebble#4279 and cockroachdb/pebble#4278.
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ a616c80e5c69c33ab1df58473eb0c3d0c522df36:
(disk_stall.go:150).runDiskStalledWALFailover: process exited unexpectedly
(cluster.go:2481).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
arch=amd64 cloud=gce coverageBuild=false cpu=16 encrypted=true fs=ext4 localSSD=true metamorphicLeases=default runtimeAssertionsBuild=false ssd=2
Same failure on other branches
- #136355 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.2]
- #133804 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.1]
- #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.2.3-rc]
Another instance of the test runner leaving the disk stalled for far too long (roughly 70s instead of the intended 30s):
2025/01/24 10:42:59 disk_stall.go:116: test status: Stalling disk on n1
2025/01/24 10:42:59 cluster.go:2501: running cmd `sudo dmsetup suspend --nofl...` on nodes [:1]
2025/01/24 10:42:59 cluster.go:2503: details in run_104259.630053138_n1_sudo-dmsetup-suspend.log
2025/01/24 10:43:00 disk_stall.go:119: test status: Stalled disk on n1
2025/01/24 10:43:00 disk_stall.go:132: test status: waiting for 30s to elapse before unstalling
2025/01/24 10:44:09 disk_stall.go:127: test status: Unstalling disk on n1
2025/01/24 10:44:09 cluster.go:2501: running cmd `sudo dmsetup resume data1` on nodes [:1]
2025/01/24 10:44:09 cluster.go:2503: details in run_104409.015407178_n1_sudo-dmsetup-resume-.log
2025/01/24 10:44:09 disk_stall.go:129: test status: Unstalled disk on n1
What's going on between 10:43:00 and 10:44:09? Is the ticker not ticking? Is there a minute-long Go runtime stop-the-world event?
Overly long stalls seem to be a pervasive issue, described in #138904.
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 3acb4cd1a4369c93718975681e228ffd74007832:
(disk_stall.go:150).runDiskStalledWALFailover: process exited unexpectedly
(cluster.go:2481).Run: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
arch=amd64 cloud=gce coverageBuild=false cpu=16 encrypted=false fs=ext4 localSSD=true metamorphicLeases=expiration runtimeAssertionsBuild=true ssd=2
Same failure on other branches
- #136355 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-24.2]
- #133804 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.1]
- #131553 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage branch-release-24.2.3-rc]
Confirming that the previous two failures overlapped with cdc/tpcc-1000/sink=kafka.
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 8403878059ac5f01003cf86c90fd53c54b9b8d58:
(disk_stall.go:169).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.700866365s at 2025-07-23T08:40:00Z
(disk_stall.go:183).Cleanup: failed to cleanup disk stall: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
arch=amd64 cloud=gce coverageBuild=false cpu=16 encrypted=false fs=ext4 localSSD=true metamorphicBufferedSender=true metamorphicLeases=leader runtimeAssertionsBuild=true ssd=2
Same failure on other branches
- #150616 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-25.3.0-rc release-blocker]
- #150487 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot T-storage branch-release-25.3 release-blocker]
- #150099 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot T-storage branch-release-24.1.21-rc]
- #133804 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage X-unactionable branch-release-24.1]
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 4a0f95aa0f11360e85a3221a8563a521d3f5499b:
(disk_stall.go:169).runDiskStalledWALFailover: unexpectedly high p99.99 latency 3.223302145s at 2025-07-24T10:39:00Z
(disk_stall.go:169).runDiskStalledWALFailover: unexpectedly high p99.99 latency 3.124708555s at 2025-07-24T10:40:00Z
(disk_stall.go:183).Cleanup: failed to cleanup disk stall: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
arch=amd64 cloud=gce coverageBuild=false cpu=16 encrypted=true fs=ext4 localSSD=true metamorphicLeases=default metamorphicWriteBuffering=true runtimeAssertionsBuild=true ssd=2
Same failure on other branches
- #150616 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-25.3.0-rc release-blocker]
- #150487 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot T-storage branch-release-25.3 release-blocker]
- #150099 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot T-storage branch-release-24.1.21-rc]
- #133804 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage X-unactionable branch-release-24.1]
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ b8c2405930718b735bf009c802798b4918b66631:
(disk_stall.go:169).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.109629526s at 2025-07-25T08:54:00Z
(disk_stall.go:169).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.774687992s at 2025-07-25T08:55:00Z
(disk_stall.go:183).Cleanup: failed to cleanup disk stall: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
arch=amd64 cloud=gce coverageBuild=false cpu=16 encrypted=true fs=ext4 localSSD=true metamorphicBufferedSender=true metamorphicLeases=epoch runtimeAssertionsBuild=true ssd=2
Same failure on other branches
- #150616 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-25.3.0-rc release-blocker]
- #150487 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot T-storage branch-release-25.3 release-blocker]
- #150099 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot T-storage branch-release-24.1.21-rc]
- #133804 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage X-unactionable branch-release-24.1]
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ afef2d239966541de7611883fbb803f08c0fed92:
(disk_stall.go:169).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.193526716s at 2025-07-26T09:21:00Z
(disk_stall.go:183).Cleanup: failed to cleanup disk stall: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
arch=amd64 cloud=gce coverageBuild=false cpu=16 encrypted=true fs=ext4 localSSD=true metamorphicBufferedSender=true metamorphicLeases=expiration metamorphicWriteBuffering=true runtimeAssertionsBuild=false ssd=2
Same failure on other branches
- #150616 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-25.3.0-rc release-blocker]
- #150487 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot T-storage branch-release-25.3 release-blocker]
- #150099 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot T-storage branch-release-24.1.21-rc]
- #133804 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage X-unactionable branch-release-24.1]
Note: This build has runtime assertions enabled. If the same failure was hit in a run without assertions enabled, there should be a similar failure without this message. If there isn't one, then this failure is likely due to an assertion violation or (assertion) timeout.
roachtest.disk-stalled/wal-failover/among-stores failed with artifacts on master @ 8765705442d59920da3424d770a76ee50f1eee06:
(disk_stall.go:169).runDiskStalledWALFailover: unexpectedly high p99.99 latency 1.310008015s at 2025-07-31T10:54:00Z
(disk_stall.go:183).Cleanup: failed to cleanup disk stall: context canceled
test artifacts and logs in: /artifacts/disk-stalled/wal-failover/among-stores/run_1
Parameters:
arch=amd64 cloud=gce coverageBuild=false cpu=16 encrypted=false fs=ext4 localSSD=true metamorphicBufferedSender=true metamorphicLeases=epoch metamorphicWriteBuffering=true runtimeAssertionsBuild=true ssd=2
Same failure on other branches
- #150616 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage B-runtime-assertions-enabled C-test-failure O-roachtest O-robot T-storage branch-release-25.3.0-rc release-blocker]
- #150487 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot T-storage branch-release-25.3 release-blocker]
- #150099 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot T-storage branch-release-24.1.21-rc]
- #133804 roachtest: disk-stalled/wal-failover/among-stores failed [A-storage C-test-failure O-roachtest O-robot P-3 T-storage X-unactionable branch-release-24.1]
Some of the recent failures were caused by https://github.com/cockroachdb/cockroach/issues/151051. Leaving this issue open since it looked like there was a separate pre-existing issue.