roachtest: failover/non-system/disk-stall failed
roachtest.failover/non-system/disk-stall failed with artifacts on release-23.1 @ 53c2718ea13bfe632da68d25c64182fdf9648d80:
(disk_stall.go:303).Setup: full command output in run_131337.075895949_n1-7_echo-0-sudo-blockdev.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/failover/non-system/disk-stall/run_1
Parameters:
ROACHTEST_arch=amd64, ROACHTEST_cloud=gce, ROACHTEST_cpu=2, ROACHTEST_encrypted=false, ROACHTEST_fs=ext4, ROACHTEST_localSSD=false, ROACHTEST_metamorphicBuild=false, ROACHTEST_ssd=0
Jira issue: CRDB-41620
run_131337.075895949_n1-7_echo-0-sudo-blockdev: 13:13:37 cluster.go:2164: > echo "0 $(sudo blockdev --getsz /dev/sdb) linear /dev/sdb 0" | sudo dmsetup create data1
teamcity-16608098-1724564397-78-n7cpu2:[1 2 3 4 5 6 7]: echo "0 $(sudo blockdev --g...
3: <err> COMMAND_PROBLEM: exit status 1
device-mapper: reload ioctl on data1 (253:0) failed: Device or resource busy
Command failed.
run_131337.075895949_n1-7_echo-0-sudo-blockdev: 13:13:37 cluster.go:2171: > result: COMMAND_PROBLEM: exit status 1
This failed here:
https://github.dev/cockroachdb/cockroach/blob/53c2718ea13bfe632da68d25c64182fdf9648d80/pkg/cmd/roachtest/tests/failover.go#L773-L776
@andrewbaptist I saw you took https://github.com/cockroachdb/cockroach/issues/129306 out of the queue last week. I dug further and found Austen mentioning it, with a link to a comment of yours claiming that this was fixed on 24.2+. After some poking around the log I thought maybe you were referring to https://github.com/cockroachdb/cockroach/pull/125257, but that's not it: we're not using the cgroup-based disk staller here. Also, the issue still occurred on 24.2 as of two weeks ago: https://github.com/cockroachdb/cockroach/issues/129047. Ironically, that one was closed by referring (indirectly) to the aforementioned comment that the issue had been fixed on 24.2. Anyway, I am going to operate under the assumption that the issue is not fixed.
I then looked through all issues containing the word "blockdev" and most of them seem related. In particular, there's https://github.com/cockroachdb/cockroach/issues/126452 where @jbowens looks through the logs. That issue, too, was eventually closed without resolution.
As a band-aid, I'll try adding a retry loop around this statement and link to this comment.
We have addressed this in newer versions with #123506. That was backported to 23.1 in #126842. @itsbilal do you want to take a look to determine if this is a new failure mode?
@itsbilal here is the file created by
/pkg/cmd/roachtest/roachtestutil/disk_stall.go#L200
s.c.Run(ctx, option.WithNodes(s.c.All()), "sudo bash -c 'ps aux; dmsetup status; mount; lsof'")
s.f.Fatal(err)
And the device is /dev/sdb 253:0
run_131337.075895949_n1-7_echo-0-sudo-blockdev: 13:13:37 cluster.go:2164: > echo "0 $(sudo blockdev --getsz /dev/sdb) linear /dev/sdb 0" | sudo dmsetup create data1
teamcity-16608098-1724564397-78-n7cpu2:[1 2 3 4 5 6 7]: echo "0 $(sudo blockdev --g...
3: <err> COMMAND_PROBLEM: exit status 1
device-mapper: reload ioctl on data1 (253:0) failed: Device or resource busy
The only mention of 253:0 I can see is
root 12492 0.0 0.0 0 0 ? I< 13:13 0:00 [kdmflush/253:0]
which I see six times, so it's on all nodes and probably not what failed the operation on n3.
sdb1 is more promising, there's this on n3:
root 130 0.3 0.0 0 0 ? S 13:11 0:00 [jbd2/sdb1-8]
jbd2/sdb1 130 root cwd DIR 8,17 4096 2 /
jbd2/sdb1 130 root rtd DIR 8,17 4096 2 /
jbd2/sdb1 130 root txt unknown /proc/130/exe
which is not the case on the other nodes. Can you make something of that?
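The manual search above (looking for `253:0` and `sdb1` in the `ps aux` / `lsof` dump) can be sketched as a small filter. This is only an illustration: `linesMentioning` is a hypothetical helper, and the sample input reuses lines quoted in this comment plus one made-up filler line that should not match.

```go
package main

import (
	"fmt"
	"strings"
)

// linesMentioning (hypothetical helper) returns every line of output that
// contains at least one of the given substrings, in original order.
func linesMentioning(output string, needles ...string) []string {
	var hits []string
	for _, line := range strings.Split(output, "\n") {
		for _, needle := range needles {
			if strings.Contains(line, needle) {
				hits = append(hits, line)
				break // record each line at most once
			}
		}
	}
	return hits
}

func main() {
	// First, second, and fourth lines are copied from the comment above;
	// the third is invented filler.
	sample := strings.Join([]string{
		"root 12492 0.0 0.0 0 0 ? I< 13:13 0:00 [kdmflush/253:0]",
		"root 130 0.3 0.0 0 0 ? S 13:11 0:00 [jbd2/sdb1-8]",
		"root 999 0.0 0.0 0 0 ? S 13:10 0:00 [unrelated]",
		"jbd2/sdb1 130 root cwd DIR 8,17 4096 2 /",
	}, "\n")

	for _, line := range linesMentioning(sample, "253:0", "sdb1") {
		fmt.Println(line)
	}
}
```

Running the same filter over each node's dump and comparing the hits is one way to spot entries, like the jbd2 ones on n3, that appear on only one node.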
@tbg jbd2 sounds like the ext4 journaller, right? That could explain a lot of the uncertainty around when this roachtest fails. It used to fail a lot more before https://github.com/cockroachdb/cockroach/pull/123506, so that's the fix the others are referring to.
Maybe we should update the roachtests that rely on disk-stalling behaviour to disable journalling (maybe with tune2fs -O ^has_journal /dev/sda) and see if it helps reduce these flakes. Writes will be slower but this roachtest cares less about that.
Based on the specified backports for linked PR #129864, I applied the following new label(s) to this issue: branch-release-23.2, branch-release-24.1, branch-release-24.2. Please adjust the labels as needed to match the branches actually affected by this issue, including adding any known older branches.
:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.