
roachtest: failover/non-system/disk-stall failed


roachtest.failover/non-system/disk-stall failed with artifacts on release-23.1 @ 53c2718ea13bfe632da68d25c64182fdf9648d80:

(disk_stall.go:303).Setup: full command output in run_131337.075895949_n1-7_echo-0-sudo-blockdev.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/failover/non-system/disk-stall/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_cpu=2
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=false
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-41620

cockroach-teamcity avatar Aug 25 '24 13:08 cockroach-teamcity

run_131337.075895949_n1-7_echo-0-sudo-blockdev: 13:13:37 cluster.go:2164: > echo "0 $(sudo blockdev --getsz /dev/sdb) linear /dev/sdb 0" | sudo dmsetup create data1
teamcity-16608098-1724564397-78-n7cpu2:[1 2 3 4 5 6 7]: echo "0 $(sudo blockdev --g...
   3: 	<err> COMMAND_PROBLEM: exit status 1
	device-mapper: reload ioctl on data1 (253:0) failed: Device or resource busy
	Command failed.
	
run_131337.075895949_n1-7_echo-0-sudo-blockdev: 13:13:37 cluster.go:2171: > result: COMMAND_PROBLEM: exit status 1

This failed here:

https://github.dev/cockroachdb/cockroach/blob/53c2718ea13bfe632da68d25c64182fdf9648d80/pkg/cmd/roachtest/tests/failover.go#L773-L776
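
For orientation, the table piped into `dmsetup create data1` by the failing command has the form `<start> <length> linear <device> <offset>`, with all values in 512-byte sectors (which is what `blockdev --getsz` prints). Below is a minimal sketch of how that line is assembled; the function and example values are illustrative only, not the roachtest code.

	package main

	import "fmt"

	// dmLinearTable builds the device-mapper "linear" table that maps sectors
	// [0, sizeSectors) of the new virtual device one-to-one onto the backing
	// disk, starting at sector 0. sizeSectors is what `blockdev --getsz` prints.
	func dmLinearTable(backingDev string, sizeSectors int64) string {
		return fmt.Sprintf("0 %d linear %s 0", sizeSectors, backingDev)
	}

	func main() {
		// For example, a 400 GiB disk has 838860800 512-byte sectors.
		fmt.Println(dmLinearTable("/dev/sdb", 838860800))
		// Prints: 0 838860800 linear /dev/sdb 0
	}

The table itself is trivial; a `reload ioctl ... Device or resource busy` error typically means something on n3 still held /dev/sdb (or an existing mapping over it) open when device-mapper tried to load the table, not that the table was malformed.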

@andrewbaptist I saw you took https://github.com/cockroachdb/cockroach/issues/129306 out of the queue last week. I dug further and saw Austen mention it here, which links to a comment of yours here claiming that this was fixed on 24.2+. After some poking around the log I thought maybe you were referring to https://github.com/cockroachdb/cockroach/pull/125257, but that's not it: we're not using the cgroup-based disk staller here. The issue also still occurred on 24.2 as of two weeks ago: https://github.com/cockroachdb/cockroach/issues/129047. Ironically, that issue was closed by referring (indirectly) to the aforementioned comment claiming the issue had been fixed on 24.2.

Anyway, I am going to operate under the assumption that the issue is not fixed.

I then looked through all issues containing the word "blockdev" and most of them seem related. In particular, there's https://github.com/cockroachdb/cockroach/issues/126452 where @jbowens looks through the logs. That issue, too, is eventually closed (without resolution).

As a band-aid, I'll try adding a retry loop around this statement and link to this comment.
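
For reference, here is a minimal sketch of what such a retry band-aid could look like; the helper name, retry count, and backoff are illustrative and not the change that actually landed:

	package disksetup

	import (
		"context"
		"fmt"
		"time"
	)

	// createDataDevice retries the dmsetup command that flaked above. `run`
	// stands in for whatever executes a shell command on the node (c.Run in a
	// roachtest).
	func createDataDevice(ctx context.Context, run func(context.Context, string) error) error {
		const cmd = `echo "0 $(sudo blockdev --getsz /dev/sdb) linear /dev/sdb 0" | sudo dmsetup create data1`
		var err error
		for attempt := 1; attempt <= 5; attempt++ {
			if err = run(ctx, cmd); err == nil {
				return nil
			}
			// "Device or resource busy" looks transient, so back off and try again.
			time.Sleep(time.Duration(attempt) * time.Second)
		}
		return fmt.Errorf("dmsetup create data1 kept failing: %w", err)
	}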

tbg avatar Aug 26 '24 08:08 tbg

We have addressed this in newer versions with #123506. That was backported to 23.1 in #126842. @itsbilal do you want to take a look to determine if this is a new failure mode?

andrewbaptist avatar Aug 26 '24 14:08 andrewbaptist

@itsbilal here is the file created by

/pkg/cmd/roachtest/roachtestutil/disk_stall.go#L200

		s.c.Run(ctx, option.WithNodes(s.c.All()), "sudo bash -c 'ps aux; dmsetup status; mount; lsof'")
		s.f.Fatal(err)

And the device is /dev/sdb; the dm device data1 built on top of it is 253:0.

run_131337.075895949_n1-7_echo-0-sudo-blockdev: 13:13:37 cluster.go:2164: > echo "0 $(sudo blockdev --getsz /dev/sdb) linear /dev/sdb 0" | sudo dmsetup create data1
teamcity-16608098-1724564397-78-n7cpu2:[1 2 3 4 5 6 7]: echo "0 $(sudo blockdev --g...
   3: 	<err> COMMAND_PROBLEM: exit status 1
	device-mapper: reload ioctl on data1 (253:0) failed: Device or resource busy

The only mention of 253:0 I can see is

root 12492 0.0 0.0 0 0 ? I< 13:13 0:00 [kdmflush/253:0]

which I see six times, so it's on all nodes and probably not what failed the operation on n3.

sdb1 is more promising; there's this on n3:

root         130  0.3  0.0      0     0 ?        S    13:11   0:00 [jbd2/sdb1-8]
jbd2/sdb1   130                           root  cwd       DIR               8,17     4096          2 /
jbd2/sdb1   130                           root  rtd       DIR               8,17     4096          2 /
jbd2/sdb1   130                           root  txt   unknown                                        /proc/130/exe

which is not the case on the other nodes. Can you make something of that?

tbg avatar Aug 28 '24 14:08 tbg

@tbg jbd2 sounds like the ext4 journaller, right? That could explain a lot of the uncertainty around when this roachtest fails. It used to fail a lot more before https://github.com/cockroachdb/cockroach/pull/123506, so that's the fix the others are referring to.

Maybe we should update the roachtests that rely on disk-stalling behaviour to disable journalling (maybe with tune2fs -O ^has_journal /dev/sda) and see if it helps reduce these flakes. Writes will be slower but this roachtest cares less about that.
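
A rough sketch of what that could look like in the test setup, assuming (purely for illustration) that the data partition is /dev/sdb1 and is mounted at /mnt/data1, and noting that tune2fs refuses to remove the journal from a mounted filesystem:

	package disksetup

	import (
		"context"
		"fmt"
	)

	// disableExt4Journal sketches the steps to turn off jbd2 on the data disk
	// before the test creates its dm device. `run` again stands in for running
	// a shell command on every node; the device and mount point are assumptions.
	func disableExt4Journal(ctx context.Context, run func(context.Context, string) error) error {
		steps := []string{
			"sudo umount /mnt/data1",                 // the journal can only be removed while unmounted
			"sudo tune2fs -O ^has_journal /dev/sdb1", // drop the jbd2 journal
			"sudo e2fsck -fy /dev/sdb1",              // checking the fs afterwards is the usual recommendation
			"sudo mount /dev/sdb1 /mnt/data1",        // remount before cockroach starts
		}
		for _, cmd := range steps {
			if err := run(ctx, cmd); err != nil {
				return fmt.Errorf("%q failed: %w", cmd, err)
			}
		}
		return nil
	}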

itsbilal avatar Aug 28 '24 19:08 itsbilal

Based on the specified backports for linked PR #129864, I applied the following new label(s) to this issue: branch-release-23.2, branch-release-24.1, branch-release-24.2. Please adjust the labels as needed to match the branches actually affected by this issue, including adding any known older branches.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] avatar Sep 03 '24 15:09 blathers-crl[bot]