disk out of space errors causing lots of tests to fail
Upon updating my image to xfstests-amd64-202410151341, I am experiencing a huge uptick in failures due to disk out-of-space errors:
cat 525.out.bad
QA output created by 525
Silence is golden
/root/xfstests/tests/xfs/525: line 33: /xt-vdd/a: No space left on device
The other failing tests are also experiencing this issue.
Before this I was on a custom-built image, so I am not sure when this started occurring relative to the official image releases.
xfs/v4: 3459 tests, 650 failures, 1502 skipped, 17352 seconds
Failures: generic/054 generic/055 generic/260 generic/363 generic/365
generic/743 generic/754 xfs/014 xfs/049 xfs/073 xfs/077 xfs/078
xfs/080 xfs/084 xfs/107 xfs/119 xfs/148 xfs/149 xfs/164 xfs/165
xfs/206 xfs/216 xfs/217 xfs/222 xfs/250 xfs/252 xfs/253 xfs/259
xfs/260 xfs/263 xfs/284 xfs/289 xfs/290 xfs/292 xfs/294 xfs/299
xfs/304 xfs/305 xfs/512 xfs/513 xfs/514 xfs/515 xfs/522 xfs/523
xfs/524 xfs/525 xfs/526 xfs/557 xfs/602 xfs/605 xfs/606 xfs/613
xfs/614 xfs/629 xfs/630 xfs/631 xfs/632 xfs/789 xfs/803 xfs/804
xfs/806
Flaky: generic/001: 75% (3/4) generic/002: 50% (2/4)
generic/004: 75% (3/4) generic/005: 75% (3/4)
generic/006: 75% (3/4) generic/007: 75% (3/4)
generic/008: 75% (3/4) generic/009: 75% (3/4)
generic/011: 75% (3/4) generic/012: 75% (3/4)
generic/013: 75% (3/4) generic/014: 75% (3/4)
generic/016: 75% (3/4) generic/020: 75% (3/4)
generic/021: 75% (3/4) generic/022: 75% (3/4)
generic/023: 75% (3/4) generic/024: 75% (3/4)
generic/025: 75% (3/4) generic/026: 75% (3/4)
generic/028: 75% (3/4) generic/035: 75% (3/4)
generic/036: 50% (2/4) generic/037: 25% (1/4)
generic/053: 25% (1/4) generic/058: 75% (3/4)
generic/060: 75% (3/4) generic/061: 75% (3/4)
generic/063: 75% (3/4) generic/066: 25% (1/4)
generic/067: 75% (3/4) generic/070: 75% (3/4)
generic/072: 75% (3/4) generic/074: 75% (3/4)
generic/075: 75% (3/4) generic/077: 25% (1/4)
generic/078: 75% (3/4) generic/080: 75% (3/4)
generic/082: 75% (3/4) generic/086: 75% (3/4)
generic/087: 50% (2/4) generic/088: 50% (2/4)
generic/089: 50% (2/4) generic/091: 50% (2/4)
generic/092: 75% (3/4) generic/093: 75% (3/4)
generic/097: 75% (3/4) generic/099: 75% (3/4)
generic/100: 67% (2/3) generic/112: 67% (2/3)
generic/113: 67% (2/3) generic/114: 67% (2/3)
generic/123: 67% (2/3) generic/124: 67% (2/3)
generic/125: 67% (2/3) generic/126: 67% (2/3)
generic/127: 67% (2/3) generic/131: 67% (2/3)
generic/133: 67% (2/3) generic/184: 67% (2/3)
generic/192: 67% (2/3) generic/193: 67% (2/3)
generic/198: 67% (2/3) generic/207: 67% (2/3)
generic/208: 67% (2/3) generic/209: 67% (2/3)
generic/210: 67% (2/3) generic/211: 67% (2/3)
generic/212: 67% (2/3) generic/214: 67% (2/3)
generic/215: 67% (2/3) generic/221: 67% (2/3)
The overlay failures look a bit different:
QA output created by 009
rm: cannot remove '/xt-vdc/ovl-upper/symlink.288': Structure needs cleaning
rm: cannot remove '/xt-vdc/ovl-upper/symlink.320': Structure needs cleaning
Silence is golden
QA output created by 010
rm: cannot remove '/xt-vdc/ovl-upper/symlink.288': Structure needs cleaning
rm: cannot remove '/xt-vdc/ovl-upper/symlink.320': Structure needs cleaning
mount: /xt-vdc/ovl-mnt: mount point does not exist.
dmesg(1) may have more information after failed mount system call.
Silence is golden
QA output created by 011
rm: cannot remove '/xt-vdc/ovl-upper/symlink.288': Structure needs cleaning
rm: cannot remove '/xt-vdc/ovl-upper/symlink.320': Structure needs cleaning
mount: /xt-vdc/ovl-mnt: mount point does not exist.
dmesg(1) may have more information after failed mount system call.
getfattr: /xt-vdc/ovl-mnt/testdir: No such file or directory
Silence is golden
QA output created by 012
rm: cannot remove '/xt-vdc/ovl-upper/symlink.288': Structure needs cleaning
rm: cannot remove '/xt-vdc/ovl-upper/symlink.320': Structure needs cleaning
I'm not seeing this failure, so I decided to take a closer look.
I just tested using a test appliance built from the tip of xfstests-bld, and it's not failing for me. At line 33 of xfstests/tests/xfs/525 I see "rm -rf $def_cfgfile", but I'm guessing you have a slightly different version of xfstests, where line 33 corresponds to "cat > $def_cfgfile << ENDL", with $def_cfgfile being $TEST_DIR/a.
So my guess is that some previous test is leaving enough gunk behind in the test file system. What I would suggest is adding something like this before line 33:
df >> $seqres.full
ls -sFR $TEST_DIR >> $seqres.full
... and then see what shows up in /results/xfs/results-4k/xfs/525.full
That might give us a hint as to which test is causing the problem.
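If it helps, here's a minimal sketch of poking at the captured state after a failing run; the du target is an assumption based on /xt-vdd being the TEST_DIR mount in your logs:

# Show the df/ls state captured above:
tail -40 /results/xfs/results-4k/xfs/525.full
# Find the largest leftovers on TEST_DIR:
du -sh /xt-vdd/* 2>/dev/null | sort -rh | head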
Also, which kernel version were you testing, and what set of tests were you running? Does this show up easily with "gce-xfstests -c xfs/4k -g auto"?
I looked into this a bit more. There are several symptoms going on, but I think they might be related, so I'll put it all here.
It repros pretty reliably on xfstests-cloud's xfstests-amd64-202410151341 with kernel 5.15.171; full command line below.
Issue 1) I have been seeing shards missing from the reports.
For example:
TESTRUNID: ltm-20241212114452
KERNEL: kernel 5.15.171-xfstests #1 SMP Mon Nov 4 15:40:14 PST 2024 x86_64
CMDLINE: -c xfs/4k,xfs/1k,xfs/logdev,xfs/realtime,xfs/quota,xfs/v4,xfs/dax,xfs/adv,xfs/dirblock_8k,xfs:overlay/small -g auto -C 1 --image-project xfstests-cloud -I xfstests-amd64-202410151341 -X generic/754 --kernel gs://xfs-backports-bucket/5.15.171.deb
CPUS: 2
MEM: 7680
xfs/v4: 1153 tests, 71 failures, 489 skipped, 7523 seconds
Failures: generic/054 generic/055 generic/260 generic/363 generic/365
generic/645 generic/743 xfs/014 xfs/049 xfs/073 xfs/077 xfs/078
xfs/080 xfs/082 xfs/084 xfs/107 xfs/119 xfs/148 xfs/149 xfs/164
xfs/165 xfs/185 xfs/206 xfs/216 xfs/217 xfs/222 xfs/242 xfs/250
xfs/252 xfs/253 xfs/259 xfs/260 xfs/263 xfs/284 xfs/289 xfs/290
xfs/292 xfs/294 xfs/299 xfs/304 xfs/305 xfs/348 xfs/512 xfs/513
xfs/514 xfs/515 xfs/522 xfs/523 xfs/524 xfs/525 xfs/526 xfs/530
xfs/543 xfs/549 xfs/557 xfs/597 xfs/598 xfs/602 xfs/605 xfs/606
xfs/613 xfs/614 xfs/629 xfs/630 xfs/631 xfs/632 xfs/789 xfs/803
xfs/804 xfs/805 xfs/806
xfs:overlay/small: 828 tests, 21 failures, 1 errors, 352 skipped, 3824 seconds
Failures: generic/003 generic/012 generic/013 generic/016 generic/017
generic/021 generic/022 generic/029 generic/030 generic/031
generic/032 generic/033 generic/037 generic/112 generic/127
generic/263 generic/476 generic/616 generic/733 generic/750
overlay/019
Errors: generic/042
Totals: 1981 tests, 841 skipped, 92 failures, 1 errors, 10782s
FSTESTIMG: xfstests-cloud/xfstests-amd64-202410151341
FSTESTPRJ: xfs-backports
FSTESTVER: blktests f043065 (Thu, 20 Jun 2024 16:26:22 +0900)
FSTESTVER: fio fio-3.38 (Wed, 2 Oct 2024 12:53:13 -0600)
FSTESTVER: fsverity v1.6 (Wed, 20 Mar 2024 21:21:46 -0700)
FSTESTVER: ima-evm-utils v1.5 (Mon, 6 Mar 2023 07:40:07 -0500)
FSTESTVER: libaio libaio-0.3.108-82-gb8eadc9 (Thu, 2 Jun 2022 13:33:11 +0200)
FSTESTVER: ltp 20240930 (Mon, 30 Sep 2024 12:27:12 +0200)
FSTESTVER: quota v4.05-69-g68952f1 (Mon, 7 Oct 2024 15:45:56 -0400)
FSTESTVER: util-linux v2.40.2 (Thu, 4 Jul 2024 09:59:17 +0200)
FSTESTVER: xfsprogs v6.10.1-91-g42523142 (Tue, 8 Oct 2024 14:31:31 +0200)
FSTESTVER: xfstests-bld 71bcf39c (Tue, 15 Oct 2024 13:27:36 -0400)
FSTESTVER: xfstests v2024.10.14-11-g173cdbc07 (Tue, 15 Oct 2024 09:31:29 -0400)
FSTESTVER: zz_build-distro bookworm
FSTESTSET: -g auto
FSTESTEXC: generic/754
FSTESTOPT: count 1 fail_loop_count 0 aex
I looked into a missing shard, and the tests there are failing to start at all; it seems there are issues with the device mapper?
aa/runtests.log:
Removed "/etc/systemd/system/timers.target.wants/gce-finalize.timer".
CMDLINE: "--instance-name xfstests-ltm-20241212114452-aa --gce-zone us-east4-a --gs-bucket xfs-backports-bucket --kernel gs://xfs-backports-bucket/5.15.171.deb --bucket-subdir results --no-email --no-vm-timeout -c xfs/4k --arch amd64 -g auto -C 1 --image-project xfstests-cloud -I xfstests-amd64-202410151341 -X generic/754"
FSTESTIMG: xfstests-cloud/xfstests-amd64-202410151341
FSTESTPRJ: xfs-backports
KERNEL: kernel 5.15.171-xfstests #1 SMP Mon Nov 4 15:40:14 PST 2024 x86_64
FSTESTVER: blktests f043065 (Thu, 20 Jun 2024 16:26:22 +0900)
FSTESTVER: fio fio-3.38 (Wed, 2 Oct 2024 12:53:13 -0600)
FSTESTVER: fsverity v1.6 (Wed, 20 Mar 2024 21:21:46 -0700)
FSTESTVER: ima-evm-utils v1.5 (Mon, 6 Mar 2023 07:40:07 -0500)
FSTESTVER: libaio libaio-0.3.108-82-gb8eadc9 (Thu, 2 Jun 2022 13:33:11 +0200)
FSTESTVER: ltp 20240930 (Mon, 30 Sep 2024 12:27:12 +0200)
FSTESTVER: quota v4.05-69-g68952f1 (Mon, 7 Oct 2024 15:45:56 -0400)
FSTESTVER: util-linux v2.40.2 (Thu, 4 Jul 2024 09:59:17 +0200)
FSTESTVER: xfsprogs v6.10.1-91-g42523142 (Tue, 8 Oct 2024 14:31:31 +0200)
FSTESTVER: xfstests-bld 71bcf39c (Tue, 15 Oct 2024 13:27:36 -0400)
FSTESTVER: xfstests v2024.10.14-11-g173cdbc07 (Tue, 15 Oct 2024 09:31:29 -0400)
FSTESTVER: zz_build-distro bookworm
FSTESTCFG: "xfs/4k"
FSTESTSET: "-g auto"
FSTESTEXC: "generic/754"
FSTESTOPT: "count 1 fail_loop_count 0 aex"
MNTOPTS: ""
CPUS: "2"
MEM: "7680"
DMI_MEM: 8 GB (Max capacity)
PARAM_MEM: 7680 (restricted by cmdline)
GCE ID: "5676890524304704375"
GCE ZONE: "us-east4-a"
MACHINE TYPE: "e2-standard-2"
TESTRUNID: ltm-20241212114452-aa
total used free shared buff/cache available
Mem: 6436 248 5998 0 274 6188
Swap: 0 0 0
meta-data=/dev/mapper/xt-vdd isize=512 agcount=4, agsize=327680 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1 nrext64=1
= exchange=0
data = bsize=4096 blocks=1310720, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=0
log =internal log bsize=4096 blocks=16384, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
Discarding blocks...Done.
mount: /xt-vdd: wrong fs type, bad option, bad superblock on /dev/mapper/xt-vdd, missing codepage or helper program, or other error.
dmesg(1) may have more information after failed mount system call.
mount: /xt-vdd: wrong fs type, bad option, bad superblock on /dev/mapper/xt-vdd, missing codepage or helper program, or other error.
dmesg(1) may have more information after failed mount system call.
Failed to run ./check -n -g auto
common/rc: retrying test device mount with external set
common/rc: could not mount /dev/mapper/xt-vdd on /xt-vdd
mount: /xt-vdd: wrong fs type, bad option, bad superblock on /dev/mapper/xt-vdd, missing codepage or helper program, or other error.
dmesg(1) may have more information after failed mount system call.
mount: /xt-vdd: wrong fs type, bad option, bad superblock on /dev/mapper/xt-vdd, missing codepage or helper program, or other error.
dmesg(1) may have more information after failed mount system call.
Exit status 1
No results file found in /results
-------------------- Summary report
Truncating test artifacts in /results to 31k
Issue 2) I have also been repeatedly seeing the pattern above, where after a certain point in the test run all subsequent tests seem to fail.
Issue 3) The xfs:overlay/small results included the serial output, which contained a ton of logspam. The file was 42M, with 94% of the lines being:
/usr/local/lib/gce-finalize-wait: line 6: /usr/bin/sleep: Input/output error
I believe gce-finalize-wait is a service that waits for the VM to complete its test run:
#!/bin/bash
touch /run/gce-finalize-wait
while test -f /run/gce-finalize-wait
do
    sleep 1
done
exec /usr/local/lib/gce-finalize
Re: the suggestion to add the df and ls commands, the xfs/525 out-of-space failure in particular doesn't repro reliably.
On Thu, Dec 12, 2024 at 03:35:11PM -0800, Leah Rumancik wrote:
I looked into a missing shard, and the tests there are failing to start at all; it seems there are issues with the device mapper?
I looked at some of these failures on my end, and it looks like they're caused by 5.15 missing certain XFS features?
From the top-level runtests.log:
============SHARD ap============
============CONFIG: xfs/adv
...
meta-data=/dev/mapper/xt-vdd isize=512 agcount=4, agsize=327680 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1 nrext64=1
= exchange=0
data = bsize=4096 blocks=1310720, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=0
log =internal log bsize=4096 blocks=16384, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
Discarding blocks...Done.
mount: /xt-vdd: wrong fs type, bad option, bad superblock on /dev/mapper/xt-vdd, missing codepage or helper program, or other error.
dmesg(1) may have more information after failed mount system call.
But if we then look at ap/syslog, we see:
[ 40.541369] XFS (dm-2): Superblock has unknown incompatible features (0x20) enabled.
[ 40.549319] XFS (dm-2): Filesystem cannot be safely mounted by this kernel.
So we may need to exclude certain tests from running, or modify the mkfs.xfs arguments for certain configs, if the kernel version is sufficiently old...
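For what it's worth, incompat bit 0x20 is NREXT64 (large per-inode extent counters), which matches the nrext64=1 in the mkfs output above and which 5.15 predates. A minimal sketch of what a 5.15-compatible mkfs override might look like, assuming an xfsprogs new enough to know the option (the device path is just the one from the logs above):

# nrext64=0 keeps mkfs from setting the 0x20 incompat bit that
# pre-5.19 kernels refuse to mount:
mkfs.xfs -f -i nrext64=0 /dev/mapper/xt-vdd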
Issue 2) I have also been repeatedly seeing the pattern above, where after a certain point in the test run all subsequent tests seem to fail.
Looking at one such example, I'm seeing the following:
--- tests/xfs/546.out 2024-11-25 21:12:04.000000000 -0500
+++ /results/xfs/results-v4/xfs/546.out.bad 2024-12-13 12:56:16.486394498 -0500
@@ -1,2 +1,3 @@
QA output created by 546
+touch: cannot touch '/xt-vdd/2854732.xfs_io': No space left on device
syncfs: Input/output error
I'm also seeing a lot of this:
xfs/082 [12:16:31]umount: /xt-vdd: target is busy.
_check_xfs_filesystem: filesystem on /dev/mapper/xt-vdd has dirty log (see /results/xfs/results-v4/xfs/082.full for details)
_check_xfs_filesystem: filesystem on /dev/mapper/xt-vdd is inconsistent (r) (see /results/xfs/results-v4/xfs/082.full for details)
Trying to repair broken TEST_DEV file system
mount: /xt-vdd: /dev/mapper/xt-vdd already mounted on /xt-vdd.
dmesg(1) may have more information after failed mount system call.
_repair_test_fs: failed, err=1 (see /results/xfs/results-v4/xfs/082.full for details)
mount: /xt-vdd: /dev/mapper/xt-vdd already mounted on /xt-vdd.
dmesg(1) may have more information after failed mount system call.
[12:16:34]- output mismatch (see /results/xfs/results-v4/xfs/082.out.bad)
    --- tests/xfs/082.out  2024-11-25 21:12:04.000000000 -0500
    +++ /results/xfs/results-v4/xfs/082.out.bad  2024-12-13 12:16:34.069268192 -0500
    @@ -1,3 +1,6 @@
     QA output created by 082
    -bad magic number
    -xfs_copy: couldn't read superblock, error=22
    +/xt-vdd/082.a: No space left on device
    +/xt-vdd/082.b: No space left on device
    +xfs_copy: couldn't open source "/xt-vdd/082.a"
    +Aborting XFS copy - reason: No such file or directory
    ...
    (Run 'diff -u /root/xfstests/tests/xfs/082.out /results/xfs/results-v4/xfs/082.out.bad' to see the entire diff)
So we seem to have two things going on: (a) the XFS file system on TEST_DEV is getting corrupted, and (b) we're getting ENOSPC errors. Worse, (c) xfs_repair is apparently unable to fix the file system, and we're not aborting the test run when that happens?
I'm not sure whether (a) or (b) is the chicken or the egg: which one is the cause, and which one is the symptom? Is there an XFS bug in 5.15 which causes the file system to get corrupted on an ENOSPC error? Or is the unrepaired filesystem inconsistency causing the ENOSPC errors?
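One way to try to untangle the ordering, as a sketch using the device paths from this thread: interrupt the run at the first ENOSPC and do a read-only check before anything attempts a repair:

# xfs_repair -n is no-modify mode; a nonzero exit status means the
# corruption was already present when ENOSPC first hit:
umount /xt-vdd
xfs_repair -n /dev/mapper/xt-vdd; echo "xfs_repair -n exit: $?"
# Then remount and see whether the file system is genuinely full:
mount /dev/mapper/xt-vdd /xt-vdd && df -h /xt-vdd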
Issue 3) The xfs:overlay/small results included the serial output, which contained a ton of logspam. The file was 42M, with 94% of the lines being:
/usr/local/lib/gce-finalize-wait: line 6: /usr/bin/sleep: Input/output error
So this is weird. /usr/bin/sleep shouldn't be accessing TEST_DEV or SCRATCH_DEV. Neither should /usr/local/lib/gce-finalize-wait for that matter. Are there perchance any errors in the syslog or dmesg file referencing the root file system and ext4? And if so, I wonder if some kernel bug triggered by a test is actually causing a wild pointer dereference which is corrupting something associated with /dev/sda or the root file system?!?
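A quick way to check, as a sketch (ap/syslog is the per-shard log from the results above; the grep patterns are just my guess at what root-device trouble would look like):

# Look for ext4/root-device errors around when the logspam starts:
grep -nE 'EXT4-fs|sd[a-z]|I/O error|Buffer I/O' ap/syslog | head -50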
Final question --- how much do we care about 5.15? 6.12 will likely soon be the new LTS kernel, and I've been planning on only worrying about 6.1, 6.6, and 6.12 for ext4 LTS testing come the new year.
I've taken a look at test runs for 6.1, and while it does have a bunch of problems for XFS, we're at least not seeing TEST_DEV corruptions on /dev/mapper/xt-vdd. So maybe let's focus on stabilizing 6.1 and 6.6, and give up on 5.15, at least for now?
KERNEL: kernel 6.1.120-rc1-xfstests-00773-g9f320894b9c2 #24 SMP PREEMPT_DYNAMIC Thu Dec 12 10:03:19 EST 2024 x86_64
xfs/4k: 1160 tests, 36 failures, 184 skipped, 20066 seconds
Failures: generic/083 generic/091 generic/127 generic/263 generic/269
generic/270 generic/363 generic/365 generic/447 generic/476
generic/616 generic/648 generic/743 generic/750 generic/754
generic/757 xfs/104 xfs/273 xfs/305 xfs/348 xfs/442 xfs/538 xfs/597
xfs/598 xfs/609 xfs/629 xfs/630 xfs/631 xfs/632 xfs/806
Flaky: generic/075: 40% (2/5) generic/112: 20% (1/5)
generic/455: 20% (1/5) generic/475: 80% (4/5)
generic/733: 80% (4/5) xfs/168: 60% (3/5)
- Ted