
blktests zbd/009 failure

yizhanglinux opened this issue 1 year ago • 6 comments

Recently I found that zbd/009 always fails, and after reverting commit [1] the test passes. The test fails with "No space left on device", but df -h shows that there is still plenty of free space on the disk. Could you help check it when you have a chance? Thanks.

[1] 951ad8206fe04ef4708049e7a5c0db6947a44c51
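
For reference, the revert test can be reproduced with something like the following (a sketch, assuming [1] is a commit in the blktests tree):

# cd /root/blktests
# git revert 951ad8206fe04ef4708049e7a5c0db6947a44c51
# ./check zbd/009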

[2]

zbd/009 (test gap zone support with BTRFS)                   [failed]
    runtime  9.681s  ...  9.705s
    --- tests/zbd/009.out	2024-11-19 02:15:02.202488258 +0000
    +++ /root/blktests/results/nodev/zbd/009.out.bad	2024-11-19 05:59:26.815119782 +0000
    @@ -1,2 +1,2 @@
     Running zbd/009
    -Test complete
    +Test failed

# cat results/nodev/zbd/009.full
btrfs-progs v6.11
See https://btrfs.readthedocs.io for more information.

Resetting device zones /dev/sda (256 zones) ...
NOTE: several default settings have changed in version 5.15, please make sure
      this does not affect your deployments:
      - DUP for metadata (-m dup)
      - enabled no-holes (-O no-holes)
      - enabled free-space-tree (-R free-space-tree)

Label:              (null)
UUID:               88a4aefd-8be9-4e0c-9ee2-ad1a3f20bee6
Node size:          16384
Sector size:        4096	(CPU page size: 4096)
Filesystem size:    1.00GiB
Block group profiles:
  Data:             single            4.00MiB
  Metadata:         DUP               4.00MiB
  System:           DUP               4.00MiB
SSD detected:       yes
Zoned device:       yes
  Zone size:        4.00MiB
Features:           extref, skinny-metadata, no-holes, free-space-tree, zoned
Checksum:           crc32c
Number of devices:  1
Devices:
   ID        SIZE  ZONES  PATH
    1     1.00GiB    256  /dev/sda

fio: io_u error on file /root/blktests/results/tmpdir.zbd.009.vr4/mnt/verify.0.0: No space left on device: write offset=901120, buflen=4096
fio: io_u error on file /root/blktests/results/tmpdir.zbd.009.vr4/mnt/verify.0.0: No space left on device: write offset=860160, buflen=4096
fio: io_u error on file /root/blktests/results/tmpdir.zbd.009.vr4/mnt/verify.0.0: No space left on device: write offset=425984, buflen=4096
fio: io_u error on file /root/blktests/results/tmpdir.zbd.009.vr4/mnt/verify.0.0: No space left on device: write offset=131072, buflen=4096
fio: io_u error on file /root/blktests/results/tmpdir.zbd.009.vr4/mnt/verify.0.0: No space left on device: write offset=61440, buflen=4096
fio: io_u error on file /root/blktests/results/tmpdir.zbd.009.vr4/mnt/verify.0.0: No space left on device: write offset=331776, buflen=4096
fio: io_u error on file /root/blktests/results/tmpdir.zbd.009.vr4/mnt/verify.0.0: No space left on device: write offset=77824, buflen=4096
fio: io_u error on file /root/blktests/results/tmpdir.zbd.009.vr4/mnt/verify.0.0: No space left on device: write offset=163840, buflen=4096
fio: io_u error on file /root/blktests/results/tmpdir.zbd.009.vr4/mnt/verify.0.0: No space left on device: write offset=69632, buflen=4096
fio: io_u error on file /root/blktests/results/tmpdir.zbd.009.vr4/mnt/verify.0.0: No space left on device: write offset=999424, buflen=4096
fio: io_u error on file /root/blktests/results/tmpdir.zbd.009.vr4/mnt/verify.0.0: No space left on device: write offset=548864, buflen=4096
fio: io_u error on file /root/blktests/results/tmpdir.zbd.009.vr4/mnt/verify.0.0: No space left on device: write offset=278528, buflen=4096
fio: io_u error on file /root/blktests/results/tmpdir.zbd.009.vr4/mnt/verify.0.0: No space left on device: write offset=770048, buflen=4096
fio: io_u error on file /root/blktests/results/tmpdir.zbd.009.vr4/mnt/verify.0.0: No space left on device: write offset=311296, buflen=4096
fio: io_u error on file /root/blktests/results/tmpdir.zbd.009.vr4/mnt/verify.0.0: No space left on device: write offset=327680, buflen=4096
fio: io_u error on file /root/blktests/results/tmpdir.zbd.009.vr4/mnt/verify.0.0: No space left on device: write offset=1003520, buflen=4096
fio exited with status 1
fio: verification read phase will never start because write phase uses all of runtime
4;fio-3.37;verify;0;28;715776;353819;88454;2023;3;531;9.074568;5.894974;1;1885;164.221348;60.053806;1.000000%=28;5.000000%=87;10.000000%=99;20.000000%=116;30.000000%=136;40.000000%=158;50.000000%=183;60.000000%=191;70.000000%=191;80.000000%=193;90.000000%=199;95.000000%=207;99.000000%=305;99.500000%=452;99.900000%=815;99.950000%=897;99.990000%=1335;0%=0;0%=0;0%=0;12;1889;173.295916;61.492700;0;0;0.000000%;0.000000;0.000000;716160;120140;30037;5961;5;11722;26.838560;62.466622;0;12975;486.421405;275.659557;1.000000%=87;5.000000%=175;10.000000%=218;20.000000%=309;30.000000%=374;40.000000%=403;50.000000%=456;60.000000%=514;70.000000%=552;80.000000%=610;90.000000%=741;95.000000%=905;99.000000%=1302;99.500000%=1531;99.900000%=2244;99.950000%=2539;99.990000%=8716;0%=0;0%=0;0%=0;18;12982;513.208844;280.993445;71152;93453;74.503340%;89508.312500;5379.433821;0;0;0;0;0;0;0.000000;0.000000;0;0;0.000000;0.000000;1.000000%=0;5.000000%=0;10.000000%=0;20.000000%=0;30.000000%=0;40.000000%=0;50.000000%=0;60.000000%=0;70.000000%=0;80.000000%=0;90.000000%=0;95.000000%=0;99.000000%=0;99.500000%=0;99.900000%=0;99.950000%=0;99.990000%=0;0%=0;0%=0;0%=0;0;0;0.000000;0.000000;0;0;0.000000%;0.000000;0.000000;8.401152%;33.078753%;248378;0;21;0.4%;0.8%;1.6%;3.1%;94.1%;0.0%;0.0%;0.18%;0.04%;0.01%;0.18%;0.69%;4.46%;49.69%;23.09%;16.71%;3.18%;1.69%;0.06%;0.01%;0.01%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%

# df -h
Filesystem                                            Size  Used Avail Use% Mounted on
/dev/sda                                              1.0G  7.1M  982M   1% /root/blktests/results/tmpdir.zbd.009.pAG/mnt

yizhanglinux avatar Nov 19 '24 06:11 yizhanglinux

@yizhanglinux Thanks for the report. This failure is interesting. I've never seen it before, and the symptom looks weird.

Question: when you revert 951ad82, do you see the fio io_u errors in the full file? I guess the errors would be reported in the full file regardless of the revert. If this guess is correct, the revert just hides the failure.
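
A quick way to check is to count the io_u error lines in the full file, e.g. (assuming the result path shown above):

# grep -c "io_u error" results/nodev/zbd/009.full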

For further debugging, a more detailed fio report is required. Could you apply the change below to the common/fio file and then run zbd/009? With this change, the full file will record a detailed fio debug log.

diff --git a/common/fio b/common/fio
index b9ea087..f7a0c41 100644
--- a/common/fio
+++ b/common/fio
@@ -174,7 +174,7 @@ _fio_perf() {
 # passed --runtime will override the configured $TIMEOUT, which is useful for
 # tests that should run for a specific amount of time.
 _run_fio() {
-       local args=("--output=$TMPDIR/fio_perf" "--output-format=terse" "--terse-version=4" "--group_reporting=1")
+       local args=("--group_reporting=1" "--debug=io,verify")
 
        if [[ "${TIMEOUT:-}" ]]; then
                args+=("--runtime=$TIMEOUT")

kawasaki avatar Nov 20 '24 11:11 kawasaki

Yes, the "fio: io_u error" messages can be seen in the full file after the revert.

Since the 009.full file is so large, I attached a file containing its last 1000 lines; please help check it.

# du -sh results/nodev/zbd/009.full
254M	results/nodev/zbd/009.full

009.txt
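
(For reference, the attachment was trimmed with something along the lines of:)

# tail -n 1000 results/nodev/zbd/009.full > 009.txt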

yizhanglinux avatar Nov 20 '24 13:11 yizhanglinux

Added the CKI tracking issue: https://datawarehouse.cki-project.org/issue/3257

yizhanglinux avatar Nov 21 '24 00:11 yizhanglinux

@yizhanglinux Thanks for sharing the fio log. TL;DR: I think this failure is a known issue, and fixes are planned.

I noticed that this ENOSPC looks like the known issue that @naota is chasing: zoned btrfs is known to cause ENOSPC when the write speed is faster than the reclaim speed. Based on this understanding, I tried to recreate the failure by 1) disabling kernel debug options to speed up writes and 2) extending the fio runtime to increase the reclaim size, and I succeeded in recreating the failure.
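
As the _run_fio() snippet above shows, blktests passes $TIMEOUT to fio as --runtime, so the fio runtime can be extended by setting TIMEOUT, e.g. in the blktests config file (the value 300 here is just an example):

# echo 'TIMEOUT=300' >> config
# ./check zbd/009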

Recently @naota posted a fix patch, and I tried it. It avoided the failure under some conditions, but ENOSPC failures are still observed with longer fio runtimes. He will post more ENOSPC fix patches, and I expect that they will avoid the failure of this test case.

kawasaki avatar Nov 25 '24 09:11 kawasaki

Good to know, thanks for the update.

yizhanglinux avatar Nov 25 '24 12:11 yizhanglinux

/cc

zhijianli88 avatar Dec 03 '24 03:12 zhijianli88