ZTS: raidz_expand_001_pos data errors on slow systems
System information
| Type | Version/Name |
|---|---|
| Distribution Name | Fedora |
| Distribution Version | 40 |
| Kernel Version | 6.9.9-200 |
| Architecture | x86_64 |
| OpenZFS Version | master (bd949b10bed3d99e3b40249d9c8d74a0b4304774) |
Describe the problem you're observing
I've been able to reproduce a data corruption failure in functional/raidz/raidz_expand_001_pos on very slow systems. The test basically does the following (a rough shell sketch follows the list):
1. Create a raidz[1,2,3] pool with 5 disks.
2. Add a 6th disk to test raidz expansion.
3. Artificially pause the raidz expansion operation.
4. Zero out the maximum number of disks (1, 2, or 3, matching the parity level) that the raidz group can tolerate without data loss.
5. Scrub the pool.
6. Check that there are no errors.
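For anyone not familiar with the test, a rough manual equivalent of those steps is sketched below. This is only my approximation, not the actual raidz_expand_001_pos.ksh; in particular, using the raidz_expand_max_reflow_bytes module parameter to hold the expansion back is an assumption on my part (the test drives this through the ZTS tunable helpers), and the sizes and paths are illustrative.

```sh
# Sketch of the test flow above with file-backed vdevs (not the real test script).
truncate -s 256M /var/tmp/dev-{0..5}
zpool create testpool raidz2 /var/tmp/dev-{0..4}        # step 1: 5-wide raidz2
# (the real test also writes a bunch of small files into several datasets here)

# steps 2-3: throttle the reflow so the expansion stays "in progress", then attach the 6th disk
echo 8192 > /sys/module/zfs/parameters/raidz_expand_max_reflow_bytes
zpool attach testpool raidz2-0 /var/tmp/dev-5

# step 4: with raidz2, two disks can be wiped without data loss; skip the label/boot
# region at the front and the trailing labels so the vdevs stay ONLINE
dd if=/dev/zero of=/var/tmp/dev-2 bs=1M seek=4 count=250 conv=notrunc
dd if=/dev/zero of=/var/tmp/dev-3 bs=1M seek=4 count=250 conv=notrunc

zpool scrub -w testpool                                  # step 5: scrub and wait for it
zpool status -v testpool | grep "No known data errors"   # step 6: expect a clean status
```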
The test is failing at step 6 because there are data errors:
12:20:09.47 ERROR: check_pool_status testpool errors No known data errors exited 1
12:20:10.98 pool: testpool
12:20:10.98 state: ONLINE
12:20:10.98 status: One or more devices has experienced an error resulting in data
12:20:10.98 corruption. Applications may be affected.
12:20:10.98 action: Restore the file in question if possible. Otherwise restore the
12:20:10.98 entire pool from backup.
12:20:10.98 see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
12:20:10.98 scan: scrub repaired 37.7M in 00:00:38 with 1 errors on Wed Aug 7 12:20:09 2024
12:20:10.98 expand: expansion of raidz2-0 in progress since Wed Aug 7 12:18:13 2024
12:20:10.98 62.9M / 129M copied at 550K/s, 48.80% done, 00:02:02 to go
12:20:10.98 config:
12:20:10.98
12:20:10.98 NAME STATE READ WRITE CKSUM
12:20:10.98 testpool ONLINE 0 0 0
12:20:10.98 raidz2-0 ONLINE 0 0 0
12:20:10.98 /var/tmp/dev-0 ONLINE 0 0 8
12:20:10.98 /var/tmp/dev-1 ONLINE 0 0 6
12:20:10.98 /var/tmp/dev-2 ONLINE 0 0 38.0K
12:20:10.98 /var/tmp/dev-3 ONLINE 0 0 37.4K
12:20:10.98 /var/tmp/dev-4 ONLINE 0 0 8
12:20:10.98 /var/tmp/dev-5 ONLINE 0 0 8
12:20:10.98 /var/tmp/dev-6 ONLINE 0 0 6
12:20:10.98
12:20:10.98 errors: 1 data errors, use '-v' for a list <<<<< SHOULD NOT SEE ERRORS HERE
Here's the dbgmsg snippet from where it fails to reconstruct:
12:20:09.87 1723058408 ffff91cf003c8000 vdev_raidz.c:2053:vdev_raidz_reconstruct_row(): reconstruct(rm=ffff91cfdd9f3000 col=4 devid=6 offset=30d3c00 error=0)
12:20:09.87 1723058408 ffff91cf003c8000 vdev_raidz.c:2053:vdev_raidz_reconstruct_row(): reconstruct(rm=ffff91cfdd9f3000 col=5 devid=0 offset=30d3e00 error=0)
12:20:09.87 1723058408 ffff91cf003c8000 vdev_raidz.c:2912:raidz_reconstruct(): raidz_reconstruct_expanded(row=3)
12:20:09.87 1723058408 ffff91cf003c8000 vdev_raidz.c:2957:raidz_reconstruct(): simulating failure of col 0 devidx 4
12:20:09.87 1723058408 ffff91cf003c8000 vdev_raidz.c:2957:raidz_reconstruct(): simulating failure of col 2 devidx 0
12:20:09.87 1723058408 ffff91cf003c8000 vdev_raidz.c:2957:raidz_reconstruct(): simulating failure of col 5 devidx 3
12:20:09.87 1723058408 ffff91cf003c8000 vdev_raidz.c:2970:raidz_reconstruct(): reconstruction not possible; too many failures
12:20:09.87 1723058408 ffff91cf003c8000 vdev_raidz.c:3182:vdev_raidz_combrec(): reconstruction failed for all num_failures
12:20:09.87 1723058408 ffff91cf62aed200 dsl_scan.c:4535:dsl_scan_sync(): scan issued 9150 blocks for testpool (14 segs) in 35014ms (avg_block_size = 15269, avg_seg_size = 9995190)
12:20:09.87 1723058408 ffff91cf62aed200 dsl_scan.c:822:dsl_scan_sync_state(): finish scan checkpoint for testpool
12:20:09.87 1723058408 ffff91cf62aed200 metaslab.c:3984:metaslab_flush(): flushing: txg 119, spa testpool, vdev_id 0, ms_id 2, unflushed_allocs 58368, unflushed_frees 53760, appended 112 bytes
12:20:09.87 1723058409 ffff91cf62aed200 dsl_scan.c:4427:dsl_scan_sync(): finish scan clearing for testpool
12:20:09.87 1723058409 ffff91cf62aed200 dsl_scan.c:4546:dsl_scan_sync(): scan issuing complete txg 120 for testpool
12:20:09.87 1723058409 ffff91cf62aed200 spa_history.c:306:spa_history_log_sync(): txg 120 scan done errors=2
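For reference, the dbgmsg output above can be captured with something like the following; the module parameter and kstat paths are the usual Linux ones and may differ on other platforms.

```sh
# Grab the ZFS debug log around the failing scrub (Linux paths assumed).
echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable    # make sure the in-memory log is on
echo 0 > /proc/spl/kstat/zfs/dbgmsg                      # clear old messages
zpool scrub -w testpool
grep -E 'raidz_reconstruct|vdev_raidz_combrec|vdev_raidz_reconstruct_row' /proc/spl/kstat/zfs/dbgmsg
```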
Another example, this time with zpool status -vs:
13:11:11.15 pool: testpool
13:11:11.15 state: ONLINE
13:11:11.15 status: One or more devices has experienced an error resulting in data
13:11:11.15 corruption. Applications may be affected.
13:11:11.15 action: Restore the file in question if possible. Otherwise restore the
13:11:11.15 entire pool from backup.
13:11:11.15 see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
13:11:11.15 scan: scrub repaired 37.2M in 00:00:38 with 1 errors on Wed Aug 7 13:11:09 2024
13:11:11.15 expand: expansion of raidz2-0 in progress since Wed Aug 7 13:09:05 2024
13:11:11.15 73.0M / 129M copied at 593K/s, 56.57% done, 00:01:36 to go
13:11:11.15 config:
13:11:11.15
13:11:11.15 NAME STATE READ WRITE CKSUM SLOW
13:11:11.15 testpool ONLINE 0 0 0 -
13:11:11.15 raidz2-0 ONLINE 0 0 0 -
13:11:11.15 /var/tmp/dev-0 ONLINE 0 0 6 0
13:11:11.15 /var/tmp/dev-1 ONLINE 0 0 6 0
13:11:11.15 /var/tmp/dev-2 ONLINE 0 0 37.5K 0
13:11:11.15 /var/tmp/dev-3 ONLINE 0 0 37.0K 0
13:11:11.15 /var/tmp/dev-4 ONLINE 0 0 8 0
13:11:11.15 /var/tmp/dev-5 ONLINE 0 0 6 0
13:11:11.15 /var/tmp/dev-6 ONLINE 0 0 6 0
13:11:11.15
13:11:11.15 errors: Permanent errors have been detected in the following files:
13:11:11.15
13:11:11.15 /testpool/fs3/1/30
Describe how to reproduce the problem
We were hitting this pretty frequently while testing out the new "run ZTS in a QEMU VM on a GitHub runner" PR (https://github.com/openzfs/zfs/pull/15838). Those runners have fewer CPU resources than the buildbot ones, and it's that slowness that seems to trigger the bug. We were able to hit it there on basically every distro: AlmaLinux 8-9, Fedora 39-40, Ubuntu 20, 22, and 23.
I was able to reproduce this locally with:
- A Fedora 40 VM with 2 vCPUs and 8 GB RAM.
- The QEMU process throttled on the host to about 10% CPU with:
sudo cpulimit -l 10 -p [QEMU PID HERE]
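With the VM throttled like that, one way to keep hammering just this test is to loop it through the ZTS runner until a run reports FAIL. The zfs-tests.sh invocation below assumes an in-tree build; adjust the checkout path as needed.

```sh
cd ~/zfs    # zfs source checkout (illustrative path)
i=0
while :; do
    i=$((i + 1))
    ./scripts/zfs-tests.sh -t tests/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh \
        | tee "run-$i.log"
    grep -q '\[FAIL\]' "run-$i.log" && { echo "failed on run $i"; break; }
done
```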
Even then it didn't hit the failure every time. Just to give you an idea:
Test: /home/hutter/zfs/tests/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [03:55] [FAIL]
Test: /home/hutter/zfs/tests/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [01:32] [PASS]
Test: /home/hutter/zfs/tests/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [03:18] [PASS]
Test: /home/hutter/zfs/tests/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [04:58] [PASS]
Test: /home/hutter/zfs/tests/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [04:09] [FAIL]
This raidz_expand_001_pos failure is the last major blocker before we can merge https://github.com/openzfs/zfs/pull/15838.
Note that the test randomly chooses the parity level for each run, so the runs that fail could be using a different parity than the ones that pass. Parity=2 looks common to the failures I have observed.
The corruption always seems to be a single 8K block containing unexpected zeros on 512-byte boundaries.
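In case it's useful, here's roughly how that pattern could be confirmed, assuming the affected file is still readable (or that the suspect block has been dumped some other way, e.g. with zdb -R). The path is the one reported by zpool status -v above, and since the expected contents are non-zero here, any all-zero chunk is suspect.

```sh
# Scan a file in 512-byte chunks and report the chunks that are entirely zero.
f=/testpool/fs3/1/30    # affected file from the zpool status -v output
size=$(stat -c %s "$f")
for ((off = 0; off < size; off += 512)); do
    if cmp -s -n 512 <(dd if="$f" bs=512 skip=$((off / 512)) count=1 2>/dev/null) /dev/zero; then
        echo "all-zero 512-byte chunk at offset $off"
    fi
done
```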