ZTS: raidz_expand_001_pos data errors on slow systems
System information
| Type | Version/Name |
|---|---|
| Distribution Name | Fedora |
| Distribution Version | 40 |
| Kernel Version | 6.9.9-200 |
| Architecture | x86_64 |
| OpenZFS Version | master (bd949b10bed3d99e3b40249d9c8d74a0b4304774) |
Describe the problem you're observing
I've been able to reproduce a data corruption failure in functional/raidz/raidz_expand_001_pos on very slow systems. The test basically does the following (a rough shell sketch follows the list):
1. Create a raidz[1,2,3] pool with 5 disks.
2. Add a 6th disk to test raidz expansion.
3. Artificially pause the raidz expansion operation.
4. Zero out the maximum number of disks (1, 2, or 3, matching the parity level) that the raidz group can tolerate without data loss.
5. Scrub the pool.
6. Check that there are no errors.
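For anyone not familiar with the test, a rough manual equivalent of those steps is sketched below. This is only my approximation, not the actual raidz_expand_001_pos.ksh; in particular, using the raidz_expand_max_reflow_bytes module parameter to hold the expansion back is an assumption on my part (the test drives this through the ZTS tunable helpers), and the sizes and paths are illustrative.

```sh
# Sketch of the test flow above with file-backed vdevs (not the real test script).
truncate -s 256M /var/tmp/dev-{0..5}
zpool create testpool raidz2 /var/tmp/dev-{0..4}        # step 1: 5-wide raidz2
# (the real test also writes a bunch of small files into several datasets here)

# steps 2-3: throttle the reflow so the expansion stays "in progress", then attach the 6th disk
echo 8192 > /sys/module/zfs/parameters/raidz_expand_max_reflow_bytes
zpool attach testpool raidz2-0 /var/tmp/dev-5

# step 4: with raidz2, two disks can be wiped without data loss; skip the label/boot
# region at the front and the trailing labels so the vdevs stay ONLINE
dd if=/dev/zero of=/var/tmp/dev-2 bs=1M seek=4 count=250 conv=notrunc
dd if=/dev/zero of=/var/tmp/dev-3 bs=1M seek=4 count=250 conv=notrunc

zpool scrub -w testpool                                  # step 5: scrub and wait for it
zpool status -v testpool | grep "No known data errors"   # step 6: expect a clean status
```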
The test is failing at step 6 because there are data errors:
12:20:09.47 ERROR: check_pool_status testpool errors No known data errors exited 1
12:20:10.98 pool: testpool
12:20:10.98 state: ONLINE
12:20:10.98 status: One or more devices has experienced an error resulting in data
12:20:10.98 corruption. Applications may be affected.
12:20:10.98 action: Restore the file in question if possible. Otherwise restore the
12:20:10.98 entire pool from backup.
12:20:10.98 see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
12:20:10.98 scan: scrub repaired 37.7M in 00:00:38 with 1 errors on Wed Aug 7 12:20:09 2024
12:20:10.98 expand: expansion of raidz2-0 in progress since Wed Aug 7 12:18:13 2024
12:20:10.98 62.9M / 129M copied at 550K/s, 48.80% done, 00:02:02 to go
12:20:10.98 config:
12:20:10.98
12:20:10.98 NAME STATE READ WRITE CKSUM
12:20:10.98 testpool ONLINE 0 0 0
12:20:10.98 raidz2-0 ONLINE 0 0 0
12:20:10.98 /var/tmp/dev-0 ONLINE 0 0 8
12:20:10.98 /var/tmp/dev-1 ONLINE 0 0 6
12:20:10.98 /var/tmp/dev-2 ONLINE 0 0 38.0K
12:20:10.98 /var/tmp/dev-3 ONLINE 0 0 37.4K
12:20:10.98 /var/tmp/dev-4 ONLINE 0 0 8
12:20:10.98 /var/tmp/dev-5 ONLINE 0 0 8
12:20:10.98 /var/tmp/dev-6 ONLINE 0 0 6
12:20:10.98
12:20:10.98 errors: 1 data errors, use '-v' for a list <<<<< SHOULD NOT SEE ERRORS HERE
Here's the dbgmsg snippet from where it fails to reconstruct:
12:20:09.87 1723058408 ffff91cf003c8000 vdev_raidz.c:2053:vdev_raidz_reconstruct_row(): reconstruct(rm=ffff91cfdd9f3000 col=4 devid=6 offset=30d3c00 error=0)
12:20:09.87 1723058408 ffff91cf003c8000 vdev_raidz.c:2053:vdev_raidz_reconstruct_row(): reconstruct(rm=ffff91cfdd9f3000 col=5 devid=0 offset=30d3e00 error=0)
12:20:09.87 1723058408 ffff91cf003c8000 vdev_raidz.c:2912:raidz_reconstruct(): raidz_reconstruct_expanded(row=3)
12:20:09.87 1723058408 ffff91cf003c8000 vdev_raidz.c:2957:raidz_reconstruct(): simulating failure of col 0 devidx 4
12:20:09.87 1723058408 ffff91cf003c8000 vdev_raidz.c:2957:raidz_reconstruct(): simulating failure of col 2 devidx 0
12:20:09.87 1723058408 ffff91cf003c8000 vdev_raidz.c:2957:raidz_reconstruct(): simulating failure of col 5 devidx 3
12:20:09.87 1723058408 ffff91cf003c8000 vdev_raidz.c:2970:raidz_reconstruct(): reconstruction not possible; too many failures
12:20:09.87 1723058408 ffff91cf003c8000 vdev_raidz.c:3182:vdev_raidz_combrec(): reconstruction failed for all num_failures
12:20:09.87 1723058408 ffff91cf62aed200 dsl_scan.c:4535:dsl_scan_sync(): scan issued 9150 blocks for testpool (14 segs) in 35014ms (avg_block_size = 15269, avg_seg_size = 9995190)
12:20:09.87 1723058408 ffff91cf62aed200 dsl_scan.c:822:dsl_scan_sync_state(): finish scan checkpoint for testpool
12:20:09.87 1723058408 ffff91cf62aed200 metaslab.c:3984:metaslab_flush(): flushing: txg 119, spa testpool, vdev_id 0, ms_id 2, unflushed_allocs 58368, unflushed_frees 53760, appended 112 bytes
12:20:09.87 1723058409 ffff91cf62aed200 dsl_scan.c:4427:dsl_scan_sync(): finish scan clearing for testpool
12:20:09.87 1723058409 ffff91cf62aed200 dsl_scan.c:4546:dsl_scan_sync(): scan issuing complete txg 120 for testpool
12:20:09.87 1723058409 ffff91cf62aed200 spa_history.c:306:spa_history_log_sync(): txg 120 scan done errors=2
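For reference, the dbgmsg output above can be captured with something like the following; the module parameter and kstat paths are the usual Linux ones and may differ on other platforms.

```sh
# Grab the ZFS debug log around the failing scrub (Linux paths assumed).
echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable    # make sure the in-memory log is on
echo 0 > /proc/spl/kstat/zfs/dbgmsg                      # clear old messages
zpool scrub -w testpool
grep -E 'raidz_reconstruct|vdev_raidz_combrec|vdev_raidz_reconstruct_row' /proc/spl/kstat/zfs/dbgmsg
```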
Another example, this time with zpool status -vs:
13:11:11.15 pool: testpool
13:11:11.15 state: ONLINE
13:11:11.15 status: One or more devices has experienced an error resulting in data
13:11:11.15 corruption. Applications may be affected.
13:11:11.15 action: Restore the file in question if possible. Otherwise restore the
13:11:11.15 entire pool from backup.
13:11:11.15 see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
13:11:11.15 scan: scrub repaired 37.2M in 00:00:38 with 1 errors on Wed Aug 7 13:11:09 2024
13:11:11.15 expand: expansion of raidz2-0 in progress since Wed Aug 7 13:09:05 2024
13:11:11.15 73.0M / 129M copied at 593K/s, 56.57% done, 00:01:36 to go
13:11:11.15 config:
13:11:11.15
13:11:11.15 NAME STATE READ WRITE CKSUM SLOW
13:11:11.15 testpool ONLINE 0 0 0 -
13:11:11.15 raidz2-0 ONLINE 0 0 0 -
13:11:11.15 /var/tmp/dev-0 ONLINE 0 0 6 0
13:11:11.15 /var/tmp/dev-1 ONLINE 0 0 6 0
13:11:11.15 /var/tmp/dev-2 ONLINE 0 0 37.5K 0
13:11:11.15 /var/tmp/dev-3 ONLINE 0 0 37.0K 0
13:11:11.15 /var/tmp/dev-4 ONLINE 0 0 8 0
13:11:11.15 /var/tmp/dev-5 ONLINE 0 0 6 0
13:11:11.15 /var/tmp/dev-6 ONLINE 0 0 6 0
13:11:11.15
13:11:11.15 errors: Permanent errors have been detected in the following files:
13:11:11.15
13:11:11.15 /testpool/fs3/1/30
Describe how to reproduce the problem
We were hitting this pretty frequently while testing out the new "run ZTS in a QEMU VM on a GitHub runner" PR (https://github.com/openzfs/zfs/pull/15838). Those runners have fewer CPU resources than the buildbot ones, and it's that slowness that seems to trigger the bug. We were able to hit it there on basically every distro: AlmaLinux 8-9, Fedora 39-40, Ubuntu 20, 22, and 23.
I was able to reproduce this locally with:
- A Fedora 40 VM with 2 vCPUs and 8 GB RAM.
- The QEMU process throttled on the host to about 10% CPU with:
sudo cpulimit -l 10 -p [QEMU PID HERE]
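With the VM throttled like that, one way to keep hammering just this test is to loop it through the ZTS runner until a run reports FAIL. The zfs-tests.sh invocation below assumes an in-tree build; adjust the checkout path as needed.

```sh
cd ~/zfs    # zfs source checkout (illustrative path)
i=0
while :; do
    i=$((i + 1))
    ./scripts/zfs-tests.sh -t tests/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh \
        | tee "run-$i.log"
    grep -q '\[FAIL\]' "run-$i.log" && { echo "failed on run $i"; break; }
done
```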
Even then it didn't hit the failure every time. Just to give you an idea:
Test: /home/hutter/zfs/tests/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [03:55] [FAIL]
Test: /home/hutter/zfs/tests/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [01:32] [PASS]
Test: /home/hutter/zfs/tests/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [03:18] [PASS]
Test: /home/hutter/zfs/tests/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [04:58] [PASS]
Test: /home/hutter/zfs/tests/zfs-tests/tests/functional/raidz/raidz_expand_001_pos.ksh (run as root) [04:09] [FAIL]
This raidz_expand_001_pos failure is the last major blocker before we can merge https://github.com/openzfs/zfs/pull/15838.
Note that the test randomly chooses the parity level for each run, so the runs that fail could be using a different parity than the ones that pass. Parity=2 looks common to the failures I have observed.
The corruption always seems to be a single 8K block containing unexpected zeros on 512-byte boundaries.
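In case it's useful, here's roughly how that pattern could be confirmed, assuming the affected file is still readable (or that the suspect block has been dumped some other way, e.g. with zdb -R). The path is the one reported by zpool status -v above, and since the expected contents are non-zero here, any all-zero chunk is suspect.

```sh
# Scan a file in 512-byte chunks and report the chunks that are entirely zero.
f=/testpool/fs3/1/30    # affected file from the zpool status -v output
size=$(stat -c %s "$f")
for ((off = 0; off < size; off += 512)); do
    if cmp -s -n 512 <(dd if="$f" bs=512 skip=$((off / 512)) count=1 2>/dev/null) /dev/zero; then
        echo "all-zero 512-byte chunk at offset $off"
    fi
done
```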