# Corrupt data after recovering from a suspended pool
### System information

| Type | Version/Name |
|---|---|
| Distribution Name | Linux |
| Distribution Version | Ubuntu 18.04 |
| Kernel Version | 5.3.0-26-generic 28~18.04.1~Ubuntu |
| Architecture | x86_64 |
| OpenZFS Version | 0.8.2-1 (and 2.1.99) |
### Describe the problem you're observing
After a pool becomes suspended due to losing too many disks, some files that were written just before the pool was suspended are unrecoverable. ZFS should know whether the write completed successfully, and should not discard the dirty data until it has been written properly.

We suspect part of this problem is that `zio_flush()` sets the `ZIO_FLAG_DONT_PROPAGATE` flag, so flush errors are not sent to the parent ZIO. However, even without that flag, we still see this problem. We are investigating further.
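The suspected mechanism can be illustrated with a toy model (plain Python, not ZFS code; the class and flag names below are stand-ins for the real C structures): a parent I/O only inherits a child's error when the child's "don't propagate" flag is clear, loosely mirroring how child errors reach parents during `zio_done()`. A flush child issued with the flag set can therefore fail while the parent still reports success:

```python
# Toy model of ZIO error propagation -- illustrative only, not ZFS code.
from dataclasses import dataclass, field

DONT_PROPAGATE = 1 << 0  # stand-in for ZIO_FLAG_DONT_PROPAGATE


@dataclass
class Zio:
    flags: int = 0
    error: int = 0
    children: list = field(default_factory=list)

    def add_child(self, child):
        self.children.append(child)

    def done(self):
        # A child's error reaches the parent only if the child's
        # DONT_PROPAGATE flag is clear.
        for child in self.children:
            child.done()
            if child.error and not (child.flags & DONT_PROPAGATE):
                self.error = child.error
        return self.error


# A write whose cache-flush child fails with EIO (5):
parent = Zio()
parent.add_child(Zio(error=0))                        # data write: OK
parent.add_child(Zio(flags=DONT_PROPAGATE, error=5))  # flush: failed
assert parent.done() == 0  # parent sees success; dirty data may be dropped

# Without the flag, the flush error would surface:
parent2 = Zio()
parent2.add_child(Zio(error=5))
assert parent2.done() == 5
```

If this model matches the real behavior, the write pipeline can complete "successfully" even though the data never reached stable storage, which is consistent with the unrecoverable files observed after `zpool clear`.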
### Describe how to reproduce the problem
We used `zinject` to `FAULT` more disks than the RAID-Z configuration can withstand. After removing the `zinject` handlers and running `zpool clear`, there are persistent checksum errors or completely unreadable files.
We were able to reproduce this more reliably on real hardware by using enclosure management tools to power off multiple disks in the pool at once, causing it to become faulted.
### Include any warning/errors/backtraces from the system logs
```
Sep 09 18:46:24 ZFS kernel: mpt3sas_cm0: log_info(0x31120101): originator(PL), code(0x12), sub_code(0x0101)
Sep 09 18:46:24 ZFS kernel: scsi_io_completion_action: 67 callbacks suppressed
Sep 09 18:46:24 ZFS kernel: sd 1:0:453:0: [sddg] tag#338 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Sep 09 18:46:24 ZFS kernel: sd 1:0:453:0: [sddg] tag#338 CDB: Write(10) 2a 00 00 00 67 05 00 00 01 00
Sep 09 18:46:24 ZFS kernel: print_req_error: 71 callbacks suppressed
Sep 09 18:46:24 ZFS kernel: blk_update_request: I/O error, dev sddg, sector 210984 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
Sep 09 18:46:24 ZFS kernel: sd 1:0:453:0: [sddg] tag#671 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
Sep 09 18:46:24 ZFS kernel: sd 1:0:453:0: [sddg] tag#671 CDB: Write(10) 2a 00 00 13 12 9b 00 00 03 00
Sep 09 18:46:24 ZFS kernel: blk_update_request: I/O error, dev sddg, sector 9999576 op 0x1:(WRITE) flags 0x700 phys_seg 2 prio class 0
Sep 09 18:46:24 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970a9044-part1 error=5 type=2 offset=5111394304 size=12288 flags=180880
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970a9044-part1 error=5 type=2 offset=270336 size=8192 flags=b08c1
Sep 09 18:46:27 ZFS kernel: sd 1:0:453:0: [sddg] tag#178 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970a9044-part1 error=5 type=5 offset=0 size=0 flags=100480
Sep 09 18:46:27 ZFS kernel: sd 1:0:453:0: [sddg] tag#178 CDB: Read(10) 28 00 cb bb ff f0 00 00 01 00
Sep 09 18:46:27 ZFS kernel: blk_update_request: I/O error, dev sddh, sector 27344764800 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
Sep 09 18:46:27 ZFS kernel: Buffer I/O error on dev sddh, logical block 3418095600, async page read
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970a9044-part1 error=5 type=1 offset=14000435503104 size=8192 flags=b08c1
Sep 09 18:46:27 ZFS kernel: Buffer I/O error on dev sddg, logical block 3418095600, async page read
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c1fe4-part1 error=5 type=5 offset=0 size=0 flags=100480
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c1fe4-part1 error=5 type=1 offset=14000435503104 size=8192 flags=b08c1
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c1fe4-part1 error=5 type=1 offset=14000435240960 size=8192 flags=b08c1
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c1fe4-part1 error=5 type=1 offset=270336 size=8192 flags=b08c1
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c1ce4-part1 error=5 type=5 offset=0 size=0 flags=100480
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970a9044-part1 error=5 type=5 offset=0 size=0 flags=100480
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970ac920-part1 error=5 type=5 offset=0 size=0 flags=100480
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c55c8-part1 error=5 type=5 offset=0 size=0 flags=100480
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c1ce4-part1 error=5 type=2 offset=99635200 size=4096 flags=180880
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c1ce4-part1 error=5 type=2 offset=1337597952 size=4096 flags=180880
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c1ce4-part1 error=5 type=1 offset=270336 size=8192 flags=b08c1
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c1ce4-part1 error=5 type=2 offset=2480975872 size=4096 flags=180880
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c1ce4-part1 error=5 type=1 offset=14000435240960 size=8192 flags=b08c1
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c55c8-part1 error=5 type=2 offset=31156936704 size=4096 flags=180880
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c55c8-part1 error=5 type=2 offset=31909847040 size=24576 flags=40080c80
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c55c8-part1 error=5 type=2 offset=6168944640 size=4096 flags=180880
Sep 09 18:46:27 ZFS kernel: zio pool=pool-name vdev=/dev/disk/by-id/wwn-0x5000cca2970c55c8-part1 error=5 type=2 offset=5111427072 size=4096 flags=180880
Sep 09 18:46:27 ZFS kernel: mpt3sas_cm0: removing handle(0x0019), sas_addr(0x5000cca2970c1ce5)
Sep 09 18:46:27 ZFS kernel: mpt3sas_cm0: enclosure logical id(0x5000ccab040d5080), slot(42)
Sep 09 18:46:27 ZFS kernel: mpt3sas_cm0: enclosure level(0x0000), connector name( 1 )
Sep 09 18:46:27 ZFS kernel: sd 1:0:453:0: [sddg] Synchronizing SCSI cache
Sep 09 18:46:27 ZFS kernel: sd 1:0:453:0: [sddg] Synchronize Cache(10) failed: Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Sep 09 18:46:27 ZFS kernel: WARNING: Pool 'pool-name' has encountered an uncorrectable I/O failure and has been suspended.
```
The pool was then resumed with `zpool clear` once the HDDs powered back up:
```
# zpool status pool-name
  pool: pool-name
 state: ONLINE
status: One or more devices is currently being resilvered. The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sat Sep 3 02:09:21 2022
        823M scanned at 823M/s, 825M issued at 825M/s, 74.3G total
        24.8M resilvered, 1.08% done, 0 days 00:01:31 to go
config:

        NAME                        STATE     READ WRITE CKSUM
        pool-name                   ONLINE       0     0     0
          raidz3-0                  ONLINE       0     0     0
            wwn-0x5000cca2970a9044  ONLINE       0     0   442  (resilvering)
            wwn-0x5000cca2970ac920  ONLINE       0     0     0  (resilvering)
            wwn-0x5000cca2970c1ce4  ONLINE       0     0     0  (resilvering)
            wwn-0x5000cca2970c1fe4  ONLINE       0     0   234  (resilvering)
            wwn-0x5000cca2970c55c8  ONLINE       0     0    84  (resilvering)
            wwn-0x5000cca2970c55fc  ONLINE       0     0     0
            wwn-0x5000cca2970c5600  ONLINE       0     0     0
            wwn-0x5000cca2970c5980  ONLINE       0     0     0
            wwn-0x5000cca2970c5d3c  ONLINE       0     0     0
            wwn-0x5000cca2970c750c  ONLINE       0     0     0
            wwn-0x5000cca2970c9468  ONLINE       0     0     0
            wwn-0x5000cca2970c98a4  ONLINE       0     0     0
            wwn-0x5000cca2970cba40  ONLINE       0     0     0
            wwn-0x5000cca2970cbd90  ONLINE       0     0     0

errors: 788 data errors, use '-v' for a list
```
I've run into this on FreeBSD as well.