device removal for RAID-Z
Describe the problem you're observing
Device removal (of top-level vdevs) doesn't work with pools that have RAIDZ vdevs. This violates the principle of "everything works with everything else", i.e. all ZFS features interoperate cleanly.
Here are some aspects of a potential solution:
- Remove the checks that prevent the operation with raidz; add checks that all vdevs are the same "type" (e.g. the same number of disks in each raidz group and the same amount of parity in each raidz group); and ensure that vdev manipulation works correctly with raidz (e.g. replacing the raidz vdev with the indirect vdev).
- On RAIDZ, allocations of "continuation" segments must be properly aligned. I.e. if the first allocation for a segment ends on child 2, then the allocation for the remainder of this segment must begin on child 3. This ensures that a "split block" (which spans two mapping segments) does not have its data and parity unevenly distributed among the blocks, which could cause us to be unable to reconstruct it if a device fails.
- On RAIDZ, allocations must be a multiple of P+1 sectors (i.e. one more than the number of parity devices). Therefore we can only split a chunk at a multiple of P+1 sectors from the beginning of the chunk. Combined with the above requirement, this substantially constrains where the remainder of a split block can start: in the worst case it must be aligned to N*(P+1) sectors (N = devices in the group; P = parity count). See the sketch after this list.
- Once RAIDZ expansion integrates (see https://github.com/zfsonlinux/zfs/pull/8853), it should be possible to support removing a RAIDZ vdev where the remaining vdevs are wider than the removed vdev.
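To make the combined effect of the second and third constraints concrete, here is a minimal sketch (plain Python, not ZFS code; all names and numbers are hypothetical) that enumerates the offsets within a new chunk at which the continuation of a split block could start:

```python
# Illustrative sketch only (not ZFS code); all names here are hypothetical.

def continuation_offsets(chunk_start_child, required_child, ndisks, nparity, limit):
    """Offsets (in sectors from the start of a new chunk) that satisfy both
    constraints above:
      1. the offset is a multiple of nparity + 1 (allocation granularity), and
      2. the offset lands on required_child, i.e. the child immediately after
         the one on which the first piece of the split block ended."""
    valid = []
    for off in range(0, limit, nparity + 1):                      # constraint 1
        if (chunk_start_child + off) % ndisks == required_child:  # constraint 2
            valid.append(off)
    return valid

# Example: a 7-wide RAIDZ2 group (N=7, P=2) whose chunk starts on child 0,
# where the first piece of the split block ended on child 2, so the
# continuation must start on child 3.
print(continuation_offsets(chunk_start_child=0, required_child=3,
                           ndisks=7, nparity=2, limit=50))
# -> [3, 24, 45]
```

The valid start offsets repeat with period lcm(N, P+1), which equals N*(P+1) when N and P+1 are coprime, matching the worst-case alignment mentioned above.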
Note: we (Delphix) are not planning to implement this, but I wanted to document the thinking we've done on the subject.
Hi,
> This ensures that a "split block" (which spans two mapping segments) does not have its data and parity unevenly distributed among the blocks, which could cause us to be unable to reconstruct it if a device fails.
Does this mean that currently, if one device fails in a RAIDZ-1 configuration, data loss can occur?
@HiFiPhile No. RAIDZ-N guarantees that any N disks can fail without data loss.
Note that "split blocks" are a concept that is specific to device removal, which currently doesn't work on RAID-Z. This issue is discussing the challenges to adding a new feature: making device removal work with RAID-Z. This issue does not describe bugs in the current implementation.
So I cannot shrink a RAIDZ-based pool? Is this still true?
With the completion of https://github.com/openzfs/zfs/pull/15022, would the inverse be possible?
I have a 10x10TB raidz2 pool (100TB raw, 80TB usable).
I would like to slowly replace my disks one by one with 16TB ones and stop at 8 disks: remove 2 disks with a "shrink" command, then expand the pool to 8x16TB (128TB raw, 96TB usable).
Assuming I have at least 20TB of free space during the shrinking process (2x10TB of data has to be moved, or the RAIDZ2 redundancy could be abused), should this be theoretically possible?
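For what it's worth, a rough back-of-the-envelope check of those numbers (a sketch with the figures from the question, ignoring metadata, padding, and allocation overhead):

```python
# Rough capacity check for the scenario above (hypothetical numbers,
# ignoring metadata, padding, and allocation overhead).
def raidz_usable_tb(ndisks, disk_tb, nparity):
    # Usable space is roughly (data disks) * (disk size).
    return (ndisks - nparity) * disk_tb

before = raidz_usable_tb(ndisks=10, disk_tb=10, nparity=2)  # 80 TB usable, 100 TB raw
after = raidz_usable_tb(ndisks=8, disk_tb=16, nparity=2)    # 96 TB usable, 128 TB raw
to_move = 2 * 10  # worst case: everything on the two removed 10 TB disks
print(before, after, to_move)  # -> 80 96 20
```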
> Would the inverse be possible?
@BoBeR182 No. RAIDZ shrinking (stripe width reduction) is impossible even theoretically, since some disks would end up with two blocks from the same row, and losing one of those disks would be fatal.
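A small sketch of that pigeonhole argument, using the 10-wide raidz2 from the question above (illustrative numbers only, not ZFS code):

```python
# Sketch of the pigeonhole argument (illustrative numbers): packing a row of
# `width` sectors onto fewer than `width` disks forces some disk to hold at
# least two sectors of that row.  Losing that disk plus any (parity - 1)
# others then removes at least parity + 1 sectors from the row, which
# P-parity RAIDZ cannot reconstruct.
import math

def worst_case_lost_sectors(width, disks, parity):
    per_disk = math.ceil(width / disks)   # >= 2 whenever disks < width
    return per_disk + (parity - 1)        # that disk plus (parity - 1) more failures

# A 10-wide raidz2 row remapped onto 8 remaining disks:
print(worst_case_lost_sectors(width=10, disks=8, parity=2))  # -> 3, more than 2 parity sectors
```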