
device removal for RAID-Z

Open · ahrens opened this issue 6 years ago • 6 comments

Describe the problem you're observing

Device removal (of top-level vdevs) doesn't work with pools that have RAIDZ vdevs. This violates the principle of "everything works with everything else", i.e. that all ZFS features interoperate cleanly.

Here are some aspects of a potential solution:

  • Remove the checks that prevent the operation with raidz, add checks that all vdevs are the same "type" (e.g. the same number of disks and the same amount of parity in each raidz group), and ensure that vdev manipulation works correctly with raidz (e.g. replacing the raidz vdev with the indirect vdev).
  • On RAIDZ, allocations of "continuation" segments must be properly aligned. I.e. if the first allocation for a segment ends on child 2, then the allocation for the remainder of this segment must begin on child 3. This ensures that a "split block" (which spans two mapping segments) does not have its data and parity unevenly distributed among the blocks, which could cause us to be unable to reconstruct it if a device fails.
  • On RAIDZ, allocations must be a multiple of P+1 (i.e. one more than the number of parity devices). Therefore we can only split a chunk at a multiple of P+1 sectors from the beginning of the chunk. Combined with the above requirement, this substantially constrains where the remainder of a split block can start – in the worst case it must be aligned to N*(P+1), where N is the number of devices in the group and P is the parity count (see the sketch after this list).
  • Once RAIDZ expansion is integrated (see https://github.com/zfsonlinux/zfs/pull/8853), it should be possible to support removing a RAIDZ vdev where the remaining vdevs are wider than the removed vdev.
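
A minimal sketch of how the last two constraints combine, assuming a simplified layout where sector offset `off` of an allocation lands on child `off % N` (the real raidz layout also interleaves parity, so this is only an illustration; `continuation_offsets` is a hypothetical helper, not ZFS code):

```python
# Sketch only: find where the continuation of a split block may start,
# given that offsets must be multiples of P+1 and must land on the
# child immediately after the one where the first piece ended.

def continuation_offsets(n_children, n_parity, next_child):
    """Yield candidate start offsets (in sectors) within one worst-case
    alignment window of N*(P+1) sectors."""
    window = n_children * (n_parity + 1)
    for off in range(0, window, n_parity + 1):   # multiples of P+1
        if off % n_children == next_child:       # lands on the right child
            yield off

# Example: 7-wide raidz2 (N=7, P=2); the first piece ended on child 0,
# so the continuation must begin on child 1.
print(list(continuation_offsets(7, 2, 1)))       # [15]: only one valid
                                                 # offset per 21-sector window
```

When N and P+1 are coprime, exactly one offset in each N*(P+1)-sector window satisfies both conditions, which is the worst-case alignment mentioned above.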

ahrens avatar Jul 10 '19 16:07 ahrens

Note: we (Delphix) are not planning to implement this, but I wanted to document the thinking we've done on the subject.

ahrens avatar Jul 10 '19 16:07 ahrens

Hi,

This ensures that a "split block" (which spans two mapping segments) does not have its data and parity unevenly distributed among the blocks, which could cause us to be unable to reconstruct it if a device fails.

Does this mean that currently, if one vdev fails in a RAIDZ-1 configuration, data loss can occur?

HiFiPhile avatar Jul 11 '19 07:07 HiFiPhile

@HiFiPhile No. RAIDZ-N guarantees that any N disks can fail without data loss.

Note that "split blocks" are a concept that is specific to device removal, which currently doesn't work on RAID-Z. This issue is discussing the challenges to adding a new feature: making device removal work with RAID-Z. This issue does not describe bugs in the current implementation.

ahrens avatar Jul 11 '19 16:07 ahrens

So I cannot shrink a raidz-based pool? Is this still true?

Harvie avatar Mar 07 '23 09:03 Harvie

With the completion of https://github.com/openzfs/zfs/pull/15022, would the inverse be possible?

I have a 10x10TB raidz2 pool (100TB raw, 80TB usable).

I would like to slowly replace my disks one by one with 16TB ones and stop at 8 disks: remove 2 disks with a "shrink" command, then expand the pool to 8x16TB (128TB raw, 96TB usable).

Assuming I have at least 20TB of free space during the shrinking process (2x10TB of data has to be moved, or by abusing the z2 part of the raid), this should be theoretically possible?
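
(For reference, a rough sketch of the capacity arithmetic above, assuming an idealized raidz2 layout where usable space is (disks - parity) × disk size, and ignoring metadata, padding and TB/TiB differences; `raidz_usable` is just an illustrative helper.)

```python
# Back-of-the-envelope check of the figures quoted above (raw sizes only).
PARITY = 2  # raidz2

def raidz_usable(n_disks, disk_tb, parity=PARITY):
    """Approximate usable capacity of a single raidz vdev, in TB."""
    return (n_disks - parity) * disk_tb

print(raidz_usable(10, 10))  # 80  -> current pool: 100 TB raw, 80 TB usable
print(raidz_usable(8, 16))   # 96  -> target pool: 128 TB raw, 96 TB usable
# Going from 10 to 8 disks would mean evacuating the data held on the two
# removed 10 TB disks, hence the ~20 TB of free space assumed above.
```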

BoBeR182 avatar Nov 21 '25 10:11 BoBeR182

Would the inverse be possible?

@BoBeR182 No. RAIDZ shrinking (stripe width reduction) is impossible even theoretically, since some disks would get two blocks from one row, the loss of which would be fatal.
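
A small illustration of that argument, assuming a naive round-robin re-mapping of one 10-column raidz2 row onto 8 disks (not how ZFS actually lays out raidz; the placement here is purely hypothetical):

```python
# Pigeonhole sketch: 10 columns (8 data + 2 parity) cannot be spread over
# 8 disks without some disk holding two columns of the same row.
from collections import Counter

ROW_WIDTH = 10   # columns per row in the original 10-disk raidz2
N_DISKS = 8      # disks left after the hypothetical "shrink"
PARITY = 2       # raidz2 tolerates losing any 2 columns of a row

placement = Counter(col % N_DISKS for col in range(ROW_WIDTH))
print(placement)    # disks 0 and 1 each hold 2 columns of this row

# Worst case: the two failed disks are exactly the doubled-up ones.
worst_loss = sum(sorted(placement.values(), reverse=True)[:PARITY])
print(worst_loss)   # 4 columns lost > 2 parity -> the row is unreconstructable
```

However the columns are shuffled, at least one disk must hold two of them, so the pool would no longer tolerate the failures its parity level promises.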

amotin avatar Nov 21 '25 14:11 amotin