RFC: generic device removal
This is an RFC for generic device removal, compatible with every kind of top-level vdev.
TL;DR
Generic device removal that operates on a leaf vdev basis (not on the top-level vdev like the current approach), with the replacement being a zvol backed by the rest of the pool. Most of the parts needed (zvol, disabling allocations from a vdev) should already be in place. The hard part would be turning the DMU async, which could also supply real read AIO while at it. The ugly part would be the SPA calling back up into the DMU (possibly recursively after repeated removals, though this can always be flattened to one level of indirection, at any time).
The main question is whether the last part (the SPA calling into the DMU) would be acceptable.
Details
The gist is to implement generic top-level device removal through replacing the leaf vdevs of the to-be-removed one with a replacement volume ('rvol') which basically behaves like a disk-like vdev from the pool perspective but is backed by a zvol(-like) dataset that is stored on the remaining 'real' vdevs of the pool.
The logic for a replacement would be as follows:
1. Disable new allocations on the respective top-level vdev (which should remove that vdev from the FREE calculation completely, as no further allocations will happen on it).
2. For each 'disk' leaf vdev in the to-be-removed vdev, create one 'rvol' dataset (sparse, identically sized to the leaf block device it replaces), which as of step 1 will be allocated solely from the remaining 'real' top-level vdevs of the pool, and `zpool attach` it to the original leaf vdev(s), turning these into mirrors (like a classical vdev replace would do).
3. Wait for the pool to resilver, which will (like a normal replace operation) replicate all referenced space from the 'real' disk leaf vdevs onto the new 'rvol' leaf vdevs.
4. After the resilver has completed without error, first mark the top-level vdev as 'replaced' (= no longer physically existing, so it e.g. doesn't need to be located on import), then detach the original 'real' block devices from the replacement mirrors.
Should the resilver fail for some reason, detach the 'rvol' leaf vdevs instead, destroy them and re-enable allocations on the top-level vdev to return to the original pool configuration. So it would be possible to abort the process for whatever reason, at any time (e.g. in case of an error, like running out of pool space while resilvering) before the final step. It would also be possible to reverse an already completed replacement through `zpool replace` of the leaf rvol vdev(s) with 'real' vdevs and then re-enabling free space allocation on the top-level one. A sketch of the whole sequence follows.
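Below is a minimal sketch of steps 1-4 as an admin might drive them. The `zpool noalloc` and `zpool mark-replaced` subcommands and the `tank/.rvols/*` namespace are hypothetical, invented here purely for illustration; `zfs create -s -V`, `zpool attach`, `zpool wait` and `zpool detach` exist today.

```sh
# Hypothetical workflow: remove top-level vdev raidz1-0 (leaves sda/sdb/sdc).
zpool noalloc tank raidz1-0                  # step 1: stop new allocations (hypothetical)

for disk in sda sdb sdc; do                  # step 2: one sparse rvol per leaf
    size=$(blockdev --getsize64 /dev/$disk)
    zfs create -s -V "$size" "tank/.rvols/$disk"
    zpool attach tank "$disk" "/dev/zvol/tank/.rvols/$disk"
done

zpool wait -t resilver tank                  # step 3: block until the resilver is done

zpool mark-replaced tank raidz1-0            # step 4: flag the vdev as replaced (hypothetical)...
for disk in sda sdb sdc; do
    zpool detach tank "$disk"                # ...and drop the physical leaves
done
```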
Inner workings
Any I/O on a rvol vdev would go through the normal codepaths up to doing a block device I/O, which then (instead of being a physical I/O on a 'disk' vdev) would be turned into a call into the DMU to be served from the backing (zvol-)rvol.
As no new data would ever be written to a rvol vdev (after the initial replacement replication) the only writes would happen on free operations updating the spacemaps. Leveraging the existing trim support should nicely collapse any deallocated ranges into holes in the backing rvol dataset, correctly reducing needed space on-disk.
Multiple removals are no problem: since the backing of rvols (the allocations for them) can only happen on 'real' vdevs, CoW will make sure that the data tree stays loop-free. Multiple indirections created by multiple removals (a vdev backing a rvol being removed itself) can be flattened into a single level of indirection by running the equivalent of `dd if=rvol of=rvol` (with nop-write disabled to make sure the data is actually rewritten, which will place it onto a 'real' top-level vdev) with a compression scheme active that collapses empty blocks into holes (to keep the rvol sparse); see the sketch below. All code needed for this to happen already exists.
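A minimal sketch of such a flattening pass, assuming a hypothetical backing dataset `tank/.rvols/rvol0`. nop-write only engages when a cryptographically strong checksum (sha256/skein/edonr) is combined with compression, so keeping the default fletcher4 checksum is enough to force a real rewrite:

```sh
# Rewrite every block of the rvol in place; CoW places the new copies on
# 'real' top-level vdevs, collapsing any chain of indirection to one level.
zfs set compression=lz4 tank/.rvols/rvol0       # zero-filled blocks become holes
zfs set checksum=fletcher4 tank/.rvols/rvol0    # keeps nop-write from skipping the rewrite
dd if=/dev/zvol/tank/.rvols/rvol0 of=/dev/zvol/tank/.rvols/rvol0 bs=1M
```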
When replacement is complete, the admin could decide to trade the space occupied by now-redundant redundancy (e.g. raidz parity drives that are no longer needed, as the data integrity of the rvols is taken care of by the redundancy of the remaining top-level vdevs) against read performance (as reads would then result in rebuilds from parity). This would need some logic so that replaced raidz vdevs trimmed down to no redundancy don't show as degraded, reads that lead to data being rebuilt from parity don't count as errors, etc. Detaching the redundancy drives should be no problem though, as all possible repairs (of errors that might have existed on the 'real' block devices) would already have been completed by the resilver that is part of the replacement process.
Should the allocated space on the replaced vdev reach zero, it could well be removed from the pool completely and the (then empty) child rvol(s) be destroyed, as it won't be accessed anymore (since nothing points toward it). This would need some more logic for the pool to deal with a device no longer being there, so maybe it would be easier (when nothing references it anymore) to just leave a stub where the device originally was and only destroy the rvols. This is basically the same problem the existing approach to device removal has.
Discard commands (from pool level free operations) being just forwarded to the rvol (zvol) should leverage the existing trim support to create holes there (and actually free the released space).
Implementation prerequisites
A) A 'rvol' dataset type that can be used to back rvol vdevs, main differences to a zvol would be:
- can't be renamed, snapshotted or cloned (as this would likely not make any sense)
- can't be destroyed or resized while backing a rvol vdev
- not being listed by `zfs list` unless `-t rvol` is given (maybe `-t all` should not show it either, to not get in the way of existing tools), possibly located as children of a magic parent dataset (akin to `.zfs` in filesystem directories, just in the dataset hierarchy) or even in a separate hierarchy to make name collisions with normal datasets completely impossible (see the sketch below)
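The closest existing building block is a sparse zvol; the `rvol` type and its hidden placement are the new parts. A sketch of how this might look, with the `-t rvol` type and the `.rvols` hierarchy being hypothetical:

```sh
# What exists today: a sparse (-s) volume sized like the device it replaces.
zfs create -s -V 4T tank/.rvols/rvol0
# What the RFC adds (hypothetical): a dedicated type, hidden from normal
# listings and locked against rename/snapshot/clone/destroy while in use.
zfs list -t rvol        # would list rvols; a plain 'zfs list' would not
```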
B) A 'rvol' leaf vdev type that leverages a rvol dataset (on the same pool) as backing storage; its side effects would need to be that it:
- disables free space allocation from its parent vdevs (up to the top-level one), to make sure any writes to a rvol will always be backed by 'real' vdevs
- removes the parent top-level vdev from pool FREE calculation (as no new allocations can happen from it)
- prevents any changes (eg. destroy) to the backing rvol dataset
C) A 'removed' top-level vdev flag that deals with the side effects of all members being backed by the pool itself, making sure the vdev:
- does not need to be available on zpool import
- does not complain about being degraded (e.g. a replaced raidz stripped of its redundancy)
- no longer gets uberblock updates (not needed, as there is no physical drive)
D) The DMU allowing real async reads (that was the only real problem @ahrens came up with when I discussed this approach with him: something about zio workers blocking and exhausting the thread pool; sadly Slack ate that thread out of financial greed so I can't copy&paste what he wrote me), so a ZIO worker in the SPA can trigger a needed read from the zvol without getting blocked in the DMU (in case metadata has to be read to service the read). This would also provide the basis for real AIO (instead of the current fake that only exposes the interface to userland but in reality always reads synchronously).
E) Teaching zpool import to ignore 'missing' rvol leaf vdevs (and 'removed' top-level vdevs) while scanning for the pool configuration. This might require touching (rewriting) metadata that zpool import needs before step 4 of the setup, so that anything needed to make a rvol available is CoW'd onto the 'real' vdevs of the pool (and thus available even while the rvols themselves are not, the pool not being imported yet).
Pros/Cons
This approach to device removal would avoid the problems of the current one, like only working for specific kinds of vdevs (or not being possible at all when a raidz is present in the pool, see https://github.com/zfsonlinux/zfs/pull/6900#issuecomment-501021145), as the current approaches map at the pool level and thus have to deal with the specifics of the vdev types (and require the mapping to be kept in memory, possibly with workarounds that map unused space of the source to reduce this memory requirement when removing fragmented vdevs). While https://github.com/zfsonlinux/zfs/issues/9013 might add another special case for raidz, that will likely come with limitations of its own (like needing enough contiguous free space on an identically structured vdev as the replacement target).
The downside is likely higher latency for reads (as an additional trip through the DMU is needed, which is more expensive than a block-level range mapping) and possibly higher on-disk metadata requirements (to track the data in the rvol). But as CoW will always write new data to 'real' vdevs, the rvol will only ever see reads and free operations (for the latter there already is async_destroy and trim support), and with a reasonable compression scheme on the rvol (dataset side) and a reasonably bigger block size, space management might even be more effective than the pool-level mapping of the existing approaches (at higher fragmentation).
Another downside is that it can't remap existing data (rewriting block pointers to point at the new physical location of the data, see https://github.com/openzfs/openzfs/pull/482), so as usual only newly written data will be placed directly on 'real' vdevs.
The upside of this approach is that it is completely agnostic to the kind of leaf vdev being replaced, as it would keep the functionality of the top-level vdev basically as-is (sans new allocations and the need to be physically present on import). So it should work cleanly with any type of top-level vdev (disk, mirror, raidz, draid, whatever the future may bring) in any type of pool configuration - as long as there is enough free space (from the zfs perspective) to create the rvols.
This would allow to freely (or even completely) change the vdev structure of the pool.
After the pool has been reshaped to the desired new layout, the latency-sensitive datasets could be moved via `send a | recv b; destroy a; rename b a` onto the new 'real' vdevs to completely remove the cost of indirection (see the sketch below). Or a simple in-place rewrite with nop-write disabled could be done for data that has no snapshots. Or the indirection of this approach could be reduced to a single, flat level (in case a physical vdev that rvol data was stored on has itself been removed, which would likely happen with multiple sequential device removal operations).
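A minimal sketch of the send|recv shuffle, with `tank/a` standing in for a latency-sensitive dataset (names are illustrative):

```sh
# The receive writes everything freshly, so it lands on 'real' vdevs.
zfs snapshot tank/a@move
zfs send tank/a@move | zfs recv tank/b
zfs destroy -r tank/a            # drop the old, indirected copy
zfs rename tank/b tank/a
zfs destroy tank/a@move          # clean up the transfer snapshot
```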
While the initial remove would be slower (as it uses resilver to transfer the data, which is slower than a block-level copy of allocated space, though https://github.com/zfsonlinux/zfs/pull/6256 helps with this), an upside would be that the data gets checked in the process (instead of garbage being copied happily, as hinted at in https://github.com/openzfs/openzfs/pull/482, in case a drive decided to deliver such while the removal happened).
This functionality could also (given enough top-level vdevs in the pool and enough free space) be leveraged to provide hot spares on demand, without the need for actual dedicated drives. It could make sense as a band-aid in case of a degraded vdev for which no physical spare drives are currently available: by temporarily replacing the offline/corrupt/missing leaf vdev(s) with a (rest-)pool backed rvol, the redundancy could be maintained at a healthy level until spare drives become available/arrive. Though no new allocations would happen on that vdev till it's fully back on 'real' drives (and it needs enough free space in the pool), this could be an interesting tool for an admin to avert a bigger disaster; see the sketch below.
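A minimal sketch of the virtual-spare idea, again using the hypothetical `tank/.rvols` namespace; only the `zfs create -s -V` and `zpool replace` commands exist today:

```sh
# A leaf of raidz2-0 died and no physical spare is at hand: stand in a
# pool-backed rvol (hypothetical) to restore full redundancy for now.
zfs create -s -V 8T tank/.rvols/spare0
zpool replace tank sdd /dev/zvol/tank/.rvols/spare0

# Later, when a real drive arrives, swap it in and drop the stopgap.
zpool replace tank /dev/zvol/tank/.rvols/spare0 /dev/sde
zfs destroy tank/.rvols/spare0
```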
Comments are welcome, please discuss.
Since you’re relying on the new vdev for redundancy, and you’re reading all the block pointers, you could skip copying the parity. That way you don’t waste space for it, and also don’t have to do the raidz reconstruction math.
What’s the advantage of creating one zvol-vdev for each leaf device, rather than one zvol-vdev for each top-level vdev? I think that one per top-level vdev will keep each logical block’s data close together in its new location, reducing read inflation.
There will be a bunch of space accounting weirdness, since logical blocks can now consume a different amount of space than what was charged (to their asize) when they were allocated.
~Overall, I’m not sure that using the zvol as an indirection method buys us much compared to using the indirect vdev mechanism that was introduced with (non-raidz) device removal. I think you could apply similar ideas to what you have here (eg. stripping out the removed vdev’s redundancy, fudging the accounting) within the existing vdev removal framework.~ The additional overhead of going through a zvol does let you smush together the holey data that will result when removing the parity sectors. Doing that would be less practical with the simpler indirect vdev mapping.
The zvol-vdev approach has a lot more space and performance overhead, compared to the indirect vdev. The most extreme example of this is turning frees into read-modify-writes (to zero out the part of the zvol’s block that is covered by the logical block).
Interesting idea to skip the parity instead of removing a parity's worth of the replaced drives. As long as it wouldn't need special handling of different top-level vdev kinds (so the generic code would work with all of them) it would be an interesting perspective; should different kinds of vdevs need different treatment, it could still be an interesting future optimization.
The idea of only replacing the block devices instead of the top-level vdev is to avoid having to deal with the specifics of the latter, and thus the problems the current approaches have (not working on certain kinds of top-level vdevs, needing identically structured ones for the remove to work, ...). It's basically a KISS approach that only deals with the leaves.
Space accounting on the pool or zfs level?
On the pool level my uneducated guess was ALLOC being calculated as SIZE-FREE; should it be different (so a shrunk pool could show more allocated than total space, while still having some free) I would need to think about how this could be dealt with.
On the zfs level the currently exposed accounting is somewhat lacking anyway, as there is no obvious way (as in `zfs get`/`zfs list` being able to answer this) to query the amount of pool space that is tied up by a dataset (or what it consumes for the needed metadata), so the available information in the `used*` properties is more or less useless when you need to figure out where your space went (see the listing below). So IMHO it wouldn't get much worse in that regard. Though I agree that without a way to query the actual consumed space of a dataset (so one could do some quick math) there would be no way to measure the real cost of the indirection.
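For reference, the space properties that do exist today (none of which answer "how much pool space does this dataset tie up in total, including its metadata"):

```sh
zfs list -o name,used,usedbydataset,usedbysnapshots,referenced,logicalused tank/fs
```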
While there certainly would be a price to pay to go through a zvol... it could be worth it. Like for also getting on-demand virtual spares that keep your pool from dying. YMMV.
My main question still is if SPA calling back into DMU would be an acceptable layer violation.
I think it's very beneficial. I mistakenly added a cache disk as a top-level vdev. I spent a lot of time fixing this, and if this RFC were implemented, I wouldn't have been bothered by that problem.
It already works for striped devices in 0.8.
```
tank  feature@device_removal  active  local
```
> It already works for striped devices in 0.8.
According to @behlendorf (https://github.com/zfsonlinux/zfs/pull/6900#issuecomment-501021145) with one slight (yet show-stopping for many pools) limitation:
> The key restriction worth emphasizing for device removal, is that no top-level data device may be removed if there exists a top-level raidz vdev in the pool. Only mirror and non-redundant vdevs can be added to the pool if you intend to use device removal.
(unless this restriction has been lifted by now, then I must have missed that announcement)
My main question still is if SPA calling back into DMU would be an acceptable layer violation.
I did not know that, thanks!
On Sun, Aug 11, 2019 at 2:23 PM Gregor Kopka [email protected] wrote:
> My main question still is if SPA calling back into DMU would be an acceptable layer violation.
It doesn't work in general. My gut reaction is that it's not a good idea. But you could investigate all the problems it causes and see if there are reasonable solutions.
--matt
Reading the subject/RFC title, I was hopeful that replacing a badly referenced (i.e. not using a UUID path) disk might be related to this feature request:
zpool replace fails when new dev name is same device - replugged external USB drive using /dev/sdX names cannot be detached/reattached #7866
And see also: Feature: zpool online to rename a disk #3242
> The key restriction worth emphasizing for device removal, is that no top-level data device may be removed if there exists a top-level raidz vdev in the pool. Only mirror and non-redundant vdevs can be added to the pool if you intend to use device removal.
And sadly the problem I have here is that someone accidentally added a single drive as a top-level vdev to a zpool (instead of adding it as a hot spare into the raidz2 array, which was what was intended)
This means that not only can this disk no longer be removed; a proportion of writes are now going onto a completely non-protected vdev :-(
@candlerb That's unfortunate. You should have gotten an error when you did zpool add poolname disk, which you would have to override with -f. If that was not your experience, please let us know.
```
$ man zpool-add
...
     The behavior of the -f option, and the device checks performed are
     described in the zpool create subcommand.

     -f   Forces use of vdevs, even if they appear in use or specify a
          conflicting replication level.  Not all devices can be
          overridden in this manner.

$ man zpool-create
...
     The command also checks that the replication strategy for the pool
     is consistent.  An attempt to combine redundant and non-redundant
     storage in a single pool, or to mix disks and files, results in an
     error unless -f is specified.  The use of differently sized devices
     within a single raidz or mirror group is also flagged as an error
     unless -f is specified.
```
That was exactly what the operator did :-(
It seems that Oracle gave ZFS this feature in 2018 (Solaris 11.4): https://blogs.oracle.com/solaris/oracle-solaris-zfs-device-removal
Under the heading "Misconfigured Pool Device" it gives exactly the scenario I am interested in (adding a single device as a top-level vdev next to a raidz vdev, and then removing it again)
However, the manpage for zfs 0.8.0 says:
> When the primary pool storage includes a top-level raidz vdev only hot spare, cache, and log devices can be removed.
Did Oracle implement vdev removal in a more generic way?
@mattmacy

> Re: POSIX AIO for async dmu. Would people be interested in using it if it existed? If so, for what use-case?

This RFC could be a use case for an async DMU.
As a newcomer to ZFS and a complete n00b I find it utterly bizarre that it's impossible to remove an arbitrary vdev from a zpool - IMHO it should be one of its basic features.
Zpools are operated by people, and people make mistakes constantly. One mistaken '-f' param while adding a vdev to the zpool shouldn't leave the pool permanently broken.
I don't fully understand how ZFS works, so I can't really recommend a solution. However, I would really love to know the fundamental reason for the inability to remove a top-level vdev from a zpool containing a raidz vdev - especially in the case referenced by @candlerb in https://github.com/openzfs/zfs/issues/9129#issuecomment-563233467
One thing I don't understand is why special top-level devices cannot be removed when there are raidz primary devices. I mean, I (think I) understand why raidz top-level devices cannot be removed, but the special devices contain data that is essentially off-loaded from the primary vdevs for performance reasons, so it should, in theory at least, be possible to transfer the data from the specials back to the primaries without loss of any features, provided there's enough space available of course.
Would the inverse of https://github.com/openzfs/zfs/pull/15022 be possible with this? Shrinking a raidz2 as long as enough free space is available, to eventually allow an upgrade to bigger drives but fewer of them?
> Would the inverse of #15022 be possible with this? Shrinking a raidz2 as long as enough free space is available, to eventually allow an upgrade to bigger drives but fewer of them?
Not in the sense of reducing the stripe width of an existing raidz vdev.
Though it would be possible to evacuate all data from an existing raidz vdev (given enough space on the other vdevs in the pool), so the physical disks end up being disconnected from the pool and can be repurposed (even to create another vdev in the pool).
Downsides of the approach (indirection while reading data from the removed vdev) would then apply, but could be removed by a trip through send|recv afterwards.
I left a comment with an example situation here https://github.com/openzfs/zfs/issues/9013#issuecomment-3562360461 I guess this RFC would not cover that.
> Would the inverse of #15022 be possible with this?
@BoBeR182 No. RAIDZ shrinking (stripe width reduction) is impossible even theoretically, since some disks would get two blocks from one row, loss of which would be fatal.