
RAIDZ Expansion feature

Open ahrens opened this issue 3 years ago • 189 comments

Motivation and Context

This feature allows disks to be added one at a time to a RAID-Z group, expanding its capacity incrementally. This feature is especially useful for small pools (typically with only one RAID-Z group), where there isn't sufficient hardware to add capacity by adding a whole new RAID-Z group (typically doubling the number of disks).

For additional context as well as a design overview, see my talk at the 2021 FreeBSD Developer Summit (video) (slides), and a news article from Ars Technica.

Description

Initiating expansion

A new device (disk) can be attached to an existing RAIDZ vdev, by running zpool attach POOL raidzP-N NEW_DEVICE, e.g. zpool attach tank raidz2-0 sda. The new device will become part of the RAIDZ group. A "raidz expansion" will be initiated, and the new device will contribute additional space to the RAIDZ group once the expansion completes.

The feature@raidz_expansion on-disk feature flag must be enabled to initiate an expansion, and it remains active for the life of the pool. In other words, pools with expanded RAIDZ vdevs can not be imported by older releases of the ZFS software.
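
For example, initiating and monitoring an expansion could look like the following (a minimal sketch; the pool name "tank" and the device names are placeholders):

    # Enable the on-disk feature flag (if not already enabled).
    zpool set feature@raidz_expansion=enabled tank

    # Attach a new disk to the existing raidz2 vdev; -w waits for the
    # expansion to complete.
    zpool attach -w tank raidz2-0 sdf

    # Or check progress from another shell.
    zpool status tank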

During expansion

The expansion entails reading all allocated space from existing disks in the RAIDZ group, and rewriting it to the new disks in the RAIDZ group (including the newly added device).

The expansion progress can be monitored with zpool status.

Data redundancy is maintained during (and after) the expansion. If a disk fails while the expansion is in progress, the expansion pauses until the health of the RAIDZ vdev is restored (e.g. by replacing the failed disk and waiting for reconstruction to complete).

The pool remains accessible during expansion. Following a reboot or export/import, the expansion resumes where it left off.

After expansion

When the expansion completes, the additional space is available for use, and is reflected in the available zfs property (as seen in zfs list, df, etc).

Expansion does not change the number of failures that can be tolerated without data loss (e.g. a RAIDZ2 is still a RAIDZ2 even after expansion).

A RAIDZ vdev can be expanded multiple times.

After the expansion completes, old blocks remain with their old data-to-parity ratio (e.g. a 5-wide RAIDZ2 has 3 data to 2 parity), but they are distributed among the larger set of disks. New blocks will be written with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which has been expanded once to 6-wide has 4 data to 2 parity). However, the RAIDZ vdev's "assumed parity ratio" does not change, so slightly less space than expected may be reported for newly-written blocks, according to zfs list, df, ls -s, and similar tools.

Manpage changes

zpool-attach.8:

NAME
     zpool-attach — attach new device to existing ZFS vdev

SYNOPSIS
     zpool attach [-fsw] [-o property=value] pool device new_device

DESCRIPTION
     Attaches new_device to the existing device.  The behavior differs
     depending on whether the existing device is a RAIDZ device, or a
     mirror/plain device.

     If the existing device is a mirror or plain device ...

     If the existing device is a RAIDZ device (e.g. specified as "raidz2-0"),
     the new device will become part of that RAIDZ group.  A "raidz expansion"
     will be initiated, and the new device will contribute additional space to
     the RAIDZ group once the expansion completes.  The expansion entails
     reading all allocated space from existing disks in the RAIDZ group, and
     rewriting it to the new disks in the RAIDZ group (including the newly
     added device).  Its progress can be monitored with zpool status.

     Data redundancy is maintained during and after the expansion.  If a disk
     fails while the expansion is in progress, the expansion pauses until the
     health of the RAIDZ vdev is restored (e.g. by replacing the failed disk
     and waiting for reconstruction to complete).  Expansion does not change
     the number of failures that can be tolerated without data loss (e.g. a
     RAIDZ2 is still a RAIDZ2 even after expansion).  A RAIDZ vdev can be
     expanded multiple times.

     After the expansion completes, old blocks remain with their old
     data-to-parity ratio (e.g. a 5-wide RAIDZ2 has 3 data to 2 parity),
     but distributed among the larger set of disks.  New blocks will be
     written with the new data-to-parity ratio (e.g. a 5-wide RAIDZ2 which
     has been expanded once to 6-wide has 4 data to 2 parity).  However,
     the RAIDZ vdev's "assumed parity ratio" does not change, so slightly
     less space than expected may be reported for newly-written blocks,
     according to zfs list, df, ls -s, and similar tools.

Status

This feature is believed to be complete. However, like all PRs, it is subject to change as part of the code review process. Since this PR includes on-disk changes, it shouldn't be used on production systems before it is integrated into the OpenZFS codebase. Tasks that still need to be done before integration:

  • [ ] Cleanup ztest code
  • [ ] Additional code cleanup (address all XXX comments)
  • [ ] Document the high-level design in a "big theory statement" comment
  • [ ] Remove/disable verbose logging
  • [ ] Fix the last few test failures
  • [ ] Remove the first commit (needed to get cleaner test runs)

Acknowledgments

Thank you to the FreeBSD Foundation for commissioning this work in 2017 and continuing to sponsor it well past our original time estimates!

Thanks also to contributors @FedorUporovVstack, @stuartmaybee, @thorsteneb, and @fmstrat for portions of the implementation.

Sponsored-by: The FreeBSD Foundation
Contributions-by: Stuart Maybee [email protected]
Contributions-by: Fedor Uporov [email protected]
Contributions-by: Thorsten Behrens [email protected]
Contributions-by: Fmstrat [email protected]

How Has This Been Tested?

Tests added to the ZFS Test Suite, in addition to manual testing.

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Performance enhancement (non-breaking change which improves efficiency)
  • [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Documentation (a change to man pages or other documentation)

Checklist:

  • [x] My code follows the ZFS on Linux code style requirements.
  • [x] I have updated the documentation accordingly.
  • [x] I have read the contributing document.
  • [x] I have added tests to cover my changes.
  • [ ] All new and existing tests passed.
  • [ ] All commit messages are properly formatted and contain Signed-off-by.

ahrens avatar Jun 11 '21 05:06 ahrens

Congrats on the progress!

Evernow avatar Jun 11 '21 05:06 Evernow

Also congrats on moving this out of alpha.

I have a question regarding the

After the expansion completes, old blocks remain with their old data-to-parity ratio

section and it might help to do a little example here:

If I have an 8x1TB RAIDZ2 and load 2 TB into it, it would be 33% full.

In comparison, if I start with 4x1TB RAIDZ2 which is full and expand it to 8x1TB RAIDZ2, it would be 50% full.

Is that understanding correct? If so, it would mean that one should always start an expansion with as empty a vdev as possible. As this will not always be an option, is there any (planned) possibility of rewriting the old data with the new parity? Would moving the data off the vdev and then back again do the job?

cornim avatar Jun 11 '21 07:06 cornim

Moving the data off and back on again should do it I think, since it rewrites the data (snapshots might get in the way of it). If deduplication is disabled, copying the files and removing the old version might do the trick (Or am I missing something?)

GurliGebis avatar Jun 11 '21 08:06 GurliGebis

@cornim I think that math is right - that's one of the worst cases (you'd be better off starting with mirrors than a 4-wide RAIDZ2; and if you're doubling your number of drives you might be better off adding a new RAIDZ group). To take another example, if you have a 5-wide RAIDZ1, and add a disk, you'll still be using 1/5th (20%) of the space as parity, whereas newly written blocks will use 1/6th (17%) of the space as parity - a difference of 3%.

@GurliGebis Rewriting the blocks would cause them to be updated to the new data:parity ratio (e.g. saving 3% of space in the 5-wide example). Assuming there are no snapshots, copying every file, or dirtying every block (e.g. reading the first byte and then writing the same byte back) would do the trick. If there are snapshots (and you want to preserve their block sharing), you could use zfs send -R to copy the data. It should be possible to add some automation to make it easier to rewrite your blocks in the common cases.
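
For illustration, a minimal sketch of the zfs send -R approach (the dataset and snapshot names are hypothetical, and this assumes enough free space for a second copy):

    # Replicate the dataset tree, preserving snapshots and their block sharing.
    zfs snapshot -r tank/data@rewrite
    zfs send -R tank/data@rewrite | zfs receive tank/data-new

    # After verifying the copy, retire the old dataset and rename the new one.
    zfs destroy -r tank/data
    zfs rename tank/data-new tank/data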

ahrens avatar Jun 11 '21 14:06 ahrens

It might be helpful to state explicitly whether extra space becomes available during expansion or only after expansion is completed.

stuartthebruce avatar Jun 11 '21 15:06 stuartthebruce

@stuartthebruce Good point. This is mentioned in the commit message and PR writeup:

When the expansion completes, the additional space is available for use

~But I'll add it to the manpage as well.~ Oh, it is stated in the manpage as well:

the new device will contribute additional space to the RAIDZ group once the expansion completes.

ahrens avatar Jun 11 '21 17:06 ahrens

~But I'll add it to the manpage as well.~ Oh, it is stated in the manpage as well:

the new device will contribute additional space to the RAIDZ group once the expansion completes.

If it is not too pedantic, how about "additional space...only after the expansion completes"? The current wording leaves open the possibility that space might become available incrementally during expansion.

stuartthebruce avatar Jun 11 '21 17:06 stuartthebruce

One question: if I add 5 more drives to a 10-drive pool with this system, the 5 new ones have the new parity, and after that the old 10 are replaced step by step with the replace command. At the end, when the 10 old ones have been replaced, will the whole raid have the new parity? When the old ones are replaced, is the old parity kept or is the new parity used, so we can recover the extra space?

felisucoibi avatar Jun 11 '21 22:06 felisucoibi

@felisucoibi I'm not sure I totally understand your question, but here's an example that may be related to yours: If you start with a 10-wide RAIDZ1 vdev, and then do 5 zpool attach operations to add 5 more disks to it, you'll then have a 15-wide RAIDZ1 vdev. If the 5 new disks were bigger than the 10 old disks, and you then zpool replace each of the 10 old disks with a new big disk, then the vdev will be able to expand to use the space of the 15 new, large disks. In any case, old blocks will continue to use the existing data:parity ratio of 9:1 (10% parity), and newly written blocks will use the new data:parity ratio of 14:1 (6.7% parity). So the difference in space used by parity is only 3.3%.
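
As a rough sketch of that sequence (pool, vdev, and device names are hypothetical):

    # Grow a 10-wide raidz1 to 15-wide, one attach at a time.
    zpool attach -w tank raidz1-0 sdk
    zpool attach -w tank raidz1-0 sdl
    # (repeat for the remaining three new disks)

    # Then swap each of the 10 old, smaller disks for a larger one.
    # With autoexpand=on, the vdev grows once every disk in it is larger.
    zpool set autoexpand=on tank
    zpool replace -w tank sda sdp
    # (repeat for the remaining old disks)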

ahrens avatar Jun 12 '21 01:06 ahrens

@felisucoibi I'm not sure I totally understand your question, but here's an example that may be related to yours: If you start with a 10-wide RAIDZ1 vdev, and then do 5 zpool attach operations to add 5 more disks to it, you'll then have a 15-wide RAIDZ1 vdev. If the 5 new disks were bigger than the 10 old disks, and you then zpool replace each of the 10 old disks with a new big disk, then the vdev will be able to expand to use the space of the 15 new, large disks. In any case, old blocks will continue to use the existing data:parity ratio of 9:1 (10% parity), and newly written blocks will use the new data:parity ratio of 14:1 (6.7% parity). So the difference in space used by parity is only 3.3%.

Thanks for the answer. So the only way to recalculate the old blocks is to rewrite the data like you suggested.

felisucoibi avatar Jun 12 '21 06:06 felisucoibi

After the expansion completes, old blocks remain with their old data-to-parity ratio

First of all, this is an awesome feature, thank you. If I may ask: why aren't the old blocks rewritten to reclaim some extra space? I can imagine that redistributing data only affects a smaller portion of all data and thus is faster, but the user then still has to rewrite data to reclaim storage space. It would be nice if this could be done as part of the expansion process, as an extra option for people willing to accept the extra time required. For what it's worth.

louwrentius avatar Jun 12 '21 12:06 louwrentius

ZFS has a philosophy of “don’t mess with what’s already on disk if you can avoid it. If need be go to extremes to not mess with what’s been written (memory mapping removed disks in pools of mirrors for example)”. Someone who wants old data rewritten can make that choice and send/recv, which is an easy operation. I like the way this works now.

yorickdowne avatar Jun 12 '21 13:06 yorickdowne

@louwrentius I'm of the same mind - coming from the other direction though, and given the code complexity involved, I was wondering whether the data redistribution component was perhaps going to be listed as a subsequent PR...? I'd looked for an existing one in the event it was already out there, but it could be something that's already thought of/planned and I just wasn't able to locate it.

Given the number of components involved and the complexity of the operations that'd be necessary, especially as it'd pertain to memory and snapshots, I could see it making sense to split the tasks up. I'm imagining something like -

  • expansion completes
  • existing stripe read - how do we ensure the data read is written back to the same vdev? If within the intent log, we'd need a method to direct those writes back to a specific vdev and bypass the current vdev allocation method
  • stripe re-written to 'new' stripe - Assuming there's snapshotted data on the vdev, how is that snapshot's metadata updated to reflect the new block locations? Metadata on other vdevs may (likely does) point to LBAs housed on this vdev, and updating a snapshot could be... problematic. Do we make it a requirement that no snapshots can exist in the pool for this operation to take place?
  • existing stripe freed

To me at least, the more I think about this, the more sense it'd make to have a 'pool level' rebalance/redistribution, as all existing data within a pool's vdevs is typically reliant upon one another. It'd certainly seem to simplify things compared to what I'm describing above I'd think. It also helps to solve other issues which've been longstanding, especially as it relates to performance of long lived pools, which may've had multiple vdevs added over time as the existing pool became full.

Anyway, I don't want to ramble too much - I could just see how it'd make sense at least at some level to have data redistribution be another PR.

teambvd avatar Jun 12 '21 13:06 teambvd

Assuming there are no snapshots, copying every file, or dirtying every block (e.g. reading the first byte and then writing the same byte back) would do the trick. If there are snapshots (and you want to preserve their block sharing), you could use zfs send -R to copy the data. It should be possible to add some automation to make it easier to rewrite your blocks in the common cases.

Having an easily accessible command to rewrite all the old blocks, or preferably an option to do so as part of the expansion process would be greatly appreciated.

mufunyo avatar Jun 12 '21 15:06 mufunyo

@louwrentius @teambvd @yorickdowne @mufunyo I think y'all are getting at a few different questions:

  1. What would be involved in re-allocating the existing blocks such that they use less space? Doing this properly - online, working with other existing features (snapshots, clones, dedup), without requiring tons of extra space - would require incrementally changing snapshots, which is a project of similar scale to RAIDZ Expansion. There are workarounds available that accomplish the same end result in restricted use cases (no snapshots? touch all blocks. plenty of space? zfs send -R).
  2. How much benefit would it be? I gave a few examples above, but it's typically a few percent (e.g. 5-wide -> 6-wide, you get at least 5/6th (83%) of a drive of additional usable space, and if you reallocated the existing blocks you could get a whole drive of usable space, i.e. an additional 17% of a drive, or ~3% of the whole pool). Wider RAIDZs will see less impact (9-wide -> 10-wide, you get at least 90% of a drive of additional usable space; you're missing out on 1% of the whole pool).
  3. Why didn't I do this yet? Because it's an incredible amount of work for little benefit, and I believe that RAIDZ Expansion as designed and implemented is useful for a lot of people.

All that said, I'd be happy to be proven wrong about the difficulty of this! Such a facility could be used for many other tasks, e.g. recompressing existing data to save more space (lz4 -> zstd). If anyone has ideas on how this could be implemented, maybe we can discuss them on the discussion forum or in a new feature request. Another area that folks could help with is automating the rewrite in restricted use cases (by touching all blocks or zfs send -R).
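
As one possible shape for that kind of automation in the no-snapshots, no-dedup case (a crude sketch; the path is hypothetical and it needs free space for the largest file):

    # Rewrite each file by copying it and renaming the copy over the original,
    # so its blocks are reallocated with the new data:parity ratio.
    find /tank/data -type f -exec sh -c '
        cp -a "$1" "$1.rewrite" && mv "$1.rewrite" "$1"
    ' sh {} \;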

ahrens avatar Jun 12 '21 16:06 ahrens

@louwrentius @teambvd @yorickdowne @mufunyo I think y'all are getting at a few different questions:

  1. What would be involved in re-allocating the existing blocks such that they use less space? Doing this properly - online, working with other existing features (snapshots, clones, dedup), without requiring tons of extra space - would require incrementally changing snapshots, which is a project of similar scale to RAIDZ Expansion. There are workarounds available that accomplish the same end result in restricted use cases (no snapshots? touch all blocks. plenty of space? zfs send -R).
  2. How much benefit would it be? I gave a few examples above, but it's typically a few percent (e.g. 5-wide -> 6-wide, you get at least 5/6th (83%) of a drive of additional usable space, and if you reallocated the existing blocks you could get a whole drive of usable space, i.e. an additional 17% of a drive, or ~3% of the whole pool). Wider RAIDZs will see less impact (9-wide -> 10-wide, you get at least 90% of a drive of additional usable space; you're missing out on 1% of the whole pool).
  3. Why didn't I do this yet? Because it's an incredible amount of work for little benefit, and I believe that RAIDZ Expansion as designed and implemented is useful for a lot of people.

All that said, I'd be happy to be proven wrong about the difficulty of this! Such a facility could be used for many other tasks, e.g. recompressing existing data to save more space (lz4 -> zstd). If anyone has ideas on how this could be implemented, maybe we can discuss them on the discussion forum or in a new feature request. Another area that folks could help with is automating the rewrite in restricted use cases (by touching all blocks or zfs send -R).

I'd just like to state I greatly GREATLY appreciate the work you're doing. Frankly, this is one of the things holding lots of people back from using ZFS, and having the ability to grow a zfs pool without having to add vdevs will be literally magical. I will be able to easily switch back to ZFS after this is complete. Again THANK YOU VERY MUCH!!

Jerkysan avatar Jun 12 '21 19:06 Jerkysan

@ahrens As someone silently following the progress since the original PR, I also want to note that I really appreciate all the effort and commitment you put and are putting into this feature! I believe once this lands, it'll be a really valuable addition to zfs :)

Thank you

kellerkindt avatar Jun 12 '21 19:06 kellerkindt

@kellerkindt @Jerkysan @Evernow Thanks for the kind words! It really makes my day to know that this work will be useful and appreciated! ❤️ 😄

ahrens avatar Jun 12 '21 20:06 ahrens

Thanks for your work on this feature. It's exciting to finally see some progress in this area, and it will be useful for many people once released.

Do these changes lay any groundwork for future support for adding a parity disk (instead of a data disk - i.e., increasing the RAID-Z level)? Meaningfully growing the number of disks in an existing array would likely trigger a desire to increase the fault tolerance level as well.

Since the existing data is just redistributed, I understand that the old data would not have the increased redundancy unless rewritten. But I am still curious if your work that allows supporting old/new data+parity layouts simultaneously in a pool could also apply to increasing the number of parity disks (and algorithm) for future writes.

DayBlur avatar Jun 13 '21 17:06 DayBlur

I'm also incredibly stoked for this issue, and understand the decision to separate reallocation of existing data into another FR. Thanks so much for all of the hard work that went into this. I can't wait to take advantage of it.

After performing a raidz expansion, is there at least an accurate mechanism to determine which objects (files, snapshots, datasets... I admit I haven't fully wrapped my head around how this impacts things, so apologies for not using the correct terms) map to blocks with the "old" data-to-parity ratio, and possibly to calculate the resulting space increase? I imagine many administrators desire a balance between maximizing space, enjoying all of the benefits of ZFS (checksumming, deduplication, etc), and the flexibility of expanding storage (yes, we want to eat all the cakes), and will naturally compare this feature to other technologies, such as md raid, where growing a raid array triggers a recalculation of all parity. As such, these administrators will want to be able to plan out how to do the same on a zpool with an expanded raidz vdev without just blindly rewriting all files.

rickatnight11 avatar Jun 14 '21 20:06 rickatnight11

@DayBlur

Do these changes lay any groundwork for future support for adding a parity disk (instead of a data disk - i.e., increasing the RAID-Z level)? ... I am still curious if your work that allows supporting old/new data+parity layouts simultaneously in a pool could also apply to increasing the number of parity disks (and algorithm) for future writes.

Yes, that's right. This work allows adding a disk, and future work could be to increase the parity. As you mentioned, the variable, time-based geometry scheme of RAIDZ Expansion could be leveraged to know that old blocks have the old amount of parity, and new blocks have the new amount of parity. That work would be pretty straightforward.

However, the fact that the old blocks remain with the old amount of failure tolerance means that overall you would not be able to tolerate an increased number of failures, until all the old blocks have been freed. So I think that practically, it would be important to have a mechanism to at least observe the amount of old blocks, and probably also to reallocate the old blocks. Otherwise you won't actually be able to tolerate any more failures without losing [the old] data. This would be challenging to implement in the general case but as mentioned above there are some OK solutions for special cases.

ahrens avatar Jun 14 '21 21:06 ahrens

@rickatnight11

After performing a raidz expansion is there at least an accurate mechanism to determine which objects (files, snapshots, datasets....I admit I haven't fully wrapped my head around how this impacts things, so apologies for not using the correct terms) map to blocks with the "old" data-to-parity ratio and possibly calculate the resulting space increase?

There isn't an easy way to do this, but you could get this information out of zdb. Basically you are looking for blocks whose birth time is before the expansion completion time. I think it would be relatively straightforward to make some tools to help with this problem. A starting point might be to report which snapshots were created before the parity-expansion completed, and therefore would need to be destroyed to release that space.
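
A rough starting point along those lines, going by timestamps rather than block birth times (pool name hypothetical; this is an approximation, not precise accounting):

    # When was the expansion initiated?  zpool history records the attach,
    # and zpool status reports the expansion state.
    zpool history tank | grep 'zpool attach'
    zpool status tank

    # List snapshots with their creation times, to see which ones still pin
    # blocks written before the expansion completed.
    zfs get -r -t snapshot -o name,value creation tank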

ahrens avatar Jun 14 '21 21:06 ahrens

@ahrens I also wish to thank you for the awesome work!

I have some questions:

  • do DVAs need to change after a disk is attached to a RAIDZ vdev?
  • if so, how are DVAs rewritten? Are you using a "placeholder" pointing to the new DVA?
  • if DVAs do not change, how does the relocation actually work?

Thanks.

shodanshok avatar Jun 16 '21 08:06 shodanshok

Frankly, this is one of the things holding lots of people back from using ZFS

curious what filesystem or pooling / RAID setup these imaginary people went with instead; presumably it also checks most of the boxes that ZFS does? volume manager, encryption provider, compressing filesystem, with snapshots etc..

bghira avatar Jun 16 '21 13:06 bghira

Frankly, this is one of the things holding lots of people back from using ZFS

curious what filesystem or pooling / RAID setup these imaginary people went with instead; presumably it also checks most of the boxes that ZFS does? volume manager, encryption provider, compressing filesystem, with snapshots etc..

The answer is "we settled"... I settled for unraid, though until that point I had run ZFS for years and years. I could no longer afford to buy enough drives all at once to build a ZFS box outright. I had life obligations that I had to meet while wanting to continue my hobby, which seemingly never stops expanding in cost, frankly. I needed something that could grow with me. It just simply doesn't give me the compressing file system, snapshots, and yada yada.

Basically, I had to make sacrifices to continue my hobby without literally breaking the bank. This functionality will allow me to go back to ZFS. I don't care if I have to move all the files to get them to use the new drives I'm adding and such. It's not "production critical" but I do want all the "nice things" that ZFS offers. This project will literally "give me my cake and let me eat it too". I've been waiting since it was first announced years ago and hovering over the searches waiting to see a new update. I'm already trying to figure out how I'm going to slide into this.

Jerkysan avatar Jun 16 '21 13:06 Jerkysan

Frankly, this is one of the things holding lots of people back from using ZFS

curious what filesystem or pooling / RAID setup these imaginary people went with instead; presumably it also checks most of the boxes that ZFS does? volume manager, encryption provider, compressing filesystem, with snapshots etc..

As one of those imaginary people, one array went to use BTRFS, and the other array uses mdadm, lvm2, and ext4. I am happy with neither of them.

Primary driver for selection was both expansion capability and failed disk resilience.

kylegordon avatar Jun 16 '21 13:06 kylegordon

To be realistic, it's too late for this to make it into OpenZFS 2.1 in August, which means maybe OpenZFS 2.2 in August 2022; then it needs to make it into the "OS of choice". TrueNAS might go for it early, as it is wont to, but that's not something to bank on.

In the meantime, a "sorta expansion" with send/recv to el cheapo SAS from eBay and back definitely works. I did this from 5x8TB to 8x8TB. Didn't break the bank and now has more space than I am likely to use.

yorickdowne avatar Jun 16 '21 13:06 yorickdowne

that means maybe OpenZFS 2.2 in August 2022

On one hand, while that seems like a long time away... this kind of seems like a substantial capability change. So an extra ~year of (non-production!) testing to uncover weird, edge case bugs might turn out to be useful. :smile:

justinclift avatar Jun 16 '21 14:06 justinclift

Heh Heh Heh. There's an Ars Technica article about this now too:

   https://arstechnica.com/gadgets/2021/06/raidz-expansion-code-lands-in-openzfs-master/

That's some good promo right there. :wink:

justinclift avatar Jun 16 '21 18:06 justinclift

@louwrentius @teambvd @yorickdowne @mufunyo I think y'all are getting at a few different questions:

  1. ... There are workarounds available that accomplish the same end result in restricted use cases (no snapshots? touch all blocks. plenty of space? zfs send -R).

Can you point to a good reference on doing either of these things? The only way I can think of to "touch all blocks" would be to copy every file through some kind of dumb gzip|gzip pipeline.

owlshrimp avatar Jun 17 '21 03:06 owlshrimp