Add support for anyraid vdevs
Sponsored by: Eshtek, creators of HexOS; Klara, Inc.
Motivation and Context
For industry/commercial use cases, the existing redundancy solutions in ZFS (mirrors and RAIDZ) work great. They provide high-performance, reliable, efficient storage. For enthusiast users, however, they have a drawback. RAIDZ and mirrors treat every drive in the vdev as though it were the size of the smallest drive, so that they can provide their reliability guarantees. If you can afford to buy a new box of drives for your pool, like large-scale enterprise users, that's fine. But if you already have a mix of hard drives of various sizes, and you want to use all of the space they have available while still benefiting from ZFS's reliability and featureset, there isn't currently a great solution for that problem.
Description
The goal of Anyraid is to fill that niche. Anyraid allows devices of mismatched sizes to be combined together into a single top-level vdev. In the current version, Anyraid only supports mirror-type parity, but raidz-type parity is planned for the near future.
Anyraid works by dividing each of the disks that make up the vdev into tiles. These tiles are the same size across all disks within a given anyraid vdev. The size of a tile is 1/64th of the size of the smallest disk present at creation time, or 16GiB, whichever is larger. These tiles are then combined to form the logical vdev that anyraid presents, with sets of tiles from different disks acting as mini-mirrors, allowing the reliability guarantees to be preserved. Tiles are allocated on demand; when a write comes into a part of the logical vdev that doesn't have backing tiles yet, the Anyraid logic picks the nparity + 1 disks with the most unallocated tiles and allocates one tile from each of them. These physical tiles are combined into one logical tile, which is used to store data for that section of the logical vdev.
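As a rough illustration of the sizing rule and the on-demand allocation policy, here is a minimal C sketch. The names and structure are made up for this description; the real logic in the anyraid vdev code is more involved.

```c
/* Illustrative sketch only; not the actual vdev_anyraid.c code. */
#include <stdint.h>
#include <stdio.h>

static uint64_t
tile_size(uint64_t smallest_disk_bytes)
{
	uint64_t min_tile = 16ULL << 30;	/* 16 GiB floor */
	uint64_t sz = smallest_disk_bytes / 64;	/* 1/64th of smallest disk */
	return (sz > min_tile ? sz : min_tile);
}

/*
 * Back one logical tile with a physical tile from each of the
 * nparity + 1 children that have the most unallocated tiles.
 */
static int
allocate_logical_tile(uint64_t *free_tiles, int ndisks, int nparity)
{
	int chosen[256] = { 0 };	/* at most 2^8 children per vdev */

	for (int copy = 0; copy < nparity + 1; copy++) {
		int best = -1;
		for (int d = 0; d < ndisks; d++) {
			if (chosen[d] || free_tiles[d] == 0)
				continue;
			if (best == -1 || free_tiles[d] > free_tiles[best])
				best = d;
		}
		if (best == -1)
			return (-1);	/* too few disks with free tiles */
		chosen[best] = 1;
		free_tiles[best]--;	/* one physical tile consumed */
	}
	return (0);
}

int
main(void)
{
	/* 2 TiB smallest disk -> 32 GiB tiles; 500 GiB -> the 16 GiB floor. */
	printf("%llu GiB\n", (unsigned long long)(tile_size(2ULL << 40) >> 30));
	printf("%llu GiB\n", (unsigned long long)(tile_size(500ULL << 30) >> 30));

	uint64_t free_tiles[3] = { 120, 64, 64 };
	/* Picks disk 0 (most free tiles) and then disk 1 for a 2-way tile. */
	return (allocate_logical_tile(free_tiles, 3, 1));
}
```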
One important note with this design is that we need to understand the mapping from logical offset to tiles (and therefore to actual physical disk locations) in order to read anything from the pool. As a result, we cannot store the mapping in the MOS, since that would create a bootstrap problem. To solve this, we allocate a region at the start of each disk where we store the Anyraid tile map. This region holds 4 copies of all the data necessary to reconstruct the mapping, and the copies are updated in rotating order, like uberblocks. Each disk carries a full copy of all 4 maps, ensuring that as long as any drive's copy survives, the tile map for a given TXG can be read successfully. The size of one copy of the tile map is 64MiB; that size determines the maximum number of tiles an anyraid vdev can have, which is 2^24: up to 2^8 disks, each with up to 2^16 tiles. This does mean that the largest device that can be fully used by an anyraid vdev is 1024 times the size of the smallest disk that was present at vdev creation time. This was considered an acceptable tradeoff, though it is a limit that could be alleviated in the future if needed; the primary difficulty is that either the tile map would need to grow substantially, or logic would need to be added to handle/prevent the tile map filling up.
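The limits above come down to simple arithmetic; here is a quick sketch of the implied numbers (illustrative only, not taken from the on-disk format code):

```c
/* Back-of-the-envelope numbers implied by the tile-map layout above. */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t max_disks = 1ULL << 8;		  /* up to 2^8 disks */
	uint64_t max_tiles_per_disk = 1ULL << 16; /* up to 2^16 tiles/disk */
	uint64_t max_tiles = max_disks * max_tiles_per_disk;	/* 2^24 */

	uint64_t map_copy = 64ULL << 20;	/* 64 MiB per map copy */
	uint64_t map_region = 4 * map_copy;	/* 4 rotating copies */

	/*
	 * A tile is at least 1/64th of the smallest disk, and a disk can
	 * hold at most 2^16 tiles, so the largest fully usable disk is
	 * 2^16 / 64 = 1024 times the smallest disk at creation time.
	 */
	uint64_t ratio = max_tiles_per_disk / 64;

	printf("max tiles per vdev:        %llu\n",
	    (unsigned long long)max_tiles);
	printf("tile map region per disk:  %llu MiB\n",
	    (unsigned long long)(map_region >> 20));
	printf("largest fully usable disk: %llux the smallest\n",
	    (unsigned long long)ratio);
	return (0);
}
```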
Anyraid vdevs support all the operations that normal vdevs do. They can be resilvered, removed, and scrubbed. They also support expansion; new drives can be attached to the anyraid vdev, and their tiles will be used in future allocations. There is currently no support for rebalancing tiles onto new devices, although that is also planned. Vdev contraction is also planned for the future.
New ZDB functionality was added to print out information about the anyraid mapping, to aid in debugging and understanding. A number of tests were also added, and ztest support for the new type of vdev was implemented.
How Has This Been Tested?
In addition to the tests added to the test suite and zloop runs, I also ran many manual tests of unusual configurations to verify that the tile layout behaves correctly. There was also some basic performance testing to verify that nothing was obviously wrong. Performance is not the primary design goal of anyraid, however, so in-depth analysis was not performed.
Types of changes
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Performance enhancement (non-breaking change which improves efficiency)
- [x] Code cleanup (non-breaking change which makes code smaller or more readable)
- [ ] Quality assurance (non-breaking change which makes the code more robust against bugs)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
- [x] Documentation (a change to man pages or other documentation)
Checklist:
- [x] My code follows the OpenZFS code style requirements.
- [x] I have updated the documentation accordingly.
- [x] I have read the contributing document.
- [x] I have added tests to cover my changes.
- [x] I have run the ZFS Test Suite with this change applied.
- [x] All commit messages are properly formatted and contain Signed-off-by.
Overall this is a really nice feature! I haven't looked at the code yet, but did kick the tires a little and have some comments/questions.
- Regarding:
The size of a tile is 1/64th of the size of the smallest disk present at creation time, or 16GiB, whichever is larger.
How did you arrive at the 16GiB min tile size? (forgive me if this is mentioned in the code comments) I ask, since it would be nice to have a smaller tile size to accommodate smaller vdevs (and give more free space, since it's rounded to tile-sized boundaries).
- We should tell the user the minimum anyraid vdev size if they pass too small a vdev. Currently the error is:
$ sudo ./zpool create tank anyraid ./8gb_file1 ./8gb_file2
cannot create 'tank': one or more devices is out of space
- We should document that autoexpand=on|off is ignored by anyraid to mitigate any confusion/ambiguity.
- I was able to create an anyraid1 pool with an anyraid1 special device, which is nice. However, I could not create an anyraid1 pool with a mirror special device, even though they're the same redundancy level (special devices must have same redundancy level as the pool). We should update the checks to allow mirror/raidz/anyraid/anyraidz equivalent redundancy levels with special vdevs.
- This PR uses anyraid, anyraid0, anyraid1, anyraid2 naming for the TLD type. What if we copied the current "mirror"/"raidz" naming convention, like?
anymirror, anymirror0, anymirror1, anymirror2
anyraidz, anyraidz1
That way there's no ambiguity if the anyraid TLD is a mirror or raidz flavor. It also opens the path to anyraidz1, which was mentioned in the Anyraid announcement:
"With ZFS AnyRaid, we will see at least two new layouts added: AnyRaid-Mirror and AnyRaid-Z1. The AnyRaid-Mirror feature will come first, and will allow users to have a pool of more than two disks of varying sizes while ensuring all data is written to two different disks. The AnyRaid-Z1 feature will apply the same concepts of ZFS RAID-Z1, but while supporting mixed size disks."
https://hexos.com/blog/introducing-zfs-anyraid-sponsored-by-eshtek
I also noticed that the anyraid TLD names don't include the parity level. They all just say "anyraid":
anyraid-0 ONLINE 0 0 0
We should have it match the raidz TLD convention where the parity level is included:
raidz1-0 ONLINE 0 0 0
raidz2-0 ONLINE 0 0 0
raidz3-0 ONLINE 0 0 0
- Regarding:
The size of a tile is 1/64th of the size of the smallest disk present at creation time, or 16GiB, whichever is larger.
How did you arrive at the 16GiB min tile size? (forgive me if this is mentioned in the code comments) I ask, since it would be nice to have a smaller tile size to accommodate smaller vdevs (and give more free space, since it's rounded to tile-sized boundaries).
16GiB was selected mostly because that makes the minimum line up with the standard fraction (1/64th) at a 1TiB disk. That's a nice round number, and a pretty reasonable size for "a normal size disk" these days; anything less than 1TiB is definitely on the smaller side. The other effect of this value is that with this tile size, you can have any disk up to 1PiB in size and still be able to use all the space; any disk that holds more than 2^16 tiles can't be fully used.
It is possible to have smaller tile sizes; we do it in the test suite a bunch. There is a tunable, zfs_anyraid_min_tile_size, that controls this.
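To make the numbers concrete (just the arithmetic from the two paragraphs above, as a small sketch):

```c
/* The 1 TiB break-even point and the resulting per-disk ceiling. */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	uint64_t one_tib = 1ULL << 40;
	uint64_t min_tile = 16ULL << 30;	/* default 16 GiB floor */

	/* 1/64th of a 1 TiB disk is exactly 16 GiB, so the floor only
	 * matters when the smallest disk is under 1 TiB. */
	printf("1 TiB / 64 = %llu GiB\n",
	    (unsigned long long)((one_tib / 64) >> 30));

	/* With 16 GiB tiles and at most 2^16 tiles per disk, any disk up
	 * to 1 PiB can be fully used. */
	printf("max fully usable disk: %llu TiB\n",
	    (unsigned long long)((min_tile << 16) >> 40));
	return (0);
}
```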
- We should tell the user the minimum anyraid vdev size if they pass too small a vdev. Currently the error is:
$ sudo ./zpool create tank anyraid ./8gb_file1 ./8gb_file2
cannot create 'tank': one or more devices is out of space
That's fair, we could have a better error message for this case. I can work on that.
- We should document that autoexpand=on|off is ignored by anyraid to mitigate any confusion/ambiguity.
I think autoexpand works like normal? It doesn't affect the tile size or anything, because the tile size is locked in immediately when the vdev is created, but it should affect the disk sizes like normal. Maybe the tile capacity doesn't change automatically? But that's probably a bug, if so. Did you run into this in your testing?
- I was able to create an anyraid1 pool with an anyraid1 special device, which is nice. However, I could not create an anyraid1 pool with a mirror special device, even though they're the same redundancy level (special devices must have same redundancy level as the pool). We should update the checks to allow mirror/raidz/anyraid/anyraidz equivalent redundancy levels with special vdevs.
Interesting, I will investigate why that happened. Those should be able to mix for sure.
- This PR uses anyraid, anyraid0, anyraid1, anyraid2 naming for the TLD type. What if we copied the current "mirror"/"raidz" naming convention, like? anymirror, anymirror0, anymirror1, anymirror2, anyraidz, anyraidz1. That way there's no ambiguity if the anyraid TLD is a mirror or raidz flavor. It also opens the path to anyraidz1, which was mentioned in the Anyraid announcement:
...
I'm open to new naming options. My vague plan was to use anyraidz{1,2,3} for the RAID-Z-style parity when that support is added. But having mirror-parity have a clearer name does probably make sense. I'm open to anymirror; I was also thinking about anyraidm as a possibility.
I also noticed that the anyraid TLD names don't include the parity level. They all just say "anyraid":
anyraid-0 ONLINE 0 0 0
Good point, I will fix that too.
This is a nit, but in the description I believe there is a typo:
"Anyraid works by diving each of the disks that makes up the vdev..."
I believe the intent was for dividing.
- The validation logic will need to be tweaked to allow differing numbers of vdevs per anyraid TLD:
$ truncate -s 30G file1_30g
$ truncate -s 40G file2_40g
$ truncate -s 20G file3_20g
$ truncate -s 35G file4_35g
$ truncate -s 35G file5_35g
$ sudo ./zpool create tank anyraid ./file1_30g ./file2_40g ./file3_20g anyraid ./file4_35g ./file5_35g
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: both 3-way and 2-way anyraid vdevs are present
- The anyraid TLD type string needs checks as well:
$ ./zpool create tank anyraid-this_should_not_work ./file1_30g
$ sudo ./zpool status
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
anyraid0-0 ONLINE 0 0 0
/home/hutter/zfs/file1_30g ONLINE 0 0 0
errors: No known data errors
- Regarding:
This PR uses anyraid, anyraid0, anyraid1, anyraid2 naming for the TLD type. What if we copied the current "mirror"/"raidz" naming convention, like? anymirror, anymirror0, anymirror1, anymirror2, anyraidz, anyraidz1
I'm open to new naming options. My vague plan was to use anyraidz{1,2,3} for the RAID-Z-style parity when that support is added. But having mirror-parity have a clearer name does probably make sense. I'm open to anymirror; I was also thinking about anyraidm as a possibility.
I prefer the anymirror name over anyraidm, just to keep convention with mirror. Same with my preference for the future anyraidz name for the same reasons.
- I don't know if this has anything to do with this PR, but I noticed the rep_dev_size in the JSON was a little weird. Here I create an anyraid pool with 30GB, 40GB, and 20GB vdevs:
$ sudo ./zpool status -j | jq
...
"vdevs": {
"/home/hutter/zfs/file1_30g": {
"name": "/home/hutter/zfs/file1_30g",
"vdev_type": "file",
"guid": "2550367119017510955",
"path": "/home/hutter/zfs/file1_30g",
"class": "normal",
"state": "ONLINE",
"rep_dev_size": "16.3G",
"phys_space": "30G",
...
},
"/home/hutter/zfs/file2_40g": {
"name": "/home/hutter/zfs/file2_40g",
"vdev_type": "file",
"guid": "17589174087940051454",
"path": "/home/hutter/zfs/file2_40g",
"class": "normal",
"state": "ONLINE",
"rep_dev_size": "16.3G",
"phys_space": "40G",
...
},
"/home/hutter/zfs/file3_20g": {
"name": "/home/hutter/zfs/file3_20g",
"vdev_type": "file",
"guid": "6265258539420333029",
"path": "/home/hutter/zfs/file3_20g",
"class": "normal",
"state": "ONLINE",
"rep_dev_size": "261M",
"phys_space": "20G",
...
I'm guessing the first two vdevs report a rep_dev_size of 16.3G due to tile alignment. What I don't get is the 261M value for the 3rd vdev. I would have expected a 16.3GB value there.
- The validation logic will need to be tweaked to allow differing numbers of vdevs per anyraid TLD:
Done, and added a test
- The anyraid TLD type string needs checks as well:
Done, and added a test
- I don't know if this has anything to do with this PR, but I noticed the rep_dev_size in the JSON was a little weird. Here I create an anyraid pool with 30GB, 40GB, and 20GB vdevs:
This one is a little interesting; there is a small bug here. rep_dev_size is the minimum size for a replacement device. Anyraid could in theory allow devices to be replaced by smaller devices, as long as they are still big enough to hold all the tiles the original had. The code to allow that was written on the anyraid side, but the normal vdev replacement logic doesn't really allow for that to happen cleanly, since it involves shrinking the top-level vdev. I'm removing that functionality for now; it could be re-implemented later as a separate feature.
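For intuition only, here is a sketch of the kind of bound a tile-aware replacement check could use. The names are hypothetical and this is not the actual vdev code; as noted above, this path is being removed for now.

```c
#include <stdint.h>

/*
 * Hypothetical sketch: the smallest device that could replace a given
 * anyraid child is bounded by the tiles that child actually holds plus
 * the fixed per-disk overhead (tile-map region and labels), rather than
 * by the size the child was when it was added.
 */
static uint64_t
min_replacement_size(uint64_t allocated_tiles, uint64_t tile_size,
    uint64_t per_disk_overhead)
{
	return (allocated_tiles * tile_size + per_disk_overhead);
}
```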
I did some more testing today looking for edge cases:
- This is the error when creating an anyraid pool with devices that are too small:
$ sudo ./zpool create tank anyraid ./file1 ./file2 ./file3
cannot create 'tank': invalid argument for this pool operation
The error message should match the convention we use for other raid types:
$ sudo ./zpool create tank ./file1 ./file2 ./file3
cannot create 'tank': one or more devices is less than the minimum size (64M)
- I'm unable to create an anyraid with a 1MB min tile size and 100MB disks:
$ cat /sys/module/zfs/parameters/zfs_anyraid_min_tile_size
1048576
$ truncate -s 100M file{1..10}
$ sudo ./zpool create tank anyraid ./file1 ./file2 ./file3
cannot create 'tank': invalid argument for this pool operation
- Using a 1MB zfs_anyraid_min_tile_size, with the following disks:
file1: 100M
file2: 17PB
file3: 17PB
I get this error:
$ sudo ./zpool create tank anyraid ./file1 ./file2 ./file3
cannot create 'tank': one or more anyraid devices cannot store any tiles
When I change file1 to 200MB, I get this error:
$ sudo ./zpool create tank anyraid ./file1 ./file2 ./file3
cannot create 'tank': one or more devices is out of space
I assume these are both because file1 is too small. If so, the error messages should print the smallest disk size allowed for the config.
- When I create this pathologically mismatched pool with all the defaults, it OOMs/crashes my VM:
$ truncate -s 20G file1 && truncate -s 999P file{2..3}
sudo ./zpool create tank anyraid ./file{1..3}
<OOM>
- ZDB is saying my vdev is not an anyraid vdev for some reason:
$ sudo ./zpool status
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
anyraid1-0 ONLINE 0 0 0
/home/hutter/zfs/file1 ONLINE 0 0 0
/home/hutter/zfs/file2 ONLINE 0 0 0
/home/hutter/zfs/file3 ONLINE 0 0 0
errors: No known data errors
$ sudo ./zdb --anyraid-map tank /home/hutter/zfs/file1
AnyRAID tiles:
Not an anyraid vdev: /home/hutter/zfs/file1
- I'm not sure what to make of this output:
$ sudo ./zdb --anyraid-map tank
AnyRAID tiles:
vdev 0 tile_size 4000000000
tiles 1 checkpoint tile 4294967295
---------------- ------------ -------------
tile 0 offset 0000 disk 00
tile 0 offset 0000 disk 01
Is the tile_size value 4 billion bytes, or is it hex? What do these other fields represent?
- Normally I'm not allowed to mix parity levels in the same pool, but anyraid allows it:
$ sudo ./zpool create tank raidz1 ./file{1..4} raidz2 ./file{5..8}
invalid vdev specification
use '-f' to override the following errors:
mismatched replication level: raidz and raidz vdevs with different redundancy, 1 vs. 2 are present
$ sudo ./zpool create tank anyraid1 ./file{1..4} anyraid2 ./file{5..8}
$ sudo ./zpool status
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
anyraid1-0 ONLINE 0 0 0
/home/hutter/zfs/file1 ONLINE 0 0 0
/home/hutter/zfs/file2 ONLINE 0 0 0
/home/hutter/zfs/file3 ONLINE 0 0 0
/home/hutter/zfs/file4 ONLINE 0 0 0
anyraid2-1 ONLINE 0 0 0
/home/hutter/zfs/file5 ONLINE 0 0 0
/home/hutter/zfs/file6 ONLINE 0 0 0
/home/hutter/zfs/file7 ONLINE 0 0 0
/home/hutter/zfs/file8 ONLINE 0 0 0
errors: No known data errors
- The anyraid* string needs to be checked. This is allowed:
$ sudo ./zpool create tank anyraid_hello_world ./file{1..4}
$ sudo ./zpool status
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
anyraid0-0 ONLINE 0 0 0
/home/hutter/zfs/file1 ONLINE 0 0 0
/home/hutter/zfs/file2 ONLINE 0 0 0
/home/hutter/zfs/file3 ONLINE 0 0 0
/home/hutter/zfs/file4 ONLINE 0 0 0
errors: No known data errors
- These vdev props should be documented in man/man7/vdevprops.7:
sudo ./zpool get all tank all-vdevs | grep anyraid
...
/home/hutter/zfs/file4 anyraid_region_capacity 1 -
/home/hutter/zfs/file4 anyraid_region_count 0 -
/home/hutter/zfs/file4 anyraid_region_size 16G -
Also, if by 'region' in these names you're referring to the 'tile', then we should rename these to 'tile' to match convention.
- 'anyraid*' should be flagged as a restricted pool name for new pools:
$ sudo ./zpool create raidz1 ./file1 ./file2
cannot create 'raidz1': name is reserved
pool name may have been omitted
$ sudo ./zpool create anyraid1 ./file1 ./file2
$
I did some more testing today looking for edge cases:
1. This is the error when creating an anyraid pool with devices that are too small:
I can't reproduce this on the latest version; what sizes were file{1,2,3}? If they're below 64M I get the normal error, and if they're below the tile size I get cannot create 'test': one or more anyraid devices cannot store any tiles.
2. I'm unable to create an anyraid with a 1MB min tile size and 100MB disks:
This is expected; storing the anyraid mappings takes 256MB. Add that to the label size and the tile size, and that's the minimum size of an anyraid vdev. Plus, the tile size cannot be smaller than 16MiB. I'll add that to the man page.
That's probably worth discussing, actually. My thinking there is that because a metaslab is capped at the size of a tile (so we don't have IOs that cross tile boundaries), we can't let the metaslabs get too small. On the other hand, we do allow metaslabs to be about 4MiB if you have a min-size device, so perhaps that limitation is too aggressive.
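To put rough numbers on the minimum device size this implies (a sketch; the ~260.5 MiB figure for the map region plus labels is the one quoted later in this thread, and the 16 MiB tile floor is from the paragraph above):

```c
/* Approximate minimum anyraid device size, per the discussion above. */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
	/* tile map region plus labels, ~260.5 MiB */
	uint64_t map_and_labels = (260ULL << 20) + (1ULL << 19);
	uint64_t min_tile = 16ULL << 20;	/* 16 MiB tile-size floor */

	/* A device must hold the map region, the labels, and >= 1 tile. */
	uint64_t min_dev = map_and_labels + min_tile;
	printf("approx. minimum anyraid device: %llu MiB\n",
	    (unsigned long long)(min_dev >> 20));
	return (0);
}
```

That works out to roughly 276 MiB with the default floor, which is consistent with the 100MB files above being rejected.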
3. Using a 1MB zfs_anyraid_min_tile_size, with the following disks:
Same problem as above.
I assume these are both because file1 is too small. If so, the error messages should print the smallest disk size allowed for the config.
Returning a value like this from the anyraid code all the way to userland turns out to be a real pain. And from userland we can't necessarily do the math ourselves, since we don't have easy access to the current value of the min_tile_size. We could refer to that tunable in the message, but now it's starting to get pretty wordy; "one or more devices is smaller than the minimum size (260.5 MiB + the current value of zfs_anyraid_min_tile_size)"? And to get even that we need to know which device was too small (what if the too-small device was part of a non-anyraid vdev?), which is not something we have an easy way to expose.
4. When I create this pathologically mismatched pool with all the defaults, it OOMs/crashes my VM:
Fixed, we weren't actually checking the limits on maximum tiles per vdev, which was causing the problem.
5. ZDB is saying my vdev is not an anyraid vdev for some reason:
Currently, leaf vdevs aren't printed by this code, only the actual anyraid vdevs. I can change it to also accept the leaf vdev though, and print information about the parent.
6. I'm not sure what to make of this output:
Clarified the output for tile size, and I'm having it omit the checkpoint info if there's no checkpoint. That's used during the rollback process to discard newer tiles.
7. Normally I'm not allowed to mix parity levels in the same pool, but anyraid allows it:
Fixed, I thought I had already handled this but that was in a separate branch where I'm working on the next phase.
8. The anyraid* string needs to be checked. This is allowed:
Fixed, as discussed in the earlier feedback.
9. These vdev props should be documented in `man/man7/vdevprops.7`
Done.
Also, if by 'region' in these names you're referring to the 'tile', then we should rename these to 'tile' to match convention.
Done, and fixed a few other places where variable names/prefixes were wrong.
10. 'anyraid*' should be flagged as a restricted pool name for new pools:
Good call, done.
@pcd1193182 I apologize for my tardiness, as I only just found the time to watch the leadership meeting introducing anyraid. This is with regard to the throughput concern in the single-writer case, where in (e.g.) a two-disk-mirror anyraid that writer will only ever be writing to two given tiles [1] on two given disks at any one time. I have a potential solution, though I expect everyone will recoil in horror at it.
What seems to be lost in anyraid is that the toplevel of the pool is able to raid-0 stripe writes across all vdevs, whereas in anyraid you are essentially creating a series of mirror vdevs which are hidden from the toplevel stripe. They must be filled completely in sequential order. It's as if you had a rather naive operator adding a series of normal mirror vdevs to a normal pool, taking action only when each one filled completely to capacity (and who never touches old data).
The suggestion: You would pre-generate the mirror/raidZ mappings slightly in advance, say for a complete "round" [2] at a time. You would then internally in the anyraid borrow the toplevel pool's raid-0 logic [3], striping/swizzling writes coming into the anyraid across each set of mirror/raidZ mappings. This would very much suck if you have very unbalanced disk sizes as in [2] below (which has high contention on single disk C and would probably be no better than existing sequential anyraid) but may be significantly less bad and more balanced for configurations like {4TB, 4TB, 8TB, 8TB} or {4TB, 4TB, 4TB}. In this way, you don't have a series of temporary mirrors (or raidZs), but a series of raid-0 stripes containing balanced sets of overlapping mirrors (or raidZs) with more parallel writing that is more balanced across more disks.
Admittedly, the idea of inserting raid-0 vdevs into the middle of the stack is an extremely unpleasant one. That said, with proper implementation of the mirror or raidz logic by the anyraid my guess(?) is that it's no more hazardous than the existing architecture of anyraid or a normal zpool. The conditions for failure of the array seem the same (to be verified). I expect there will probably be a few funny quirks though. ~~This is also likely far more palatable than my initial idea of anyraid offering up multiple vdevs on a silver platter to the toplevel stripe, which would expose the anyraid's internal scheduling to it and make a colossal mess.~~
[1] 64GB regions, allocated from available disks by anyraid
[2] in the case of 3 disks {A:4TB, B:4TB, C:8TB} one round would be an allocation of a pair of mirror tiles from A+C and then B+C
[3] perhaps it could be borrowed like mirror and raidz logic is borrowed by draid and anyraid, by making it organizationally more like a raid-0 vdev?
@pcd1193182 I apologize for my tardiness, as I only just found the time to watch the leadership meeting introducing anyraid.
No worries, thanks for your feedback!
The suggestion: You would pre-generate the mirror/raidZ mappings slightly in advance, say for a complete "round" [2] at a time.
So, a "round" is not a thing that we have any real concept of inside of anyraid. It's one of those things that's easy to think about in specific cases in your head, but hard to define in the general case in the code. The simplest definition is probably "a round is N allocations that result in at least one allocation to every disk". Which works fine, except that with extreme vdev layouts this can result in you allocating a lot of tiles. Consider a pair of 10T disks and three 1T disks. No tiles will be allocated on the 1T disks until the 10T disks are most of the way full.
"Surely, Paul, this is just a result of the choice of the algorithm for selecting tiles. Wouldn't a different algorithm fix this problem?" you might justifiably ask. Let's say we go by capacity %age first, and then by raw amount of free space. This does result in every disk getting a tile allocated early on... which can result in less efficient space usage in some extreme scenarios (a disk with 10 tiles and 10 disks with one tile should be able to store 10 tiles in parity=1, but with this algorithm stores only 6). Even for more normal cases, like 10T and 1T disks, sure the first round will be just a quick set of tiles from each disk. But the second set will go back to being several from the 10T disks before there are any more from the 1T disks.
Now, it is probably not worth optimizing for extreme layouts, since in practice not many disks are going to store only a few tiles. Even with a "perfect" selection algorithm, though, this only goes so far, as we will discuss. The best we can do, unfortunately, is smooth things out a little bit.
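To make that concrete, here is a toy simulation (illustrative C only; this is not how the selection code is written) of the "lowest allocated percentage first, ties broken by raw free space" idea on the extreme layout above: one disk with 10 tiles plus ten one-tile disks, parity = 1.

```c
/* Toy simulation of the percentage-first tile selection idea. */
#include <stdio.h>

#define NDISKS	11

int
main(void)
{
	int cap[NDISKS], used[NDISKS];
	int logical = 0;

	cap[0] = 10;		/* one disk with 10 tiles */
	for (int d = 1; d < NDISKS; d++)
		cap[d] = 1;	/* ten disks with 1 tile each */
	for (int d = 0; d < NDISKS; d++)
		used[d] = 0;

	for (;;) {
		int pick[2] = { -1, -1 };
		for (int copy = 0; copy < 2; copy++) {	/* parity = 1 */
			int best = -1;
			for (int d = 0; d < NDISKS; d++) {
				if (d == pick[0] || used[d] == cap[d])
					continue;
				if (best == -1)
					best = d;
				/* lower allocated fraction wins... */
				else if (used[d] * cap[best] <
				    used[best] * cap[d])
					best = d;
				/* ...ties broken by more raw free tiles */
				else if (used[d] * cap[best] ==
				    used[best] * cap[d] &&
				    cap[d] - used[d] > cap[best] - used[best])
					best = d;
			}
			pick[copy] = best;
		}
		if (pick[0] == -1 || pick[1] == -1)
			break;	/* fewer than two disks with free tiles */
		used[pick[0]]++;
		used[pick[1]]++;
		logical++;
	}

	printf("logical tiles allocated: %d\n", logical);
	return (0);
}
```

It prints 6, whereas pairing each of the big disk's 10 tiles with a different one-tile disk would have stored 10.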
You would then internally in the anyraid borrow the toplevel pool's raid-0 logic
To dive into the internals a bit here, there is no "raid-0 logic" in the vdev code. There is no raid-0 top-level vdev to borrow from, because there are no raid-0 top-level vdevs. The top-level vdevs are actually combined together in the metaslab code, where we have the rotors in the metaslab classes that are used to move allocations between top-level vdevs, and the allocation throttles and queues that control how much goes to each one. This is probably where we'd want to hook in.
striping/swizzling writes coming into the anyraid across each set of mirror/raidZ mappings. This would very much suck if you have very unbalanced disk sizes ... but may be significantly less bad and more balanced for configurations like ...
It's worth noting that because of the way that allocators work in ZFS, you don't actually just write to a single tile at a time. Different metaslabs get grabbed by different allocators, and those can and will be on different tiles (especially if we increase the size of metaslabs, something we've been discussing for quite a while now and that seems overdue). Writes are distributed across those roughly evenly, so writes will spread to different tiles. And so if you have relatively even disk sizes, everything is basically already going to work out for you: tiles will mostly spread out nicely, you'll be distributing writes across your disks, and later reads will do the same.
The problem mostly only appears with very different disk sizes, and this is where we run into the fundamental problem that underlies all of this: If a disk has 10x as much space as another, we need to send 10x as many writes to it to fill it. So unless that disk is 10x as fast as the other one, it will be the performance bottleneck. We can smooth things out a little by trying to write a little bit to all the other disks while we write to that disk, but all that does is remove the lumps from the performance curve. Which is not a bad thing to do! But no matter how you slice it, if you have disks with very different sizes, you are going to have performance problems where the big disks bottleneck you.
Admittedly, the idea of inserting raid-0 vdevs into the middle of the stack is extremely unpleasant one.
Especially when you take into account that the raid-0 code (to the extent that it exists) lives at the metaslab level. This is ultimately why I didn't implement something like this. It does have a benefit, and would probably result in improved user experience, but the code when I sketched it was very unpleasant.
And one nice thing about this idea and anyraid is that it doesn't have to be part of the initial implementation. No part of the proposed idea depends on anything about the ondisk format or the fundamentals of the design, which means that it can be iterated on. If performance proves to be a problem for end users, there can be another patch that tries out this idea, or a related one. But by keeping this PR smaller and more focused on just the vdev architecture itself, we can keep the review process more focused and hopefully get things integrated more efficiently.
The initial goal for anyraid is correctness and space maximization. Performance is secondary, but it is also separable, and can be iterated on later or by others who are focused on it.
~This is also likely far more palatable than my initial idea of anyraid offering up multiple vdevs on a silver platter to the toplevel stripe, which would expose the anyraid's internal scheduling to it and make a colossal mess.~
Unfortunately, that idea is basically this idea, because of the details of ZFS's internals :)
So, a "round" is not a thing that we have any real concept of inside of anyraid. It's one of those things that's easy to think about in specific cases in your head, but hard to define in the general case in the code. The simplest definition is probably "a round is N allocations that result in at least one allocation to every disk". Which works fine, except that with extreme vdev layouts this can result in you allocating a lot of tiles. Consider a pair of 10T disks and three 1T disks. No tiles will be allocated on the 1T disks until the 10T disks are most of the way full.
I suppose one could cap the maximum number of allocations opened at once (at the cost of reduced balance) but yeah the rest of this pretty much falls apart from there. Especially since I was under the impression the toplevel raid-0 metaphor was a lot more literal than it clearly actually is.
The initial goal for anyraid is correctness and space maximization. Performance is secondary, but it is also separable, and can be iterated on later or by others who are focused on it.
Separability is a wonderful thing. :)
- It seems anyraid_checkpoint.ksh makes my VM run out of disk space (when run in sequence as ./scripts/zfs-tests.sh -T anyraid). Here are its vdev files when it runs out of disk space:
$ du -hd 1 /var/tmp/testdir/
16G /var/tmp/testdir/sparse_files
16G /var/tmp/testdir/
I think it's due to the same loopback devices being used for multiple tests, with the data never being "freed". This makes it work for me (and may help other tests):
diff --git a/tests/zfs-tests/include/libtest.shlib b/tests/zfs-tests/include/libtest.shlib
index 85fd1869e..a8cd4c5a9 100644
--- a/tests/zfs-tests/include/libtest.shlib
+++ b/tests/zfs-tests/include/libtest.shlib
@@ -789,6 +789,23 @@ function assert
(($@)) || log_fail "$@"
}
+function get_file_size
+{
+ typeset filename="$1"
+
+ if is_linux; then
+ if [ -b "$filename" ] ; then
+ filesize=$(blockdev --getsize64 $filename)
+ else
+ filesize=$(stat -c %s $filename)
+ fi
+ else
+ filesize=$(stat -s $filename | awk '{print $8}' | grep -o '[0-9]\+')
+ fi
+
+ echo $filesize
+}
+
#
# Function to format partition size of a disk
# Given a disk cxtxdx reduces all partitions
@@ -1599,6 +1616,15 @@ function create_pool #pool devs_list
if is_global_zone ; then
[[ -d /$pool ]] && rm -rf /$pool
+
+ for vdev in "$@" ; do
+ if [[ "$vdev" =~ "loop" ]] ; then
+ # If the device is a loopback, remove previously
+ # allocated data.
+ punch_hole 0 $(get_file_size /dev/$vdev) /dev/$vdev
+ fi
+ done
+
log_must zpool create -f $pool $@
fi
diff --git a/tests/zfs-tests/tests/functional/direct/dio.kshlib b/tests/zfs-tests/tests/functional/direct/dio.kshlib
index 33564ccc7..c8a6e5c00 100644
--- a/tests/zfs-tests/tests/functional/direct/dio.kshlib
+++ b/tests/zfs-tests/tests/functional/direct/dio.kshlib
@@ -261,19 +261,6 @@ function check_read # pool file bs count skip flags buf_rd dio_rd
fi
}
-function get_file_size
-{
- typeset filename="$1"
-
- if is_linux; then
- filesize=$(stat -c %s $filename)
- else
- filesize=$(stat -s $filename | awk '{print $8}' | grep -o '[0-9]\+')
- fi
-
- echo $filesize
-}
-
function do_truncate_reduce
{
typeset filename=$1
- I'm reliably hitting assertions running anyraid_special_vdev_002_pos.ksh on my VM (when run in sequence as ./scripts/zfs-tests.sh -T anyraid). I've run it three times and saw two unique errors:
I saw this failure twice:
[ 1821.676037] VERIFY3U(msp->ms_weight, ==, weight) failed (558446353793941505 == 504403158265495553)
[ 1821.676736] PANIC at metaslab.c:2379:metaslab_verify_weight_and_frag()
[ 1821.677222] Showing stack for process 46410
[ 1821.677224] CPU: 1 UID: 0 PID: 46410 Comm: txg_sync Tainted: P OE 6.16.11-200.fc42.x86_64 #1 PREEMPT(lazy)
[ 1821.677226] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 1821.677227] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-4.module+el8.9.0+19570+14a90618 04/01/2014
[ 1821.677228] Call Trace:
[ 1821.677230] <TASK>
[ 1821.677236] dump_stack_lvl+0x5d/0x80
[ 1821.678171] spl_panic+0xf5/0x11a [spl]
[ 1821.678179] ? metaslab_segment_weight+0xc4/0x380 [zfs]
[ 1821.678521] ? metaslab_weight+0x6e/0x110 [zfs]
[ 1821.678659] metaslab_flush+0x10b/0x6f0 [zfs]
[ 1821.678798] spa_flush_metaslabs+0x20e/0x3a0 [zfs]
[ 1821.678928] spa_sync_iterate_to_convergence+0x166/0x430 [zfs]
[ 1821.679061] spa_sync+0x347/0x970 [zfs]
[ 1821.679196] txg_sync_thread+0x2f0/0x470 [zfs]
[ 1821.679323] ? __pfx_txg_sync_thread+0x10/0x10 [zfs]
[ 1821.679446] ? __pfx_thread_generic_wrapper+0x10/0x10 [spl]
[ 1821.679453] thread_generic_wrapper+0x67/0xb0 [spl]
[ 1821.679458] kthread+0xfc/0x240
[ 1821.679505] ? __pfx_kthread+0x10/0x10
[ 1821.679506] ret_from_fork+0xf1/0x110
[ 1821.679528] ? __pfx_kthread+0x10/0x10
[ 1821.679529] ret_from_fork_asm+0x1a/0x30
[ 1821.679533] </TASK>
Failed here:
/home/hutter/zfs/.libs/zpool export testpool
...
[<0>] cv_wait_common+0xea/0x2c0 [spl]
[<0>] txg_wait_synced_flags+0x148/0x300 [zfs]
[<0>] txg_wait_synced+0x10/0x40 [zfs]
[<0>] spa_export_common+0x914/0xa50 [zfs]
[<0>] zfsdev_ioctl_common+0x60b/0x860 [zfs]
[<0>] zfsdev_ioctl+0x53/0xe0 [zfs]
[<0>] __x64_sys_ioctl+0x94/0xe0
[<0>] do_syscall_64+0x7e/0x250
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
And this failure once:
[ 532.893642] VERIFY3U(zio->io_type, ==, ZIO_TYPE_WRITE) failed (1 == 2)
[ 532.894208] PANIC at vdev_anyraid.c:1046:vdev_anyraid_io_start()
[ 532.894695] Showing stack for process 37865
[ 532.894697] CPU: 5 UID: 0 PID: 37865 Comm: dmu_objset_find Tainted: P OE 6.16.11-200.fc42.x86_64 #1 PREEMPT(lazy)
[ 532.894699] Tainted: [P]=PROPRIETARY_MODULE, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
[ 532.894700] Hardware name: Red Hat KVM/RHEL-AV, BIOS 1.16.0-4.module+el8.9.0+19570+14a90618 04/01/2014
[ 532.894701] Call Trace:
[ 532.894702] <TASK>
[ 532.894705] dump_stack_lvl+0x5d/0x80
[ 532.894710] spl_panic+0xf5/0x11a [spl]
[ 532.894718] ? spl_kmem_cache_alloc+0xa5/0x2d0 [spl]
[ 532.894722] vdev_anyraid_io_start+0x579/0x590 [zfs]
[ 532.894854] ? taskq_init_ent+0x3c/0x80 [spl]
[ 532.894859] ? zio_create+0x4f4/0x940 [zfs]
[ 532.894954] zio_vdev_io_start+0x1d7/0x620 [zfs]
[ 532.895047] zio_nowait+0x141/0x3a0 [zfs]
[ 532.895147] vdev_mirror_io_start_impl+0x16d/0x240 [zfs]
[ 532.895250] vdev_mirror_io_start+0x34/0xa0 [zfs]
[ 532.895348] zio_vdev_io_start+0x4b4/0x620 [zfs]
[ 532.895446] ? _raw_spin_unlock+0xe/0x30
[ 532.895449] ? tsd_hash_search+0x93/0xc0 [spl]
[ 532.895454] zio_wait+0x16a/0x5a0 [zfs]
[ 532.895550] arc_read+0x1105/0x2480 [zfs]
[ 532.895644] zil_read_log_block.isra.0+0xbf/0x3b0 [zfs]
[ 532.895740] ? spl_kmem_cache_free+0x163/0x2c0 [spl]
[ 532.895744] ? _raw_spin_unlock+0xe/0x30
[ 532.895747] ? dbuf_hash_remove.constprop.0+0x197/0x340 [zfs]
[ 532.895843] zil_parse+0x274/0x680 [zfs]
[ 532.895938] ? __pfx_zil_claim_log_record+0x10/0x10 [zfs]
[ 532.896032] ? __pfx_zil_claim_log_block+0x10/0x10 [zfs]
[ 532.896123] ? _raw_spin_unlock+0xe/0x30
[ 532.896124] ? dnode_create+0x1bb/0x330 [zfs]
[ 532.896230] ? __cv_init+0x6e/0x180 [spl]
[ 532.896234] ? dnode_verify+0x83/0x690 [zfs]
[ 532.896335] ? dnode_special_open+0x4b/0x90 [zfs]
[ 532.896436] ? rrw_exit+0xc4/0x2f0 [zfs]
[ 532.896546] ? _raw_spin_unlock+0xe/0x30
[ 532.896547] ? spa_config_enter_impl.isra.0+0x123/0x270 [zfs]
[ 532.896650] ? _raw_spin_unlock+0xe/0x30
[ 532.896651] ? spa_config_exit+0xe4/0x1d0 [zfs]
[ 532.896753] zil_check_log_chain+0x116/0x1f0 [zfs]
[ 532.896846] dmu_objset_find_dp_impl+0x140/0x520 [zfs]
[ 532.896947] dmu_objset_find_dp_cb+0x29/0x40 [zfs]
[ 532.897046] taskq_thread+0x390/0x910 [spl]
[ 532.897052] ? __pfx_default_wake_function+0x10/0x10
[ 532.897055] ? __pfx_taskq_thread+0x10/0x10 [spl]
[ 532.897059] kthread+0xfc/0x240
[ 532.897062] ? __pfx_kthread+0x10/0x10
[ 532.897063] ret_from_fork+0xf1/0x110
[ 532.897066] ? __pfx_kthread+0x10/0x10
[ 532.897067] ret_from_fork_asm+0x1a/0x30
[ 532.897070] </TASK>
Failed here:
/home/hutter/zfs/.libs/zpool import -d /var/tmp/testdir/sparse_files testpool
...
[<0>] taskq_wait+0xa7/0x120 [spl]
[<0>] dmu_objset_find_dp+0x16b/0x230 [zfs]
[<0>] spa_load_impl.constprop.0+0x6a7/0xab0 [zfs]
[<0>] spa_load+0x70/0x120 [zfs]
[<0>] spa_load_best+0x54/0x350 [zfs]
[<0>] spa_import+0x2bb/0x800 [zfs]
[<0>] zfs_ioc_pool_import+0x14a/0x160 [zfs]
[<0>] zfsdev_ioctl_common+0x60b/0x860 [zfs]
[<0>] zfsdev_ioctl+0x53/0xe0 [zfs]
[<0>] __x64_sys_ioctl+0x94/0xe0
[<0>] do_syscall_64+0x7e/0x250
[<0>] entry_SYSCALL_64_after_hwframe+0x76/0x7e
I'm running these tests using the PR code on top of 3a55e76b84c35a26fbfba098bdadc541614ae71d.
- It seems anyraid_checkpoint.ksh makes my VM run out of disk space (when run in sequence as ./scripts/zfs-tests.sh -T anyraid). Here are its vdev files when it runs out of disk space:
Done, I made this change and another (use punch_hole instead of writing zeroes in the clean_mirror tests, also helps with this problem).
- I'm reliably hitting assertions running anyraid_special_vdev_002_pos.ksh on my VM (when run in sequence as ./scripts/zfs-tests.sh -T anyraid). I've run it three times and saw two unique errors:
I don't seem to be able to reproduce this one, but I'll keep trying.
This is probably worthy of a whole other thread, but maybe I'll mention the idea here.
Being as anyraid essentially creates many mirror vdevs by tiling them across a drive's available space, could it some day be combined with something like vdev removal as the basis of an in-place rewrite mechanism? Something that would progressively evict data from/rewrite each mirror and its tiles one by one.
It might allow for fairly significant changes in pool layout or vdevs or parity or other parameters without needing to add an entire vdev of disks and then remove the old vdev(s). You could do a more in-place operation at the tile level on the same set of disks as long as they have free tiles.
The tricky bit I guess would be how to select the right toplevel allocations/data for rewriting, such that a specific tile gets emptied out. (maybe you don't care which tile, as long as you end up with a whole tile being free.) Certainly easier to tolerate more fragmentation/wait longer for tiles to empty out if there's more free tiles on the drives.
P.S. would it make sense to make tile-ified versions of other vdev types (keeping the same data layout and behaviors) or just implement them as part of anyraid?
Being as anyraid essentially creates many mirror vdevs by tiling them across a drive's available space, could it some day be combined with something like vdev removal as the basis of an in-place rewrite mechanism? Something that would progressively evict data from/rewrite each mirror and its tiles one by one.
It might allow for fairly significant changes in pool layout or vdevs or parity or other parameters without needing to add an entire vdev of disks and then remove the old vdev(s). You could do a more in-place operation at the tile level on the same set of disks as long as they have free tiles.
The big restriction, as with all of this stuff, is that you can't make any changes that would change the size of the data (or anything else about it that is stored in the BP). Something like changing from mirror to raidz or adding raidz parity would require not just moving the data's physical location but its logical location and size, so that previously adjacent blocks don't start overlapping with each other. And once you're changing the logical offset or size, that requires BP rewrite.
You could do things like add and remove mirror parity, but you can already do that with attach/detach. You can do rebalance and contraction (moving whole tiles around), and indeed those are planned for phase 2 of Anyraid, as I discussed in my dev summit presentation.
The tricky bit I guess would be how to select the right toplevel allocations/data for rewriting, such that a specific tile gets emptied out. (maybe you don't care which tile, as long as you end up with a whole tile being free.) Certainly easier to tolerate more fragmentation/wait longer for tiles to empty out if there's more free tiles on the drives.
If the data is rewritable (i.e. not part of a snapshot or anything), then you can just disable the metaslabs that correspond to the tiles you want to empty and then rewrite all the data using something similar to zfs rewrite. What gets complicated is when you want to move data that can't be rewritten; then you need something like device removal, which doesn't (in its current shape) play nicely with emptying out tiles. It is definitely achievable though.
P.S. would it make sense to make tile-ified versions of other vdev types (keeping the same data layout and behaviors) or just implement them as part of anyraid?
anyraidz is currently in progress; we also briefly discussed anydraid at the dev summit. There are no plans for that at present, but it should be conceptually similar and not that much more work than anyraidz.
I think I understood your idea, but if I missed something let me know!
Being as anyraid essentially creates many mirror vdevs by tiling them across a drive's available space, could it some day be combined with something like vdev removal as the basis of an in-place rewrite mechanism? Something that would progressively evict data from/rewrite each mirror and its tiles one by one. It might allow for fairly significant changes in pool layout or vdevs or parity or other parameters without needing to add an entire vdev of disks and then remove the old vdev(s). You could do a more in-place operation at the tile level on the same set of disks as long as they have free tiles.
The big restriction, as with all of this stuff, is that you can't make any changes that would change the size of the data (or anything else about it that is stored in the BP). Something like changing from mirror to raidz or adding raidz parity would require not just moving the data's physical location but its logical location and size, so that previously adjacent blocks don't start overlapping with each other. And once you're changing the logical offset or size, that requires BP rewrite.
You could do things like add and remove mirror parity, but you can already do that with attach/detach. You can do rebalance and contraction (moving whole tiles around), and indeed those are planned for phase 2 of Anyraid, as I discussed in my dev summit presentation.
Explicitly hoping to avoid BP rewrite, see below.
The tricky bit I guess would be how to select the right toplevel allocations/data for rewriting, such that a specific tile gets emptied out. (maybe you don't care which tile, as long as you end up with a whole tile being free.) Certainly easier to tolerate more fragmentation/wait longer for tiles to empty out if there's more free tiles on the drives.
If the data is rewritable (i.e. not part of a snapshot or anything), then you can just disable the metaslabs that correspond to the tiles you want to empty and then rewrite all the data using something similar to
zfs rewrite. What gets complicated is when you want to move data that can't be rewritten; then you need something like device removal, which doesn't (in its current shape) play nicely with emptying out tiles. It is definitely achievable though.
Disabling the slabs and performing something like a ZFS rewrite was the direction I was thinking in.
To make a crude example, imagine you had an anyraid of two-wide mirrors, 2/5ths full, with no snapshots. Then you disabled all the slabs and triggered a ZFS rewrite, but you had flipped a switch that said all new tile allocations would go into three-wide mirrors. Free tiles are allocated into this new vdev type, and the newly created metaslabs receive the data being rewritten. When you're done, the tiles in the 2/5ths that held the data in 2-wide mirrors are empty, and the remaining 3/5ths hold the data in three-wide mirrors.* Then you can do away with the now-completely-empty tiles in the first 2/5ths and their associated mirrors and metaslabs and so on.
Basically the heart of the idea is that because tiles are significantly smaller than a disk, you could theoretically have multiple different kinds of tile coexist on a disk. Then, by disabling allocations from one kind of tile and allowing new tiles of another kind, and rewriting the data, you could (assuming you have enough free tiles) do a transformation of the data via "zfs rewrite" or something like it between any two states, without BP rewrite. And from the outside this looks like an in-place transformation of the data, even though you're doing it whole tiles at a time, because nobody had to physically add or remove any disks.
The catch, as I see it, is you need to either know which tiles/metaslabs to disable given where you are in the zfs rewrite's progress, or you need to steer the zfs rewrite based on what tiles/slabs you're currently trying to empty out. This is needed if you don't have enough free space to do a full zfs rewrite of all the data in the array without reclaiming any empty tiles/metaslabs/etc until the end. If it does work, then you (theoretically) only need a few free tiles on each disk. You can allocate those into the new mirror or other vdev type, fill them with data, free the tiles that used to hold that data, and then reuse the newly freed tiles. (lather, rinse, repeat until all data has been rewritten)
Hopefully I've managed to make this at least a little bit clearer.
*(let's just assume for simplicity that there were enough disks in the anyraid to make this level of redundancy possible. this would have to be checked before we start.)
If it's not possible to steer a zfs rewrite so that it empties out whole tiles, I suppose what you could do is find some way to "know" you're doing this kind of zfs rewrite* and ignore any attempts to rewrite data in tiles/metaslabs you're not currently trying to rewrite. Then just keep issuing entire zfs rewrites over and over, until eventually that set of tiles/metaslabs is empty and move on to the next set that needs to be rewritten. The number of tiles/metaslabs in each "allowed to rewrite" set could be sized to the amount of free space. More free space means a more efficient process.
*or, if possible, know that a particular operation originated from a rewrite
I suppose there's a tile-less rhyming variant of this procedure where you add a three-wide mirror to a pool with all the data in a two-wide mirror, disable the two-wide mirror, rewrite all the data, and then remove the (now empty) three-wide mirror.
I suppose there's a tile-less rhyming variant of this procedure where you add a three-wide mirror to a pool with all the data in a two-wide mirror, disable the two-wide mirror, rewrite all the data, and then remove the (now empty) three-wide mirror.
All of what you've discussed is possible, but is ultimately more complicated than it needs to be for the specific case of mirror-to-mirror conversion. For that conversion, you don't actually need to do any rewriting; you could just have a flag in the tile mapping that says "this tile is 3-way mirrored, this tile is 2-way mirrored", add tiles to your existing 2-way mirrors, and then use resilver to write the data into them. No rewriting necessary. For mirror-to-not-mirror conversion, this does require ZFS rewrite, since the data actually takes up more logical space when written for raidz.
Basically the heart of the idea is that because tiles are significantly smaller than a disk, you could theoretically have multiple different kinds of tile coexist on a disk. Then, by disabling allocations from one kind of tile and allowing new tiles of another kind, and rewriting the data, you could (assuming you have enough free tiles) do a transformation of the data via "zfs rewrite" or something like it between any two states, without BP rewrite. And from the outside this looks like an in-place transformation of the data, even though you're doing it whole tiles at a time, because nobody had to physically add or remove any disks.
You could... but I don't know if you should. Given the restrictions (it wouldn't work on anything that's been snapshotted) and the limited use case (You want to move from mirror to raidz or vice versa but don't want to add any additional disks to store the new data format), it doesn't seem like a high priority feature. It would be possible to add in the future, but it would have a bunch of complexity in the code (you now need to have a bunch more logic be conditional on which specific tile you're operating on, rather than storing that data at the vdev level), and I don't really expect there to be that much demand for it.
But that said, it's definitely possible: Add the ability to store multiple types of tile at once, and then add coordination that disables some metaslabs, tells the vdev that all new tiles should be a different format, does a zfs rewrite on all of your data, and then deletes the now-free tiles.
I suppose there's a tile-less rhyming variant of this procedure where you add a three-wide mirror to a pool with all the data in a two-wide mirror, disable the two-wide mirror, rewrite all the data, and then remove the (now empty) three-wide mirror.
All of what you've discussed is possible, but is ultimately more complicated than it needs to be for the specific case of mirror-to-mirror conversion. For that conversion, you don't actually need to do any rewriting; you could just have a flag in the tile mapping that says "this tile is 3-way mirrored, this tile is 2-way mirrored", add tiles to your existing 2-way mirrors, and then use resilver to write the data into them. No rewriting necessary. For mirror-to-not-mirror conversion, this does require ZFS rewrite, since the data actually takes up more logical space when written for raidz.
Basically the heart of the idea is that because tiles are significantly smaller than a disk, you could theoretically have multiple different kinds of tile coexist on a disk. Then, by disabling allocations from one kind of tile and allowing new tiles of another kind, and rewriting the data, you could (assuming you have enough free tiles) do a transformation of the data via "zfs rewrite" or something like it between any two states, without BP rewrite. And from the outside this looks like an in-place transformation of the data, even though you're doing it whole tiles at a time, because nobody had to physically add or remove any disks.
You could... but I don't know if you should. Given the restrictions (it wouldn't work on anything that's been snapshotted) and the limited use case (You want to move from mirror to raidz or vice versa but don't want to add any additional disks to store the new data format), it doesn't seem like a high priority feature. It would be possible to add in the future, but it would have a bunch of complexity in the code (you now need to have a bunch more logic be conditional on which specific tile you're operating on, rather than storing that data at the vdev level), and I don't really expect there to be that much demand for it.
But that said, it's definitely possible: Add the ability to store multiple types of tile at once, and then add coordination that disables some metaslabs, tells the vdev that all new tiles should be a different format, does a zfs rewrite on all of your data, and then deletes the now-free tiles.
Thanks for the detailed consideration!
I agree there doesn't seem to be any immediate demand based on current pain points, but in the much longer term it would make ZFS considerably more flexible (historically the most persistent complaint from those coming from other CoW filesystems). There's growing competition on this front from projects like bcachefs, which can completely rework the structure and composition of an array with the upcoming rebalance V2 ("reconcile") [1]. Now might not be the time for this work, but I think it or work like it is worth thinking about.
P.S. I didn't want to explicitly mention transformations like mirror-to-raid/raid-to-mirror (since raidz isn't currently supported by anyraid), but yes indeed that would be the kind of complete transformation that would become possible.
[1] https://www.phoronix.com/news/Bcachefs-Reconcile-Ready