
Bypass metaslab throttle for removal allocations

sdimitro opened this pull request 3 years ago • 3 comments

Context: We recently had a scenario where a customer with 2x10TB disks at 95+% fragmentation and capacity wanted to migrate their disks to a 2x20TB setup. So they added the 2 new disks and submitted the removal of the first 10TB disk. The removal took far longer than expected (on the order of one to two weeks instead of a couple of days), and once it was done it generated a huge indirect mapping table in RAM (~16GB vs the expected ~1GB).

Root-Cause: The removal code calls metaslab_alloc_dva() to allocate a new block for each evacuating block in the removing device and tries to batch them into 16MB segments. If it can't find such a segment it retries with 8MB, then 4MB, and so on all the way down to 512 bytes.
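For illustration, here is a minimal sketch of that size-halving retry, not the actual OpenZFS removal code; removal_alloc() and try_alloc_segment() are hypothetical placeholder names:

```c
#include <stdint.h>
#include <stdbool.h>

#define SPA_MINBLOCKSIZE	512		/* smallest allocation tried */
#define REMOVAL_MAX_COPY	(16 << 20)	/* start with a 16MB segment */

/* hypothetical helper: returns true if a contiguous run of `size` was found */
bool try_alloc_segment(uint64_t size);

/* halve the requested size until an allocation succeeds or we bottom out */
uint64_t
removal_alloc(void)
{
	uint64_t size;

	for (size = REMOVAL_MAX_COPY; size >= SPA_MINBLOCKSIZE; size >>= 1) {
		if (try_alloc_segment(size))
			return (size);	/* copied a segment of this size */
	}
	return (0);			/* no space found at any size */
}
```

On a heavily fragmented source device, every fallback to a smaller size produces more (and smaller) entries in the indirect mapping, which is what blew up the table described above.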

In our scenario, metaslab_alloc_dva() called from the removal thread would initially pick the new devices but wouldn't allocate from them because of throttling on their metaslab allocation queue depth (see metaslab_group_allocatable()); being new and mostly empty, these devices are favored for most other types of allocations. The removal thread would then look to the old fragmented disk, fail to find any contiguous space, and keep retrying with smaller allocation sizes until it got down to the low-KB range. This generated a lot of small mappings, blowing up the size of the indirect table. It also wasted a lot of CPU while the removal was active, making everything slow.

This patch: Make all allocations coming from the device removal thread bypass the throttle checks. These allocations are not even counted in the metaslab allocation queues anyway, so why check them?

Side-Fix: Allocations with METASLAB_DONT_THROTTLE in their flags are not accounted in the throttle queues, yet they still abide by the throttling rules, which seems wrong. This patch fixes that by checking for the flag in metaslab_group_allocatable(). A quick check of the other places this flag is used suggests the change should not cause issues.
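A simplified sketch of the idea behind the fix, not the actual metaslab.c diff: the struct field, the mga_queue_depth() helper, and the flag value are hypothetical, and the real metaslab_group_allocatable() takes more arguments and applies additional criteria. The point is only that callers the throttle never accounts for should not be rejected by it either:

```c
#include <stdint.h>
#include <stdbool.h>

#define METASLAB_DONT_THROTTLE	0x10	/* illustrative value only */

typedef struct metaslab_group {
	uint64_t mg_max_queue_depth;	/* hypothetical field name */
} metaslab_group_t;

/* hypothetical helper: current depth of the group's allocation queue */
extern uint64_t mga_queue_depth(metaslab_group_t *mg);

static bool
metaslab_group_allocatable(metaslab_group_t *mg, int flags)
{
	/*
	 * Allocations that are never accounted in the throttle
	 * (device removal, METASLAB_DONT_THROTTLE callers) bypass
	 * the queue-depth check entirely.
	 */
	if (flags & METASLAB_DONT_THROTTLE)
		return (true);

	/* normal path: refuse the group if its allocation queue is full */
	return (mga_queue_depth(mg) < mg->mg_max_queue_depth);
}
```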

Signed-off-by: Serapheim Dimitropoulos [email protected]

Testing: I reproduced the issue on a VM on AWS by creating a pool with 2 disks, raising their fragmentation and capacity to 90+% with randwritecomp, and adding two new disks while randwritecomp was still running. Removing one 45GB disk without my bits yielded a 3.2MB mapping in memory with an average copied segment size of ~359KB. Doing the same removal with my bits yielded a 239KB mapping in memory with an average segment size of 4.7MB. (These numbers could have been better, but my new disk wasn't that much bigger, so it started running out of large segments.)

sdimitro avatar Nov 07 '22 21:11 sdimitro

@behlendorf I appreciate the speedy review. It seems there are still some scenarios where we get more mappings than expected, even though this change improves the situation. I'll experiment some more and send an update when I figure out more; until then, let's keep this as a DRAFT.

sdimitro avatar Nov 11 '22 18:11 sdimitro

@behlendorf @ahrens @mmaybee - Updated from DRAFT to a proper PR now that my testing is done.

sdimitro avatar Dec 01 '22 17:12 sdimitro

@amotin

The patch makes sense to me, but maybe only as a workaround. I am wondering whether device removal could benefit from having throttling implemented for it (if that is even possible, considering the large I/O sizes), or whether plain round-robin with this patch is good enough.

This is more of a bug fix and less of a workaround. Currently removal allocations are not accounted in the throttle, yet they respect the throttling logic (the METASLAB_DONT_THROTTLE flag, which has the same problem, is also fixed here). I do agree, though, that more research could be done on whether removal would benefit from throttling.

About the original problem scenario though: once somebody has reached that level of fragmentation, they should without a second thought create a fresh new pool and replicate the data, rather than use removal, which preserves the data fragmentation.

This would indeed be ideal, and should be straightforward for home users or company sysadmins who have full control of their setups. Unfortunately, it is not always practical for products/appliances that build higher-level functionality on top of ZFS, have new data written to them constantly, and strive for no downtime.

And if removal is used, I have a feeling it should be possible to remove a few devices at the same time, or at least prevent allocations from devices that are going to be removed next. I don't remember exactly how, but I remember it was discussed at some point.

This is true. The actual best practice is to passivate (noalloc) all the devices that you are planning to remove, so that each intermediate removal's allocations don't end up on them. Unfortunately, not everyone knows about that practice, nor do users always know exactly when their next device upgrade will be.

sdimitro avatar Dec 01 '22 19:12 sdimitro

@behlendorf This should be good to go

sdimitro avatar Dec 08 '22 21:12 sdimitro

@sdimitro Looking through the test results, several of the builders failed the fault/decompress_fault test case. This isn't something I've seen before, and it appears to have been introduced by this patch. Here are the logs for one example:

http://build.zfsonlinux.org/builders/CentOS%208%20x86_64%20%28TEST%29/builds/10689/steps/shell_4/logs/summary

behlendorf avatar Dec 08 '22 22:12 behlendorf

Well, I take that back. We do appear to have seen this at least once with the master source: http://build.zfsonlinux.org/builders/Fedora%2035%20x86_64%20%28TEST%29/builds/3823. It's just odd that we hit it so many times with this PR; I'll resubmit those builds for good measure.

behlendorf avatar Dec 08 '22 23:12 behlendorf