On-demand log-spacemap flush; `zpool condense` command
[Sponsors: Klara, Inc., Wasabi Technology, Inc.]
Motivation and Context
Normally, log spacemaps are flushed out to the metaslabs when the pool is exported. For large logs, this can lead to export taking an inordinate amount of time.
This PR adds an on-demand variant of the log spacemap flush, and a zpool condense command to trigger it. With it, an operator can request that log spacemaps be flushed ahead of time, so that relatively little work remains to be done at export time.
Description
There are two halves to this.
First, we add a "mode" to the existing "flushall" behaviour in spa_flush_metaslabs(). The traditional behaviour is now "export mode", and flushes all logs. Some new functions are added to start and stop the flush with a given mode. Then we add a "request" mode, for use by the operator. This follows the same logic of walking the logs and flushing them out, but skips any that were modified after the flush request was made.
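To make the "request" mode's skip behaviour concrete, here is a minimal sketch of the decision it describes: in export mode everything is flushed, while in request mode any log dirtied after the request was made is skipped. All type and function names here are hypothetical, not the actual OpenZFS identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flush modes, mirroring the two behaviours described above. */
typedef enum {
	FLUSH_MODE_EXPORT,	/* export: flush every log unconditionally */
	FLUSH_MODE_REQUEST	/* operator request: skip recently-dirtied logs */
} flush_mode_t;

/*
 * Decide whether one log spacemap should be flushed. In request mode,
 * logs modified after the request txg are skipped, so the flush can
 * finish instead of chasing ongoing pool activity.
 */
static bool
should_flush(flush_mode_t mode, uint64_t log_dirty_txg, uint64_t request_txg)
{
	if (mode == FLUSH_MODE_EXPORT)
		return (true);
	return (log_dirty_txg <= request_txg);
}
```

The point of the skip is convergence: a request-mode walk only ever has the backlog that existed at request time, so it terminates even on a busy pool.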
The second part is the addition of the zpool condense command, and support library and ioctl additions. This takes a -t <target> parameter, which is the "thing" to condense, flush, garbage-collect or otherwise accelerate background processing for. It's designed so it could be wired up to any similar background process in the future. In particular, I had the dedup log in mind while putting it together.
All the trimmings you'd expect are there. Condense operations can be cancelled, restoring the flushing behaviour to its original pace or schedule. They can be waited on, via condense -w or wait -t condense. The latter combines all condense targets into one signal value, theoretically allowing multiple things to be condensed at the same time and waited on until they're all finished. Without a second or maybe even a third target, it's unclear to me whether this is what users will expect, but I don't think it's far off, and it can be changed when the next thing gets hooked up.
I've included a kstat exposing the pool's "unflushed" counters. We used this in our initial investigations. I thought it would be useful for the ZTS test, but it ended up being too difficult to control reliably. So nothing here uses it directly, but it doesn't hurt anything to have it there and may help someone, so why not.
How Has This Been Tested?
ztest gets support, and has had many tens of runs without issue. A ZTS test has been added, and the entire suite run to successful completion.
Types of changes
- [ ] Bug fix (non-breaking change which fixes an issue)
- [x] New feature (non-breaking change which adds functionality)
- [ ] Performance enhancement (non-breaking change which improves efficiency)
- [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
- [ ] Breaking change (fix or feature that would cause existing functionality to change)
- [x] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
- [x] Documentation (a change to man pages or other documentation)
Checklist:
- [x] My code follows the OpenZFS code style requirements.
- [x] I have updated the documentation accordingly.
- [x] I have read the contributing document.
- [x] I have added tests to cover my changes.
- [x] I have run the ZFS Test Suite with this change applied.
- [x] All commit messages are properly formatted and contain `Signed-off-by`.
Maybe I've missed something, but if the pool is idle, what will keep transaction groups moving so that more metaslabs get flushed?
Also, if the pool is idle, will it flush only 5 metaslabs per transaction group? I worry about the number of new dirty metaslabs/transactions it may produce until it finally converges.
@amotin mm, you may be right. It didn't come up in testing, but we hadn't gone out of our way to stop pool activity. (Also I wrote this last year, so probably didn't know about this at the time!)
I'll study it and post an update soon, thanks!
Right, I think I've swapped back in everything I need.
So yes, you're right: when the pool is idle, nothing is pushing things along (the same is true for the dedup log, incidentally). The 5s timeout will see some flushed out, but that's all. This is existing behaviour, so I'm ok with it, I think.
So it seems to me that there are two questions.
If the operator has requested a spacemap log flush, should we push the sync along a bit? I think it's reasonable to say yes, in theory. It's a similar idea to the dsl_scan_active() call in txg_sync_thread().
Then, what should the amount be? I forget why we chose a minimum of 5; probably it was just a number that was easy to see vs 1. Of course, we should flush more the quieter things are. Is that just a much higher minimum? Maybe a percentage of the total, or of the amount dirty? Or is it more like some larger minimum plus the inverse of the change rate, so we always flush a decent amount, and more when there's room to do it?
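One way to picture the "larger minimum plus inverse of the change rate" idea is a small heuristic like the one below. Everything here is illustrative (the constant, the function, and its inputs are all made up for this sketch, not OpenZFS code): a fixed floor well above the current 5, plus a bonus that grows as the pool gets quieter, capped at the remaining backlog.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical floor, deliberately much larger than the current 5. */
#define	CONDENSE_FLUSH_MIN	64

/*
 * How many metaslabs to flush this txg during a requested condense:
 * the quieter the pool (fewer metaslabs dirtied last txg), the larger
 * the share of the unflushed backlog we take, never exceeding it.
 */
static uint64_t
condense_flush_count(uint64_t unflushed, uint64_t dirtied_last_txg)
{
	uint64_t quiet_bonus = (dirtied_last_txg == 0) ?
	    unflushed : unflushed / (dirtied_last_txg + 1);
	uint64_t n = CONDENSE_FLUSH_MIN + quiet_bonus;
	return (n > unflushed ? unflushed : n);
}
```

On a fully idle pool this takes the whole backlog at once; on a busy pool it degrades toward the fixed floor, which keeps the impact on the foreground workload bounded.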
I'll ask around. Let me know if you have any thoughts.
@robn I think once the user has requested a condense, we should do it as fast as possible, since the user is likely waiting for it to reboot, export, or whatever. This operation makes no sense to do just routinely. The only limitation is not to hurt other workloads too much.
The amount of sync is a good question. I'd guess we don't want to extend the transaction group for too long, nor consume too much memory on dirty data, etc. But my last trip through the spacemaps was some time ago, so I don't have specific recommendations.