zfs icon indicating copy to clipboard operation
zfs copied to clipboard

During pool export flush the ARC asynchronously

Open don-brady opened this issue 1 year ago • 13 comments

Motivation and Context

When a pool is exported, the ARC buffers in use by that pool are flushed (evicted) as part of the export. In addition, any L2 VDEVs are removed from the L2 ARC. Both of these operations are performed sequentially and inline to the export. For larger ARC footprints, this can represent a significant amount of time. In an HA scenario, this can cause a planned failover to take longer than needed and risk timeouts on the services backed by the pool data.

Description

The teardown of the ARC data used by the pool can be done asynchronously during a pool export. For the main ARC data, the spa load GUID is used to associate a buffer with the spa so we can safely free the spa_t while the teardown proceeds in the background. For the L2 ARC VDEV, the device l2arc_dev_t has a copy of the vdev_t pointer which was being used to calculate the asize of the buffer from the psize when updating the L2 arc stats during the teardown. This asize value can be captured when the buffer is created, thereby eliminating the need for a late binding asize calculation using the VDEV during teardown.

Added an arc_flush_taskq for these background teardown tasks. In arc_fini() (e.g., during module unload) it now waits for any teardown tasks to complete.

Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc.

How Has This Been Tested?

  1. Manual testing with ARC and multiple L2 arc devices up to about 24G of arc data and ~100GB of L2 capacity. The pool export went from about 45 seconds down to 5 seconds with the asynchronous teardown in place.
  2. Manually tested exporting while a L2 rebuild was still in progress. The L2 vdev waits for the rebuild to be canceled before proceeding with the teardown.
  3. Ran various ZTS test suites, like l2arc, zpool_import, zpool_export, to exercise the changed code paths.
  4. Ran ztest

Types of changes

  • [ ] Bug fix (non-breaking change which fixes an issue)
  • [x] New feature (non-breaking change which adds functionality)
  • [ ] Performance enhancement (non-breaking change which improves efficiency)
  • [ ] Code cleanup (non-breaking change which makes code smaller or more readable)
  • [ ] Breaking change (fix or feature that would cause existing functionality to change)
  • [ ] Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • [ ] Documentation (a change to man pages or other documentation)

Checklist:

  • [x] My code follows the OpenZFS code style requirements.
  • [ ] I have updated the documentation accordingly.
  • [x] I have read the contributing document.
  • [ ] I have added tests to cover my changes.
  • [ ] I have run the ZFS Test Suite with this change applied.
  • [x] All commit messages are properly formatted and contain Signed-off-by.

don-brady avatar May 21 '24 22:05 don-brady

@gamanakis -- if you have time could you look at the L2 part of this change? Thanks

don-brady avatar Jun 03 '24 20:06 don-brady

Thanks for including me on this, on a first pass it looks good.

gamanakis avatar Jun 04 '24 09:06 gamanakis

Looks interesting, but what happen if the pool (obviously with the same GUID) is re-imported while async flush is still running?

Good question and I tested this case. The async flush will continue it's best-effort pass to flush buffers associated with the exported spa's guid. Any ARC buffers that the import uses will have a positive ref count and be skipped by the async flush task. You can think of it as an alternate arc evict thread that is targeting a specific guid with a zero ref count rather than age.

I suppose we could have the task periodically check if there is an active spa with that guid and exit the task. I'm not sure how common it is to export and then re-import right away on the same host. Before this change, you would have to wait until the export finished flushing the ARC buffers.

The ARC teardown for a spa can take multiple minutes, so you could even have a second pool export+import while the first arc_flush_task() is still running and end up with two arc_flush_task() both looking to evict candidates for the same guid. This is not fatal but a little weird.

don-brady avatar Jun 06 '24 21:06 don-brady

Aside of it being weird, I worry that even if unlikely it is not impossible for the pool to be changed while being exported, while ARC still holds the old data.

amotin avatar Jun 12 '24 13:06 amotin

Aside of it being weird, I worry that even if unlikely it is not impossible for the pool to be changed while being exported, while ARC still holds the old data.

I share this concern. Would it make sense to block importing the same pool again until eviction is complete?

allanjude avatar Jun 17 '24 00:06 allanjude

Per @amotin -- add a zpool wait for ARC teardown

don-brady avatar Jun 18 '24 20:06 don-brady

To address concerns of re-importing after the pool was changed on another host:

Save txg at export and compare to imported txg

  • if same, then cancel the teardown (ARC data is still valid)
  • If different, force import to wait for teardown to complete (ARC data can be stale)

don-brady avatar Jun 18 '24 20:06 don-brady

Thinking more about it, since ARC is indexed on DVA+birth, in case of pool export/import if some blocks appear stale, they should just never be used, unless we actually import some other pool somehow having the same GUID, or import the pool at earlier TXG. We already have somewhat similar problem with persistent L2ARC, when we load into ARC headers blocks that could be long freed from the pool, and they stay in ARC until L2ARC rotate and evict them. But in case of L2ARC we at least know that those stale blocks are from this pool and just won't be used again. I am not sure whether multiple pools with the same GUID is a realistic scenario, but importing the pool at earlier TXG I think may be more realistic, and dangerous same time, since the same TXG numbers may be reused.

amotin avatar Jun 18 '24 21:06 amotin

Update on re-import while ARC tear-down in progress

Ever since the commit that added zpool reguid feature, the ARC uses the spa_load_guid, not the spa's actual guid for identification. This load guid is transient, not persistent and will change at each import. So after the import, any blocks left in the ARC with this load guid are now orphaned and not associated with any spa.

So we don't need to worry about ARC blocks that are still around when the pool is re-imported since it will identify its blocks using a different spa_load_guid.

don-brady avatar Jun 26 '24 20:06 don-brady

@don-brady I was wondering when you were going to remember the behaviour of spa_load_guid

richardelling avatar Jul 23 '24 20:07 richardelling

Have you looked at having arc_evict prioritize evicting from flushed pools vs trying to evict buffers from active pools? @grwilson

I hadn't considered it other than if it was safe. Both arc_evict() and arc_flush() sit on top of arc_evict_state(). In the flush case it is targeting a specific spa (guid) and in the evict case it ignores targeting (i.e. when guid == 0). So arc_evict_state() is either targeting a specific guid or none at all.

We can have multiple threads (arc_evict() and multiple arc_flush_async()) all running at the same time. And in the underlying arc_evict_state() it's using a randomly selected sublist. So they will likely be working on different buffers.

(a) One option would be to have arc_evict() thread back off when there are any active arc flushes so as to give those flush-initiated evictions priority. And maybe have the last arc flush wake up the arc evict thread.

(b) Another option would be to keep a list of active flush spa guids and have the arc_evict() thread only match on guids from the list if it is not empty -- so it ends up only targeting buffers that need to be flushed.

don-brady avatar Aug 12 '24 18:08 don-brady

Rebased to fix merge conflicts.

Eviction thread nows considers active spa flushes and targets those spas (if any).

don-brady avatar Aug 23 '24 14:08 don-brady

Removed last commit and rebased to latest master branch.

don-brady avatar Sep 27 '24 19:09 don-brady

Rebased to latest master and switched to use taskq_dispatch_ent().

don-brady avatar Oct 22 '24 04:10 don-brady

fixed a leak in arc_async_flush_list() found during module unload by ztest

don-brady avatar Oct 22 '24 17:10 don-brady

Thank you for this PR, the L2ARC part looks good to me in a first pass.

gamanakis avatar Oct 22 '24 21:10 gamanakis