
DAOS-14679 pool: Report on stopping sp_stopping

Open liw opened this issue 1 year ago • 1 comments

When destroying a pool on an engine (as part of a pool destroy command), we may block in

ds_pool_stop
  pool_put_sync // putting the last reference
    pool_free_ref // called after the hash rec deletion
      pool_child_delete_one
        pool_child_stop

waiting for some ds_pool_child references. If the user retries the pool destroy command, the new ds_pool_stop call on this engine can't find the ds_pool object that is in the sp_stopping state, and the new pool destroy command usually succeeds, even though the storage capacity allocated to the pool hasn't been released yet.
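The retry scenario above can be sketched with a small toy model. This is only an illustration under stated assumptions: `old_pool`, `old_lookup`, `old_stop_begin`, and `old_stop_retry` are hypothetical names, not DAOS functions, and the blocking wait is reduced to a flag and a counter.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: these names are hypothetical, not DAOS APIs. */
struct old_pool {
    bool in_cache;   /* whether a lookup can still find the pool */
    bool stopping;   /* stands in for sp_stopping */
    int  child_refs; /* outstanding child references being waited for */
};

/* Lookup as a retried stop would see it: only cached pools are found. */
static struct old_pool *old_lookup(struct old_pool *p)
{
    return p->in_cache ? p : NULL;
}

/* Pre-patch ordering: the hash record is deleted first, and only then
 * does the free callback wait for the child references. */
static void old_stop_begin(struct old_pool *p)
{
    p->stopping = true;
    p->in_cache = false; /* record gone while child_refs may still be > 0 */
}

/* A retried stop cannot tell "still stopping" from "never started". */
static int old_stop_retry(struct old_pool *p)
{
    if (old_lookup(p) == NULL)
        return 0; /* reported as success even though the wait is not over */
    return -1;
}
```

In this model, calling `old_stop_begin` on a pool that still has child references and then calling `old_stop_retry` yields 0, which mirrors the retried pool destroy succeeding before the storage is released.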

This patch makes the following changes:

  • Move the pool_child_delete_one collective call from pool_free_ref to before the ds_pool_put call in ds_pool_stop. This makes sure that the ds_pool object remains in the LRU cache in the sp_stopping state while we wait for ds_pool_child references (or some other upper layer resource).

  • Remove the pool_put_sync trick, which was introduced because of the pool_child_delete_one collective call in pool_free_ref.

  • Return an error from ds_pool_stop when the ds_pool object is in the sp_stopping state, so that the new pool destroy command in the aforementioned scenario will return an explicit error.

  • Register a reply aggregation callback for MGMT_TGT_DESTROY so that an error can reach the control plane.
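As a rough sketch of the intended behavior after these changes (not the actual patch), the toy model below keeps the pool findable while it is stopping, rejects a concurrent stop with an explicit error, and folds per-target results into one return code. All names and error values here (`toy_pool`, `toy_pool_stop`, `toy_stop_progress`, `toy_aggregate_rc`, `TOY_ERR_BUSY`, `TOY_WOULD_BLOCK`) are assumptions for illustration; the real error code and callback signature are not stated in this description.

```c
#include <stdbool.h>

/* Toy model only: names and return codes are hypothetical, not DAOS APIs. */
enum {
    TOY_OK          = 0,
    TOY_ERR_BUSY    = -1, /* placeholder for the explicit error on retry */
    TOY_WOULD_BLOCK = 1   /* stands in for the caller blocking in real code */
};

struct toy_pool {
    bool cached;     /* still findable in the cache */
    bool stopping;   /* stands in for sp_stopping */
    int  child_refs; /* outstanding child references */
};

/* Wait step: the pool leaves the cache only after the children are gone. */
static int toy_stop_progress(struct toy_pool *p)
{
    if (p->child_refs > 0)
        return TOY_WOULD_BLOCK; /* still cached, still marked stopping */
    p->cached = false;          /* final put happens last */
    p->stopping = false;
    return TOY_OK;
}

/* Stop entry point: a second caller sees the stopping state and fails. */
static int toy_pool_stop(struct toy_pool *p)
{
    if (!p->cached)
        return TOY_OK;       /* nothing to stop */
    if (p->stopping)
        return TOY_ERR_BUSY; /* an earlier stop is still in progress */
    p->stopping = true;
    return toy_stop_progress(p);
}

/* Reply aggregation: keep the first nonzero per-target result so that
 * one failing target is enough to fail the whole request. */
static void toy_aggregate_rc(int *agg_rc, int src_rc)
{
    if (*agg_rc == 0 && src_rc != 0)
        *agg_rc = src_rc;
}
```

With one child reference outstanding, the first `toy_pool_stop` call reports `TOY_WOULD_BLOCK`, a retry reports `TOY_ERR_BUSY`, and once the reference is dropped `toy_stop_progress` completes and removes the pool from the cache. Keeping the first nonzero result in the aggregation step is one simple policy among several; the actual callback may combine errors differently.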

Features: pool

Before requesting gatekeeper:

  • [ ] Two review approvals and any prior change requests have been resolved.
  • [ ] Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • [ ] Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • [ ] Commit messages follow the guidelines outlined here.
  • [ ] Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • [ ] You are the appropriate gatekeeper to be landing the patch.
  • [ ] The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • [ ] Githooks were used. If not, request that user install them and check copyright dates.
  • [ ] Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • [ ] All builds have passed. Check non-required builds for any new compiler warnings.
  • [ ] Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • [ ] If applicable, the PR has addressed any potential version compatibility issues.
  • [ ] Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • [ ] Extra checks if forced landing is requested
    • [ ] Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • [ ] No new NLT or valgrind warnings. Check the classic view.
    • [ ] Quick-build or Quick-functional is not used.
  • [ ] Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

liw, May 15 '24 08:05

Ticket title is 'LRZ: dmg pool destroy fails'. Status is 'Blocked'. Labels: 'LRZ,triaged'. https://daosio.atlassian.net/browse/DAOS-14679

github-actions[bot], May 15 '24 08:05

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14374/3/testReport/

daosbuild1, May 17 '24 18:05

container/boundary: DAOS-15857

liw, May 20 '24 01:05

Merged with master to fix a minor conflict in daos_srv/pool.h.

liw, May 21 '24 05:05

Perhaps you should add

Allow-unstable-test: true

That way it at least runs all hardware tests if it's blocked by a CI bug on VM tests.

jolivier23, May 24 '24 14:05

Merged master again because the previous CI job was stuck in "Test RPMs".

liw, May 28 '24 05:05

Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-14374/8/display/redirect

daosbuild1, May 28 '24 05:05