DAOS-17600 mgmt: Fix reint error handling

Open liw opened this issue 11 months ago • 11 comments

Remove the "cleanup" error handling in ds_mgmt_tgt_pool_create_ranks for now, because in the reintegration case the pool may already exist on the rank before the reintegration, so cleaning up on error could destroy a pool that this operation did not create. For the pool create case, add the cleanup to ds_mgmt_create_pool instead. Add the "failout" flag to the MGMT_TGT_CREATE CoRPC to avoid leaking pools on ranks being reintegrated; unlike some other collective operations, MGMT_TGT_CREATE does not need to execute on as many ranks as possible once an error has occurred.
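
To illustrate the difference between "failout" and "run on as many ranks as possible", here is a minimal, sequential toy model. It is not the CaRT CoRPC implementation (a real collective RPC is forwarded along a tree, not a loop), and `toy_collective`, `rank_handler_t`, and `fail_on_rank` are hypothetical names made up for this sketch. The point is only that with failout, fewer ranks execute the handler after the first error, so fewer ranks can end up holding a pool that nobody cleans up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-rank handler: returns 0 on success, negative on error. */
typedef int (*rank_handler_t)(int rank, void *arg);

/*
 * Toy, sequential stand-in for a collective RPC (assumption: illustration
 * only, not how CaRT forwards CoRPCs). Returns 0 or the first error seen.
 * On return, *n_executed is the number of ranks the handler ran on.
 */
int
toy_collective(const int *ranks, size_t n_ranks, rank_handler_t handler, void *arg,
	       bool failout, size_t *n_executed)
{
	int rc = 0;

	assert(ranks != NULL || n_ranks == 0);
	assert(handler != NULL && n_executed != NULL);

	*n_executed = 0;
	for (size_t i = 0; i < n_ranks; i++) {
		int rc_tmp = handler(ranks[i], arg);

		(*n_executed)++;
		if (rc_tmp != 0) {
			if (rc == 0)
				rc = rc_tmp; /* remember the first error */
			if (failout)
				break;       /* stop: do not touch more ranks */
		}
	}
	return rc;
}

/* Example handler: fails only on the rank pointed to by arg. */
int
fail_on_rank(int rank, void *arg)
{
	return rank == *(int *)arg ? -1 : 0;
}
```

With ranks {0, 1, 2, 3} and a failure on rank 1, the failout variant runs the handler on two ranks and the best-effort variant on all four; both report the same error to the caller.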

Fix a ds_mgmt_hdlr_tgt_create error path that forgets to roll back the record it inserted into the dpt_creates_ht hash table.
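
The class of bug fixed here can be sketched as follows. This is a toy model under stated assumptions, not the actual ds_mgmt_hdlr_tgt_create code: `struct toy_table` stands in for the real hash table, and every `toy_*` name is hypothetical. In this toy the record is kept on success; the real handler's record lifetime may differ. What the sketch shows is that every error path after the insert must go through the label that removes the record, otherwise a later create of the same pool finds a stale entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_TABLE_SIZE 8
#define TOY_KEY_LEN    37

/* Hypothetical stand-in for a table of in-progress creates. */
struct toy_table {
	char keys[TOY_TABLE_SIZE][TOY_KEY_LEN];
	bool used[TOY_TABLE_SIZE];
};

/* Return the slot holding key, or -1 if absent. */
static int
toy_find(const struct toy_table *t, const char *key)
{
	for (int i = 0; i < TOY_TABLE_SIZE; i++)
		if (t->used[i] && strcmp(t->keys[i], key) == 0)
			return i;
	return -1;
}

/* Insert key; return its slot, -1 if already present, -2 if full. */
static int
toy_insert(struct toy_table *t, const char *key)
{
	if (toy_find(t, key) >= 0)
		return -1;
	for (int i = 0; i < TOY_TABLE_SIZE; i++) {
		if (!t->used[i]) {
			strncpy(t->keys[i], key, TOY_KEY_LEN - 1);
			t->keys[i][TOY_KEY_LEN - 1] = '\0';
			t->used[i] = true;
			return i;
		}
	}
	return -2;
}

/* Hypothetical create steps used to exercise both paths. */
int toy_create_ok(const char *pool_id)   { (void)pool_id; return 0; }
int toy_create_fail(const char *pool_id) { (void)pool_id; return -5; }

int
toy_tgt_create(struct toy_table *t, const char *pool_id, int (*do_create)(const char *))
{
	int slot;
	int rc;

	assert(t != NULL && pool_id != NULL && do_create != NULL);
	assert(strlen(pool_id) < TOY_KEY_LEN);

	slot = toy_insert(t, pool_id);
	if (slot < 0)
		return slot; /* nothing inserted, nothing to roll back */

	rc = do_create(pool_id);
	if (rc != 0)
		goto out_rec; /* the path that is easy to forget */

	return 0; /* toy assumption: record stays on success */

out_rec:
	t->used[slot] = false; /* roll back the record inserted above */
	return rc;
}
```

After a failed `toy_tgt_create`, `toy_find` no longer sees the pool ID, so a retry with a working create step succeeds instead of being rejected as a duplicate.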

Tune the logging to help future debugging in this area.

Steps for the author:

  • [ ] Commit message follows the guidelines.
  • [ ] Appropriate Features or Test-tag pragmas were used.
  • [ ] Appropriate Functional Test Stages were run.
  • [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).

liw avatar May 26 '25 08:05 liw

Test stage Functional Hardware Large MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16432/1/testReport/

daosbuild3 avatar May 26 '25 22:05 daosbuild3

Ticket title is 'Reintegration error handling issues'
Status is 'In Progress'
Labels: 'scrubbed_2.8'
https://daosio.atlassian.net/browse/DAOS-17600

github-actions[bot] avatar May 27 '25 00:05 github-actions[bot]

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16432/4/execution/node/1348/log

daosbuild3 avatar May 27 '25 09:05 daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16432/4/testReport/

daosbuild3 avatar May 27 '25 10:05 daosbuild3

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16432/7/execution/node/1338/log

daosbuild3 avatar May 31 '25 00:05 daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16432/7/execution/node/1478/log

daosbuild3 avatar May 31 '25 07:05 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16432/7/execution/node/1433/log

daosbuild3 avatar May 31 '25 07:05 daosbuild3

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16432/8/testReport/

daosbuild3 avatar Jun 09 '25 13:06 daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16432/8/testReport/

daosbuild3 avatar Jun 10 '25 14:06 daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16432/9/testReport/

daosbuild3 avatar Jun 13 '25 08:06 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16432/9/execution/node/1513/log

daosbuild3 avatar Jun 14 '25 01:06 daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16432/10/execution/node/1448/log

daosbuild3 avatar Jul 02 '25 15:07 daosbuild3

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16432/11/execution/node/1455/log

daosbuild3 avatar Jul 08 '25 23:07 daosbuild3

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16432/13/testReport/

daosbuild3 avatar Jul 14 '25 01:07 daosbuild3