
DAOS-17736 rebuild: exit the whole rebuild when one obj rebuild failed

Open liuxuezhao opened this issue 8 months ago • 1 comments

When the rebuild of one object fails, exit the whole rebuild to avoid a pool destroy timeout. After the rebuild is done, change the rank domain's status from DOWN to DOWNOUT.
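
The fail-fast behaviour described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `tgt_status`, `obj_rebuild_fn`, `rebuild_all` and `demo_rebuild_obj` are hypothetical names, and the sketch assumes the DOWN to DOWNOUT transition happens only when every object was rebuilt successfully, which the description does not state explicitly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical states, named after DOWN and DOWNOUT in the description. */
enum tgt_status { TGT_UP, TGT_DOWN, TGT_DOWNOUT };

/* Hypothetical per-object rebuild callback: 0 on success, negative on error. */
typedef int (*obj_rebuild_fn)(int obj_id);

/* Demo callback (assumption): a negative object ID stands in for a failed rebuild. */
static int
demo_rebuild_obj(int obj_id)
{
	return obj_id < 0 ? -1 : 0;
}

/*
 * Rebuild the objects in order and stop at the first failure instead of
 * continuing with the remaining objects. On full success the status moves
 * from DOWN to DOWNOUT; on failure it is left at DOWN (assumption).
 * Returns 0 or the first error code; *nr_done counts the objects rebuilt.
 */
static int
rebuild_all(const int *objs, size_t nr, obj_rebuild_fn fn,
	    enum tgt_status *status, size_t *nr_done)
{
	size_t	i;
	int	rc;

	assert(*status == TGT_DOWN);
	*nr_done = 0;
	for (i = 0; i < nr; i++) {
		rc = fn(objs[i]);
		if (rc != 0)
			return rc; /* fail fast: skip the remaining objects */
		(*nr_done)++;
	}
	*status = TGT_DOWNOUT;
	return 0;
}
```

Calling `rebuild_all()` with the object IDs `{1, -2, 3}` and `demo_rebuild_obj` returns `-1` after rebuilding one object and leaves the status at `TGT_DOWN`, so the caller is not kept waiting on the rest of the list.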

Steps for the author:

  • [ ] Commit message follows the guidelines.
  • [ ] Appropriate Features or Test-tag pragmas were used.
  • [ ] Appropriate Functional Test Stages were run.
  • [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).

liuxuezhao, Jun 26 '25 10:06

Ticket title is 'erasurecode/online_rebuild_mdtest.py pool destroy timedout'
Status is 'Awaiting backport'
Labels: 'scrubbed_2.6.5'
Job should run at elevated priority (1)
https://daosio.atlassian.net/browse/DAOS-17736

github-actions[bot], Jun 26 '25 10:06

Test stage Functional on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16543/2/testReport/

daosbuild3, Jul 10 '25 21:07

Test stage Functional Hardware Large completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16543/4/execution/node/1531/log

daosbuild3, Jul 28 '25 13:07

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16543/5/execution/node/1392/log

daosbuild3, Oct 14 '25 04:10

Test stage Functional Hardware Large completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16543/5/execution/node/1528/log

daosbuild3, Oct 14 '25 05:10

Test stage Build RPM on Leap 15.5 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16543/8/execution/node/334/log

daosbuild3, Nov 24 '25 03:11

Test stage Build RPM on EL 8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16543/8/execution/node/360/log

daosbuild3, Nov 24 '25 03:11

@daos-stack/daos-gatekeeper clang-format reports only one warning, on this call:

  - return pool_update_map_internal(pool_uuid, MAP_FINISH_REBUILD, true, list, NULL, NULL,
  -   			NULL, NULL, reclaim_ver, NULL);

  + return pool_update_map_internal(pool_uuid, MAP_FINISH_REBUILD, true, list, NULL, NULL, NULL,

I can fix it in a follow-up PR to avoid a CI re-test.

liuxuezhao, Dec 01 '25 07:12