DAOS-17736 rebuild: exit the whole rebuild when one obj rebuild failed

Open liuxuezhao opened this issue 8 months ago • 4 comments

When one object's rebuild fails, exit the whole rebuild to avoid a pool destroy timeout. After the rebuild is done, change the rank domain's status from DOWN to DOWNOUT.
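The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `rebuild_all`, `obj_rebuild_fn`, `enum rank_status`, and the two demo callbacks are hypothetical names invented for the example. It also assumes that the DOWN to DOWNOUT transition happens only when every object rebuilt successfully; the actual patch may handle the failure case differently.

```c
#include <stddef.h>

/* Hypothetical status values, named after the states in the description. */
enum rank_status { RANK_DOWN, RANK_DOWNOUT };

/* Hypothetical per-object rebuild callback: 0 on success, non-zero on error. */
typedef int (*obj_rebuild_fn)(int obj_id);

/*
 * Rebuild the objects in order and stop at the first failure, rather than
 * continuing with the remaining objects. Returns 0 on success or the first
 * error code. *nr_attempted reports how many objects were tried.
 */
int rebuild_all(const int *objs, size_t nr, obj_rebuild_fn rebuild_one,
                enum rank_status *status, size_t *nr_attempted)
{
    size_t i;
    int rc = 0;

    for (i = 0; i < nr; i++) {
        rc = rebuild_one(objs[i]);
        if (rc != 0) {
            i++; /* count the failed object as attempted */
            break; /* exit the whole rebuild on the first failure */
        }
    }
    *nr_attempted = i;

    /* Assumption: only a fully successful rebuild moves DOWN to DOWNOUT. */
    if (rc == 0 && *status == RANK_DOWN)
        *status = RANK_DOWNOUT;

    return rc;
}

/* Demo callbacks for illustration only. */
int fail_on_obj_3(int obj_id) { return obj_id == 3 ? -1 : 0; }
int always_ok(int obj_id) { (void)obj_id; return 0; }
```

With five objects numbered 1 to 5 and `fail_on_obj_3` as the callback, the loop stops after the third object and leaves the status at `RANK_DOWN`, so the caller is not left waiting on the remaining objects.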

Steps for the author:

  • [ ] Commit message follows the guidelines.
  • [ ] Appropriate Features or Test-tag pragmas were used.
  • [ ] Appropriate Functional Test Stages were run.
  • [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).

liuxuezhao, Jun 24 '25 10:06

Ticket title is 'erasurecode/online_rebuild_mdtest.py pool destroy timedout'. Status is 'In Progress'. https://daosio.atlassian.net/browse/DAOS-17736

github-actions[bot], Jun 24 '25 10:06

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16534/1/testReport/

daosbuild3, Jun 24 '25 10:06

Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16534/1/display/redirect

daosbuild3, Jun 24 '25 12:06

Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16534/1/display/redirect

daosbuild3, Jun 24 '25 17:06

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16534/6/testReport/

daosbuild3, Jul 12 '25 01:07

The only test failure, in EC17 and EC18, is due to DAOS-17656. That issue has been fixed, but this PR's build was not merged with the version containing the fix.

liuxuezhao, Jul 14 '25 01:07

IMO it's generally risky to land a rebuild PR when rebuild PR tests are failing, related or not, so I will leave this to someone else.

daltonbohning, Jul 16 '25 15:07