DAOS-17736 rebuild: exit the whole rebuild when one object rebuild fails
When one object rebuild fails, exit the whole rebuild to avoid a pool destroy timeout. After the rebuild is done, change the rank domain's status from DOWN to DOWNOUT.
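For reviewers, the intended control flow can be modeled roughly as below. This is a minimal sketch in Python, not the DAOS C code: `RankStatus`, `rebuild_objects`, `finish_rebuild`, and the `rebuild_one` callback are hypothetical names used only for illustration. It assumes a non-zero return code means failure, and it leaves open whether the DOWN to DOWNOUT transition also applies after an aborted rebuild, since the description does not say.

```python
from enum import Enum


class RankStatus(Enum):
    # Hypothetical stand-ins for the rank domain states named in this PR.
    UP = "UP"
    DOWN = "DOWN"
    DOWNOUT = "DOWNOUT"


def rebuild_objects(objects, rebuild_one):
    """Rebuild objects in order and stop at the first failure.

    rebuild_one is a hypothetical callback returning 0 on success and a
    non-zero code on failure. Returns (rc, attempted), where attempted
    counts the objects tried before the rebuild exited.
    """
    attempted = 0
    for obj in objects:
        attempted += 1
        rc = rebuild_one(obj)
        if rc != 0:
            # One object failed: exit the whole rebuild instead of
            # continuing with the remaining objects.
            return rc, attempted
    return 0, attempted


def finish_rebuild(rank_status):
    """Return a new mapping in which every DOWN rank becomes DOWNOUT.

    Ranks in any other state are left unchanged.
    """
    return {
        rank: RankStatus.DOWNOUT if status is RankStatus.DOWN else status
        for rank, status in rank_status.items()
    }
```

In this model, a failure on the second of three objects returns after two attempts, so a subsequent pool destroy does not wait on work that can no longer succeed.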
Steps for the author:
- [ ] Commit message follows the guidelines.
- [ ] Appropriate Features or Test-tag pragmas were used.
- [ ] Appropriate Functional Test Stages were run.
- [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
- [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.
After all prior steps are complete:
- [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).
Ticket title is 'erasurecode/online_rebuild_mdtest.py pool destroy timedout' Status is 'In Progress' https://daosio.atlassian.net/browse/DAOS-17736
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16534/1/testReport/
Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16534/1/display/redirect
Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16534/1/display/redirect
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16534/6/testReport/
The only test failure, in EC17 and EC18, is due to DAOS-17656. That issue has been fixed, but this PR's build was not merged with the version containing the fix.
IMO it's generally risky to land a rebuild PR when rebuild PR tests are failing, related or not, so I will leave this to someone else.