DAOS-17666 pool: Fix property self_heal.exclude
Flag "exclude" of pool property "self_heal" has no effect, as shown by the experiment described in the Jira ticket. This patch checks the flag before excluding a rank from a pool, and restarts exclusion when "self_heal" is set to include "exclude".
Steps for the author:
- [ ] Commit message follows the guidelines.
- [ ] Appropriate Features or Test-tag pragmas were used.
- [ ] Appropriate Functional Test Stages were run.
- [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
- [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.
After all prior steps are complete:
- [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).
Ticket title is 'Flag exclude of pool property self_heal has no effect'
Status is 'In Progress'
https://daosio.atlassian.net/browse/DAOS-17666
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16495/2/testReport/
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16495/2/testReport/
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16495/2/execution/node/1474/log
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16495/3/testReport/
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16495/3/execution/node/1399/log
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16495/5/execution/node/1369/log
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16495/6/testReport/
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16495/6/execution/node/1460/log
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16495/7/execution/node/1518/log
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16495/8/execution/node/1540/log
The daos_rebuild_ec REBUILD48 fix depends on #16503. Hence, I've canceled the build job.
The daos_rebuild_ec REBUILD48 fix depends on #16503. Hence, I've canceled the build job.
The dependency has landed; merging and resuming testing...
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16495/10/execution/node/1509/log
[liw] DAOS-16762, DAOS-17867
Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16495/12/display/redirect
Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16495/12/display/redirect
Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16495/13/execution/node/1222/log
Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16495/14/execution/node/1268/log
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16495/16/testReport/
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16495/17/execution/node/1451/log
[liw] erasurecode/online_rebuild_mdtest: DAOS-17751
There's one known failure, erasurecode/online_rebuild_mdtest DAOS-17751.
Sigh, okay, I forgot to carry over the Features: rebuild from https://github.com/daos-stack/daos/pull/16495/commits/248bf01320808c9fbd8463bf2212c311a1757854 to the later merge commits. The landing process is simply too long.
Merged master to pick up https://github.com/daos-stack/daos/pull/16777 (thanks, @kccain, for the pointer). The merge commit has
Features: test_rebuild_33 test_rebuild_34