DAOS-17305 mgmt: Skip pools during engine start
Add an "internal" environment variable, DAOS_POOL_BLACKLIST, for skipping problematic pools in emergency situations.
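For readers unfamiliar with this kind of escape hatch, here is a minimal sketch of how such a variable might be parsed. It is not the code in this PR. The value format (a comma-separated list of textual pool UUIDs), the `pool_skip_list` type, the size cap, and all function names are assumptions made for illustration only. The sketch also assumes that a malformed value results in an empty list, which is a design choice rather than documented behavior.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define MAX_SKIPPED_POOLS 16 /* hypothetical cap, not from the PR */
#define POOL_ID_LEN       36 /* length of a textual UUID */

/* Hypothetical container for the parsed pool IDs. */
struct pool_skip_list {
	char ids[MAX_SKIPPED_POOLS][POOL_ID_LEN + 1];
	int  count;
};

/* Case-insensitive comparison of two NUL-terminated strings. */
bool
pool_id_equal(const char *a, const char *b)
{
	while (*a != '\0' && *b != '\0') {
		if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
			return false;
		a++;
		b++;
	}
	return *a == *b;
}

/*
 * Parse an assumed comma-separated list of UUID strings.
 * Returns 0 on success, -1 on a malformed value (the list is then empty).
 */
int
pool_skip_list_parse(const char *value, struct pool_skip_list *list)
{
	const char *p = value;

	assert(list != NULL);
	list->count = 0;
	if (value == NULL || *value == '\0')
		return 0; /* unset or empty: nothing to skip */

	while (*p != '\0') {
		size_t len = strcspn(p, ",");

		if (len != POOL_ID_LEN || list->count == MAX_SKIPPED_POOLS) {
			list->count = 0;
			return -1;
		}
		memcpy(list->ids[list->count], p, len);
		list->ids[list->count][len] = '\0';
		list->count++;
		p += len;
		if (*p == ',')
			p++;
	}
	return 0;
}

/* Return true if pool_id appears in the list. */
bool
pool_skip_list_contains(const struct pool_skip_list *list, const char *pool_id)
{
	for (int i = 0; i < list->count; i++)
		if (pool_id_equal(list->ids[i], pool_id))
			return true;
	return false;
}

/* Read the variable named in this PR and parse it with the assumed format. */
int
pool_skip_list_from_env(struct pool_skip_list *list)
{
	return pool_skip_list_parse(getenv("DAOS_POOL_BLACKLIST"), list);
}
```

Under these assumptions, a caller would invoke `pool_skip_list_from_env()` once at start-up and consult `pool_skip_list_contains()` before starting each pool.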
Steps for the author:
- [ ] Commit message follows the guidelines.
- [ ] Appropriate Features or Test-tag pragmas were used.
- [ ] Appropriate Functional Test Stages were run.
- [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
- [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.
After all prior steps are complete:
- [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).
Ticket title is 'Skip pools during engine start'. Status is 'In Progress'. https://daosio.atlassian.net/browse/DAOS-17305
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16452/3/display/redirect
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16452/5/execution/node/1338/log
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16452/5/display/redirect
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16452/5/testReport/
LGTM. As I reviewed, a question occurred to me: could the logic that processes the blacklist environment variable value live in the pool module (perhaps at module .sm_init time)? The idea is that the existing pool_svc_failed_list could be reused, i.e., append blacklisted pools to that list, and check whether a pool is in that list to decide whether to start it. There is probably a very good reason to keep this logic in the mgmt module that I am not considering, so no change requested.
@kccain, to be honest, I didn't think of the failed list when it came to storing the parsing result of the new environment variable, probably because, subconsciously, I regarded the blacklist as an input to the mgmt pool iteration code, and the failed list felt more like a result, rather than an input, of the iteration. Now that I think about it, using the failed list as an input may also work. (One hesitation I have is that I've been thinking of replacing the failed list with some new ds_pool states, and prefer not to add new meaning to the failed list if possible.) Please let me know your thoughts.
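To make the input/result distinction concrete, here is a rough sketch of a pool iteration in which the skip list is purely an input and the failures are purely an output. Everything here is hypothetical: the `pool_outcome` enum, `start_pools()`, and the `demo_pool_start()` stand-in are invented for illustration and do not correspond to actual DAOS code or to the ds_pool states mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-pool result of one start-up pass. */
enum pool_outcome { POOL_STARTED, POOL_SKIPPED, POOL_FAILED };

typedef int (*pool_start_fn)(const char *pool_id);

/* Hypothetical stand-in for starting a pool: fails for IDs beginning with "bad". */
int
demo_pool_start(const char *pool_id)
{
	return strncmp(pool_id, "bad", 3) == 0 ? -1 : 0;
}

/* Return 1 if id appears in list, 0 otherwise. */
static int
is_listed(const char *const *list, size_t n, const char *id)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(list[i], id) == 0)
			return 1;
	return 0;
}

/*
 * Walk all pools. The skip list is only read (input); outcomes[] is only
 * written (result). Returns the number of pools that failed to start.
 */
size_t
start_pools(const char *const *pools, size_t n_pools,
	    const char *const *skip, size_t n_skip,
	    pool_start_fn start, enum pool_outcome *outcomes)
{
	size_t n_failed = 0;

	assert(outcomes != NULL);
	for (size_t i = 0; i < n_pools; i++) {
		if (is_listed(skip, n_skip, pools[i])) {
			outcomes[i] = POOL_SKIPPED; /* never attempted */
		} else if (start(pools[i]) != 0) {
			outcomes[i] = POOL_FAILED; /* attempted, did not start */
			n_failed++;
		} else {
			outcomes[i] = POOL_STARTED;
		}
	}
	return n_failed;
}
```

Keeping "skipped" separate from "failed" in this sketch means a pool that was deliberately left out is not recorded the same way as one that was tried and could not start, which is the concern about adding new meaning to the failed list.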
Sounds good to keep it in the current form.
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16452/7/testReport/
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16452/7/display/redirect
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16452/7/display/redirect
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16452/9/execution/node/736/log
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16452/10/execution/node/1507/log
@daos-stack/daos-gatekeeper, ping.