DAOS-17442 engine: Improve start issue handling
The available log messages suggest that bio_xsctxt_alloc can get stuck after "failed to init bdevs: -1003", for an unknown reason related to SPDK. This patch skips two while loops in bio_xsctxt_free that may be where it gets stuck.
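As a rough illustration of the idea (not the actual DAOS code), the sketch below shows a teardown routine that skips its drain loop when initialization has already failed. The struct, its fields, and the functions `poll_once` and `xs_ctxt_free` are hypothetical names invented for this example; the real bio_xsctxt_free and its SPDK polling calls are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-xstream context; not the real DAOS structure. */
struct xs_ctxt {
	bool init_failed; /* set when device initialization returned an error */
	int  pending;     /* outstanding completions that polling would retire */
	int  polls;       /* number of poll iterations performed (illustration only) */
};

/* Stand-in for a poller: retires at most one pending completion per call
 * and returns how many are still outstanding. */
static int
poll_once(struct xs_ctxt *ctxt)
{
	ctxt->polls++;
	if (ctxt->pending > 0)
		ctxt->pending--;
	return ctxt->pending;
}

/* Returns true if the drain loop ran, false if it was skipped. */
static bool
xs_ctxt_free(struct xs_ctxt *ctxt)
{
	/* After a failed init, the completions we would wait for may never
	 * arrive, so waiting here could spin forever. Skip the loop. */
	if (ctxt->init_failed)
		return false;

	while (poll_once(ctxt) > 0)
		; /* keep polling until nothing is outstanding */
	return true;
}
```

The trade-off in this sketch is that the skipped path leaves `pending` untouched, so any cleanup that depends on draining has to tolerate that.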
Even when not in CR mode, a pool currently ignores nonexistent children, leading to states that the current code cannot handle well (yet). This patch restricts this behavior to CR mode. (Ideally, in the future, we should allow the healthy children to start and automatically exclude the nonexistent ones.)
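The policy change can be pictured with a minimal sketch, assuming a simplified model in which each child either exists or does not. The `struct child`, the `ERR_CHILD_MISSING` code, and `start_children` are hypothetical and exist only for this example; they do not correspond to DAOS identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ERR_CHILD_MISSING (-1) /* hypothetical error code for this sketch */

/* Hypothetical model of a pool child. */
struct child {
	bool exists;  /* whether the child is present on storage */
	bool started; /* set once the child has been started */
};

/* Start all children. Missing children are tolerated only in CR mode;
 * otherwise the whole start fails before anything is started. */
static int
start_children(struct child *children, size_t n, bool cr_mode)
{
	size_t i;

	if (!cr_mode) {
		/* Validate first so a failure leaves no partially started state. */
		for (i = 0; i < n; i++) {
			if (!children[i].exists)
				return ERR_CHILD_MISSING;
		}
	}

	for (i = 0; i < n; i++) {
		if (!children[i].exists)
			continue; /* reachable only in CR mode */
		children[i].started = true;
	}
	return 0;
}
```

Checking for missing children before starting any of them is a choice made for this sketch, so that the failing path has nothing to undo.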
To make pool start failures easier to spot, this patch adds a RAS event that is raised when we fail to start a pool.
Also, this patch
- fixes "-1" xstream IDs in messages logged during engine setup,
- fixes a few places where we should write "pool" instead of "pool service", including the name of an internal environment variable, and
- adds some infrequent log messages to help future debugging of engine start issues.
Steps for the author:
- [ ] Commit message follows the guidelines.
- [ ] Appropriate Features or Test-tag pragmas were used.
- [ ] Appropriate Functional Test Stages were run.
- [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
- [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.
After all prior steps are complete:
- [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).
Ticket title is 'Aurora daos_perf: Cluster is failed to start rebuild after power cycle and loosing Three ranks'
Status is 'In Progress'
Labels: 'ALCF,alcf_cluster,alcf_test_rebuild,alcf_track'
https://daosio.atlassian.net/browse/DAOS-17442
Not sure when CI will recover; requesting reviews before CI results.
Test stage Unit Test on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16376/6/display/redirect
Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16376/6/display/redirect
Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16376/7/display/redirect
Test stage Unit Test on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16376/7/display/redirect
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16376/9/execution/node/1334/log
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16376/9/testReport/
Both failures seem to be due to some RPC timeouts that I can't explain.
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16376/10/execution/node/1467/log
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16376/11/testReport/
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16376/12/testReport/
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16376/12/execution/node/1433/log