daos DAOS-17442 engine: Improve start issue handling

The available log messages suggest that bio_xsctxt_alloc may get stuck after "failed to init bdevs: -1003", for an unknown reason related to SPDK. This patch skips the two while loops in bio_xsctxt_free where the hang may be occurring.

Even outside CR mode, a pool currently ignores nonexistent children, which leads to states that the current code cannot yet handle well. This patch restricts that behavior to CR mode. (Ideally, a future change should let the healthy children start and automatically exclude the nonexistent ones.)
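The restriction can be pictured with a minimal sketch, assuming a per-child "exists" flag and a boolean CR-mode switch. The names `child_sketch` and `pool_start_check_sketch` are hypothetical and are not taken from the DAOS source; the real check may be structured quite differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-target pool child; not a DAOS type. */
struct child_sketch {
	bool exists; /* assumed flag: the child is present on this target */
};

/*
 * Return 0 if the pool may start, -1 if the start must fail.
 * A nonexistent child is tolerated only when cr_mode is true.
 */
static int
pool_start_check_sketch(const struct child_sketch *children, size_t n,
			bool cr_mode)
{
	size_t i;

	assert(children != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (!children[i].exists && !cr_mode)
			return -1; /* outside CR mode, a missing child is an error */
	}
	return 0;
}
```

With one present and one missing child, this sketch lets the pool start in CR mode and fails the start otherwise, which is the point where a start-failure event would be reported.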

To make pool start failures easier to spot, this patch adds a RAS event that is raised when we fail to start a pool.

Also, this patch

- fixes "-1" xstream IDs in messages logged during engine setup,
- fixes a few places where we should write "pool" instead of "pool service", including the name of an internal environment variable, and
- adds some infrequent log messages to help future debugging of engine start issues.

Steps for the author:

[ ] Commit message follows the guidelines.
[ ] Appropriate Features or Test-tag pragmas were used.
[ ] Appropriate Functional Test Stages were run.
[ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
[ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

[ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).

May 13 '25 08:05 liw

Ticket title is 'Aurora daos_perf: Cluster is failed to start rebuild after power cycle and loosing Three ranks'
Status is 'In Progress'
Labels: 'ALCF,alcf_cluster,alcf_test_rebuild,alcf_track'
https://daosio.atlassian.net/browse/DAOS-17442

May 13 '25 08:05 github-actions[bot]

Not sure when CI will recover; requesting reviews before CI results.

May 20 '25 05:05 liw

Test stage Unit Test on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16376/6/display/redirect

May 26 '25 17:05 daosbuild3

Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16376/6/display/redirect

May 26 '25 17:05 daosbuild3

Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16376/7/display/redirect

May 26 '25 18:05 daosbuild3

Test stage Unit Test on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16376/7/display/redirect

May 26 '25 18:05 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16376/9/execution/node/1334/log

May 27 '25 08:05 daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16376/9/testReport/

May 27 '25 08:05 daosbuild3

Both failures seem to be due to some RPC timeouts that I can't explain.

May 28 '25 07:05 liw

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16376/10/execution/node/1467/log

May 31 '25 11:05 daosbuild3

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16376/11/testReport/

Jun 12 '25 04:06 daosbuild3

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16376/12/testReport/

Jun 23 '25 01:06 daosbuild3

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16376/12/execution/node/1433/log

Jun 23 '25 06:06 daosbuild3