DAOS-17442 engine: Improve start issue handling

Open liw opened this issue 8 months ago • 13 comments

The available log messages suggest that bio_xsctxt_alloc can get stuck after logging "failed to init bdevs: -1003", for an unknown reason related to SPDK. This patch skips two while loops in bio_xsctxt_free, which are the likely places where we got stuck.
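For reviewers who want the shape of this change without opening the diff, here is a minimal, self-contained sketch of the idea: a teardown routine that normally waits in a loop for outstanding work, but returns early when initialization already failed. Everything here is hypothetical (sketch_ctx, sketch_ctx_free, the init_failed flag, and both poll helpers); it is not the actual bio_xsctxt_free code, and the condition under which the real patch skips the loops is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical context; not the real DAOS/SPDK structure. */
struct sketch_ctx {
	bool init_failed; /* assumed flag: set when device init failed */
	int  pending;     /* work the teardown would normally wait for */
};

/* Hypothetical poll callback: returns how many items completed. */
typedef int (*sketch_poll_fn)(struct sketch_ctx *ctx);

/* Stand-in for a healthy backend: completes one item per poll. */
static int
sketch_poll_one(struct sketch_ctx *ctx)
{
	(void)ctx;
	return 1;
}

/* Stand-in for a wedged backend: never completes anything. */
static int
sketch_poll_stuck(struct sketch_ctx *ctx)
{
	(void)ctx;
	return 0;
}

/* Tear down the context and return the number of polls performed. */
static int
sketch_ctx_free(struct sketch_ctx *ctx, sketch_poll_fn poll)
{
	int polls = 0;

	assert(ctx != NULL && poll != NULL);

	/* After a failed init, waiting may never finish, so skip the loop. */
	if (ctx->init_failed)
		return polls;

	while (ctx->pending > 0) {
		ctx->pending -= poll(ctx);
		polls++;
	}
	return polls;
}
```

With init_failed set, sketch_ctx_free returns immediately even if the poll callback would never make progress; otherwise it waits as before.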

Even when not in CR mode, a pool currently ignores nonexistent children, which leads to states that the current code cannot handle well (yet). This patch restricts that behavior to CR mode. (Ideally, in the future we should let the healthy children start and automatically exclude the nonexistent ones.)
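The intended policy can be sketched as follows. This is an illustration under stated assumptions, not the code in this patch: sketch_child, sketch_start_children, and the SKETCH_* codes are made-up names, and rolling back already-started children on failure is an assumption about reasonable cleanup rather than something the patch is known to do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return codes; not DAOS error values. */
enum { SKETCH_OK = 0, SKETCH_ERR_NONEXIST = -1 };

/* Hypothetical per-target pool child. */
struct sketch_child {
	bool exists;  /* child is present on storage */
	bool started; /* child has been started */
};

/*
 * Start all children of a pool. A nonexistent child is ignored only in
 * CR mode; otherwise the whole start fails and earlier children are
 * rolled back (assumed cleanup policy).
 */
static int
sketch_start_children(struct sketch_child *children, size_t n, bool cr_mode)
{
	size_t i;

	assert(children != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (!children[i].exists) {
			if (cr_mode)
				continue; /* CR mode tolerates the gap */
			while (i > 0)
				children[--i].started = false;
			return SKETCH_ERR_NONEXIST;
		}
		children[i].started = true;
	}
	return SKETCH_OK;
}
```

Outside CR mode, a single missing child makes the start fail cleanly instead of leaving the pool partially started; in CR mode the missing child is skipped and the rest start.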

To make pool start failures easier to spot, this patch adds a RAS event that is raised when we fail to start a pool.

Also, this patch

  • fixes "-1" xstream IDs in messages logged during engine setup,
  • fixes a few places where we should write "pool" instead of "pool service", including the name of an internal environment variable, and
  • adds some infrequent log messages to help future debugging of engine start issues.

Steps for the author:

  • [ ] Commit message follows the guidelines.
  • [ ] Appropriate Features or Test-tag pragmas were used.
  • [ ] Appropriate Functional Test Stages were run.
  • [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).

liw avatar May 13 '25 08:05 liw

Ticket title is 'Aurora daos_perf: Cluster is failed to start rebuild after power cycle and loosing Three ranks' Status is 'In Progress' Labels: 'ALCF,alcf_cluster,alcf_test_rebuild,alcf_track' https://daosio.atlassian.net/browse/DAOS-17442

github-actions[bot] avatar May 13 '25 08:05 github-actions[bot]

Not sure when CI will recover; requesting reviews before CI results.

liw avatar May 20 '25 05:05 liw

Test stage Unit Test on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16376/6/display/redirect

daosbuild3 avatar May 26 '25 17:05 daosbuild3

Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16376/6/display/redirect

daosbuild3 avatar May 26 '25 17:05 daosbuild3

Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16376/7/display/redirect

daosbuild3 avatar May 26 '25 18:05 daosbuild3

Test stage Unit Test on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16376/7/display/redirect

daosbuild3 avatar May 26 '25 18:05 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16376/9/execution/node/1334/log

daosbuild3 avatar May 27 '25 08:05 daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16376/9/testReport/

daosbuild3 avatar May 27 '25 08:05 daosbuild3

Both failures seem to be due to some RPC timeouts that I can't explain.

liw avatar May 28 '25 07:05 liw

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16376/10/execution/node/1467/log

daosbuild3 avatar May 31 '25 11:05 daosbuild3

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16376/11/testReport/

daosbuild3 avatar Jun 12 '25 04:06 daosbuild3

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16376/12/testReport/

daosbuild3 avatar Jun 23 '25 01:06 daosbuild3

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16376/12/execution/node/1433/log

daosbuild3 avatar Jun 23 '25 06:06 daosbuild3