DAOS-17049 control: Allow graceful shutdown for specific ranks (#16305)

Open kjacque opened this issue 8 months ago • 6 comments

The CR checker tool needs to shut ranks down gracefully when stopping them. Because some ranks may be admin-excluded, it may need to stop only a subset of ranks rather than the whole system.

  • Remove the limitation that a non-forced shutdown can only be applied to the whole system and not to a specified set of ranks.
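
The rule change above can be sketched roughly as follows. This is an illustration only, not the DAOS control-plane code: every name here (`StopRequest`, `validate_stop`, `allow_ranked_graceful`) is hypothetical, and the flag exists only to contrast the old and new behavior.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StopRequest:
    """Hypothetical stand-in for a system stop request."""
    force: bool = False
    # An empty list means "the whole system"; otherwise only these ranks.
    ranks: List[int] = field(default_factory=list)


def validate_stop(req: StopRequest, allow_ranked_graceful: bool = True) -> None:
    """Raise ValueError if the request is rejected.

    With allow_ranked_graceful=False this models the old limitation:
    a non-forced stop could only target the whole system.
    With the default (True) it models the relaxed rule described above.
    """
    if not req.force and req.ranks and not allow_ranked_graceful:
        raise ValueError("graceful stop of specific ranks is not allowed")
```

Under the relaxed rule, `validate_stop(StopRequest(ranks=[1, 2]))` passes, while the same request with `allow_ranked_graceful=False` is rejected unless `force=True`.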

Features: control recovery

Steps for the author:

  • [ ] Commit message follows the guidelines.
  • [ ] Appropriate Features or Test-tag pragmas were used.
  • [ ] Appropriate Functional Test Stages were run.
  • [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).

kjacque avatar Jun 17 '25 15:06 kjacque

Ticket title is 'recovery/ms_membership.py:MSMembershipTest.test_checker_on_admin_excluded - errors with dmg check commands'
Status is 'Awaiting backport'
Labels: '2.6.3rc2,2.6.3rc3,2.6.3rc4,2.6.4rc1,ci-taskforce,ci_2.6_daily,ci_master_daily,daily_test,scrubbed_2.8'
Job should run at elevated priority (1)
https://daosio.atlassian.net/browse/DAOS-17049

github-actions[bot] avatar Jun 17 '25 15:06 github-actions[bot]

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16521/1/testReport/

daosbuild3 avatar Jun 17 '25 16:06 daosbuild3

Looks like the unit test changes are based on #16291, so there is a conflict without that patch.

kjacque avatar Jun 17 '25 19:06 kjacque

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16521/2/testReport/

daosbuild3 avatar Jun 25 '25 15:06 daosbuild3

#16291 has landed on the release/2.6 branch, so I've merged that change in and run the unit tests locally to verify that they now pass.

kjacque avatar Jun 27 '25 17:06 kjacque

Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16521/3/execution/node/1138/log

daosbuild3 avatar Jun 27 '25 22:06 daosbuild3

Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16521/4/execution/node/430/log

daosbuild3 avatar Jun 28 '25 00:06 daosbuild3

This PR is failing on an unrelated known issue during the functional test stage: https://daosio.atlassian.net/browse/DAOS-16095

I'm re-running with "Allow-unstable-test: true" to allow it to progress to the hardware stage.

kjacque avatar Jun 30 '25 20:06 kjacque

Test stage Functional Hardware Medium UCX Provider completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16521/7/execution/node/1609/log

daosbuild3 avatar Jul 02 '25 04:07 daosbuild3

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16521/7/execution/node/1914/log

daosbuild3 avatar Jul 02 '25 06:07 daosbuild3

Test failures are known issues:

  • SIGSEGV crashes for CR core tests on UCX: https://daosio.atlassian.net/browse/DAOS-16449
  • EC online rebuild test timing out during rebuild: https://daosio.atlassian.net/issues/DAOS-17751

The verbs version of the CR core tests passed, as did the test this is intended to fix. This will resolve a test failure on the 2.6 branch.

kjacque avatar Jul 02 '25 18:07 kjacque