DAOS-17049 control: Allow graceful shutdown for specific ranks (#16305)
The CR checker tool needs to use a graceful shutdown when stopping ranks. It may select a subset of ranks if some are admin-excluded.
- Remove the limitation that a non-forced (graceful) shutdown could only be applied to the whole system and not to a specified set of ranks.
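As a rough illustration of the restriction being lifted, here is a minimal Go sketch. The type `stopRequest`, the error value, and the functions `validateStopBefore` and `validateStopAfter` are hypothetical names made up for this example. They are not the actual DAOS control-plane code, and the real check may be structured differently.

```go
package main

import "errors"

// stopRequest is a hypothetical stand-in for a system stop request.
// It is NOT the real DAOS control-plane type.
type stopRequest struct {
	Force bool     // true requests a forced shutdown
	Ranks []uint32 // empty means "all ranks in the system"
}

// errGracefulNeedsFullSystem is an illustrative error, not a DAOS error value.
var errGracefulNeedsFullSystem = errors.New("graceful stop requires the whole system")

// validateStopBefore models the old restriction (assumed shape): a
// non-forced stop could not be combined with an explicit rank list.
func validateStopBefore(req stopRequest) error {
	if !req.Force && len(req.Ranks) > 0 {
		return errGracefulNeedsFullSystem
	}
	return nil
}

// validateStopAfter models the relaxed behavior: an explicit rank list is
// accepted whether or not the stop is forced.
func validateStopAfter(req stopRequest) error {
	return nil
}
```

Under these assumptions, a request such as `stopRequest{Force: false, Ranks: []uint32{0, 2}}` is rejected by `validateStopBefore` and accepted by `validateStopAfter`. That is the case the checker needs when some ranks are admin-excluded.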
Features: control recovery
Steps for the author:
- [ ] Commit message follows the guidelines.
- [ ] Appropriate Features or Test-tag pragmas were used.
- [ ] Appropriate Functional Test Stages were run.
- [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
- [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.
After all prior steps are complete:
- [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).
Ticket title is 'recovery/ms_membership.py:MSMembershipTest.test_checker_on_admin_excluded - errors with dmg check commands'
Status is 'Awaiting backport'
Labels: '2.6.3rc2,2.6.3rc3,2.6.3rc4,2.6.4rc1,ci-taskforce,ci_2.6_daily,ci_master_daily,daily_test,scrubbed_2.8'
Job should run at elevated priority (1)
https://daosio.atlassian.net/browse/DAOS-17049
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16521/1/testReport/
Looks like the unit test changes are based on #16291, so there is a conflict without that patch.
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16521/2/testReport/
#16291 has landed on the release/2.6 branch, so I've merged that change in and run the unit tests locally to verify that they now pass.
Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16521/3/execution/node/1138/log
Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16521/4/execution/node/430/log
This PR is failing on an unrelated known issue during the functional test stage: https://daosio.atlassian.net/browse/DAOS-16095
I'm re-running with "Allow-unstable-test: true" to allow it to progress to the hardware stage.
Test stage Functional Hardware Medium UCX Provider completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16521/7/execution/node/1609/log
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16521/7/execution/node/1914/log
Test failures are known issues:
- SIGSEGV crashes for CR core tests on UCX: https://daosio.atlassian.net/browse/DAOS-16449
- EC online rebuild test timing out during rebuild: https://daosio.atlassian.net/issues/DAOS-17751
The verbs version of the CR core tests passed, as did the test this is intended to fix. This will resolve a test failure on the 2.6 branch.