DAOS-17643 control: Disallow most transitions from Excluded
Excluded ranks should be able to move only to the Joined or AdminExcluded states. Regardless of the process state, without re-joining successfully, the rank is still effectively excluded.
We saw a real-life case where Excluded ranks transitioned to the Errored state after they SIGKILLed themselves upon discovering their exclusion.
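To make the rule above concrete, here is a minimal sketch of the check in Go. It is not the actual change in this PR: the MemberState type, the state constants, and isLegalFromExcluded are hypothetical names chosen for illustration, and only the rule stated above (Excluded may move only to Joined or AdminExcluded) is modeled.

```go
package main

import "fmt"

// MemberState is a hypothetical stand-in for the control plane's
// member state type; the real definition lives in the DAOS source.
type MemberState int

const (
	StateJoined MemberState = iota
	StateExcluded
	StateAdminExcluded
	StateErrored
	StateStopped
)

// String returns a readable name for the state.
func (s MemberState) String() string {
	switch s {
	case StateJoined:
		return "Joined"
	case StateExcluded:
		return "Excluded"
	case StateAdminExcluded:
		return "AdminExcluded"
	case StateErrored:
		return "Errored"
	case StateStopped:
		return "Stopped"
	default:
		return "Unknown"
	}
}

// isLegalFromExcluded reports whether a rank that is currently Excluded
// may move to the requested state. Only Joined and AdminExcluded are
// allowed; everything else (including Errored) is rejected.
func isLegalFromExcluded(to MemberState) bool {
	switch to {
	case StateJoined, StateAdminExcluded:
		return true
	default:
		return false
	}
}

func main() {
	targets := []MemberState{StateJoined, StateAdminExcluded, StateErrored, StateStopped}
	for _, to := range targets {
		fmt.Printf("Excluded -> %s: %t\n", to, isLegalFromExcluded(to))
	}
}
```

In this sketch the Excluded to Errored request from the SIGKILL case is rejected, so the rank stays Excluded until it re-joins or is administratively excluded.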
Features: control
Steps for the author:
- [x] Commit message follows the guidelines.
- [x] Appropriate Features or Test-tag pragmas were used.
- [ ] Appropriate Functional Test Stages were run.
- [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
- [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.
After all prior steps are complete:
- [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).
Ticket title is 'Excluded engine can transition to "Errored" state when it kills itself after finding it was excluded' Status is 'In Review' https://daosio.atlassian.net/browse/DAOS-17643
Seems like a reasonable change. What about the output of dmg system query? Does the "reason" field in the table still get updated when the rank terminates? Requesting changes just so we can verify before landing.
I still have the dmg system query -v outputs from my tests. With this PR:
3 92cfec45-8539-465c-8152-35d1ae36f949 10.214.208.181:31415 /hsw-252.daos.hpc.amslabs.hpecorp.net Excluded
Without this PR:
1 9c03dcd1-d430-4ef2-b0a0-ea73826bdbe5 10.214.208.181:31415 /hsw-252.daos.hpc.amslabs.hpecorp.net Errored DAOS engine 0 exited unexpectedly: /home/liwei/daos/install/bin/daos_engine exited: signal: killed
It looks like the Info field of a member won't be updated if the state transition is not legal: https://github.com/daos-stack/daos/blob/master/src/control/system/membership.go#L556
While I agree that it would be nice to see the additional information in the system query output, IMO that's a more invasive change to do well and it would be better to handle it as a separate task. Getting the fix in for this improper state transition is a high priority, I think.
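The behavior described here (no Info update when the transition is rejected) can be sketched as follows. This is an illustrative model only, not the code in membership.go: the Member struct, the State type, legal, and applyUpdate are hypothetical names, and transitions from states other than Excluded are simply assumed to be allowed.

```go
package main

import "fmt"

// State and Member are simplified, hypothetical stand-ins for the
// control plane's member types.
type State string

const (
	Joined        State = "Joined"
	Excluded      State = "Excluded"
	AdminExcluded State = "AdminExcluded"
	Errored       State = "Errored"
)

type Member struct {
	Rank  uint32
	State State
	Info  string // free-form reason shown in the system query output
}

// legal models only the rule discussed in this PR: from Excluded, a
// member may move to Joined or AdminExcluded. Other source states are
// assumed to be unrestricted for the purposes of this sketch.
func legal(from, to State) bool {
	if from != Excluded {
		return true
	}
	return to == Joined || to == AdminExcluded
}

// applyUpdate applies a requested state change. When the transition is
// not legal it returns false and leaves both State and Info untouched,
// which is why the "reason" text is lost in the Excluded case.
func applyUpdate(m *Member, to State, info string) bool {
	if !legal(m.State, to) {
		return false
	}
	m.State = to
	m.Info = info
	return true
}

func main() {
	m := &Member{Rank: 3, State: Excluded}
	ok := applyUpdate(m, Errored, "DAOS engine 0 exited unexpectedly")
	fmt.Printf("applied=%t state=%s info=%q\n", ok, m.State, m.Info)
}
```

Keeping the reason visible would mean separating the Info update from the state update, which is the more invasive follow-up mentioned above.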
Right, so this is definitely a good fix that improves determinism and fixes a significant bug. We need to be aware, though, that we are losing information that is important for debugging, and that should be addressed in a follow-up task.
So do we want a unit test for this?
I was debating with myself on that. While we have a number of complex MS Go tests, it doesn't seem like we have any Go tests verifying this state transition. Thought of writing one for the function itself -- that doesn't exist yet either.
@liw If you have a reproducer, can you provide the steps? Maybe we could create an ftest to exercise the full stack.
I used a trick:
1. Create a system of 4 engines.
2. Create a pool on only one engine, rank X.
3. On one server, invoke gdb, then in the GDB session, invoke attach 12345, where "12345" is the PID of the local daos_engine process, rank Y, where Y != X. This trick pauses the execution of rank Y.
4. Wait for the paused rank Y to be excluded from the system.
5. In the GDB session, invoke q and then type y. This resumes the execution of rank Y.
6. Observe that the resumed rank Y detects its exclusion from the system and commits suicide.
7. Destroy the pool.
Without this PR, in step 6 the engine state becomes "Errored", and in step 7 the pool destroy times out. With this PR, in step 6 the engine state remains "Excluded", and in step 7 the pool destroy succeeds.
By the way, in the scenario of my test, I think the end result of "Excluded" with no "Reason" makes sense. Since the "reason" that comes with the "Errored" update is a consequence of "Excluded", it would be confusing to show it as the "reason" for "Excluded".
ok, let's leave it like that then.
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16475/1/execution/node/1399/log
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16475/1/execution/node/1413/log
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16475/1/testReport/
Test stage Build RPM on EL 8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16475/3/execution/node/364/log
Test stage Build RPM on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16475/3/execution/node/367/log
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16475/3/execution/node/326/log
Test stage Build RPM on EL 8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16475/5/execution/node/339/log
Test stage Build RPM on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16475/5/execution/node/338/log
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16475/5/execution/node/352/log
Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16475/6/execution/node/1623/log
There is only one test failure, and it is a known issue: https://daosio.atlassian.net/browse/DAOS-17657
Requesting forced landing.