DAOS-17643 control: Disallow most transitions from Excluded

Open kjacque opened this issue 7 months ago • 11 comments

Excluded ranks should be able to move only to the Joined or AdminExcluded states. Regardless of the process state, without re-joining successfully, the rank is still effectively excluded.

We saw a real life case where Excluded ranks were transitioning to the Errored state when they SIGKILLed themselves after discovering their exclusion.
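The rule described above can be sketched as a small transition check. This is only an illustration under stated assumptions: the `State` type, its constants, and `canTransition` are hypothetical names invented for this example and are not the identifiers used in the DAOS control plane. Only the Excluded rule comes from the description; treating every other non-identical transition as allowed is a simplification.

```go
package main

// State is a hypothetical stand-in for a rank's membership state.
type State int

const (
	Joined State = iota
	Excluded
	AdminExcluded
	Errored
)

// allowedFromExcluded lists the only targets the description permits
// for a rank that is currently Excluded.
var allowedFromExcluded = map[State]bool{
	Joined:        true,
	AdminExcluded: true,
}

// canTransition reports whether a move from one state to another is allowed.
// Assumption: transitions out of states other than Excluded are all
// permitted here, which is a simplification for the sketch.
func canTransition(from, to State) bool {
	if from == to {
		return false // identical state is not a transition
	}
	if from == Excluded {
		return allowedFromExcluded[to]
	}
	return true
}
```

With a check like this, an Excluded rank whose process dies would fail `canTransition(Excluded, Errored)` and stay Excluded, while a successful re-join would pass `canTransition(Excluded, Joined)`.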

Features: control

Steps for the author:

  • [x] Commit message follows the guidelines.
  • [x] Appropriate Features or Test-tag pragmas were used.
  • [ ] Appropriate Functional Test Stages were run.
  • [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).

kjacque avatar Jun 04 '25 23:06 kjacque

Ticket title is 'Excluded engine can transition to "Errored" state when it kills itself after finding it was excluded' Status is 'In Review' https://daosio.atlassian.net/browse/DAOS-17643

github-actions[bot] avatar Jun 04 '25 23:06 github-actions[bot]

Seems like a reasonable change. What about the output of dmg system query? Does the "reason" field in the table still get updated when the rank terminates? Requesting changes just so we can verify before landing.

I still have the dmg system query -v outputs from my tests. With this PR:

3    92cfec45-8539-465c-8152-35d1ae36f949 10.214.208.181:31415 /hsw-252.daos.hpc.amslabs.hpecorp.net Excluded

Without this PR:

1    9c03dcd1-d430-4ef2-b0a0-ea73826bdbe5 10.214.208.181:31415 /hsw-252.daos.hpc.amslabs.hpecorp.net Errored DAOS engine 0 exited unexpectedly: /home/liwei/daos/install/bin/daos_engine exited: signal: killed

liw avatar Jun 05 '25 14:06 liw

Seems like a reasonable change. What about the output of dmg system query? Does the "reason" field in the table still get updated when the rank terminates? Requesting changes just so we can verify before landing.

It looks like the Info field of a member won't be updated if the state transition is not legal: https://github.com/daos-stack/daos/blob/master/src/control/system/membership.go#L556

While I agree that it would be nice to see the additional information in the system query output, IMO that's a more invasive change to do well and it would be better to handle it as a separate task. Getting the fix in for this improper state transition is a high priority, I think.
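To illustrate why the reason text disappears, here is a minimal sketch of an update routine that rejects an illegal transition before touching any field. It is an assumption-laden example, not the code at the linked line: `Member`, `update`, and the `State` constants are hypothetical names chosen for this illustration.

```go
package main

// State is a hypothetical stand-in for a rank's membership state.
type State int

const (
	Joined State = iota
	Excluded
	AdminExcluded
	Errored
)

// Member is a hypothetical membership record: the current state plus the
// free-form info string that a verbose query would display.
type Member struct {
	Rank  uint32
	State State
	Info  string
}

// update applies a reported state change and returns true if it was applied.
// When the transition is illegal, it returns early, so neither State nor
// Info is modified. That early return is why the exit message is lost.
func (m *Member) update(to State, info string) bool {
	if m.State == Excluded && to != Joined && to != AdminExcluded {
		return false
	}
	m.State = to
	m.Info = info
	return true
}
```

Carrying the extra information through would mean separating "record the info" from "change the state", which is the more invasive follow-up mentioned above.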

mjmac avatar Jun 05 '25 14:06 mjmac

Seems like a reasonable change. What about the output of dmg system query? Does the "reason" field in the table still get updated when the rank terminates? Requesting changes just so we can verify before landing.

I still have the dmg system query -v outputs from my tests. With this PR:

3    92cfec45-8539-465c-8152-35d1ae36f949 10.214.208.181:31415 /hsw-252.daos.hpc.amslabs.hpecorp.net Excluded

Without this PR:

1    9c03dcd1-d430-4ef2-b0a0-ea73826bdbe5 10.214.208.181:31415 /hsw-252.daos.hpc.amslabs.hpecorp.net Errored DAOS engine 0 exited unexpectedly: /home/liwei/daos/install/bin/daos_engine exited: signal: killed

Right, so this is definitely a good fix: it improves determinism and fixes a significant bug. But we need to be aware that we are losing information that is important for debugging, and that should be addressed as a follow-up task.

tanabarr avatar Jun 05 '25 14:06 tanabarr

So do we want a unit test for this?

I was debating with myself on that. While we have a number of complex MS Go tests, it doesn't seem like we have any Go tests verifying this state transition. Thought of writing one for the function itself -- that doesn't exist yet either.

@liw If you have a reproducer, can you provide the steps? Maybe we could create an ftest to exercise the full stack.

kjacque avatar Jun 05 '25 18:06 kjacque

@liw If you have a reproducer, can you provide the steps? Maybe we could create an ftest to exercise the full stack.

I used a trick:

  1. Create a system of 4 engines.
  2. Create a pool on only one engine, rank X.
  3. On one server, invoke gdb, then in the GDB session, invoke attach 12345, where "12345" is the PID of the local daos_engine process, rank Y, where Y != X. This trick pauses the execution of rank Y.
  4. Wait for the paused rank Y to be excluded from the system.
  5. In the GDB session, invoke q and then type y. This resumes the execution of rank Y.
  6. Observe that the resumed rank Y detects its exclusion from the system and commits suicide.
  7. Destroy the pool.

Without this PR, in step 6 the engine state becomes "Errored", and in step 7 the pool destroy times out. With this PR, in step 6 the engine state remains "Excluded", and in step 7 the pool destroy succeeds.

liw avatar Jun 05 '25 23:06 liw

By the way, in the scenario of my test, I think the end result of "Excluded" with no "Reason" makes sense. Since the "reason" that comes with the "Errored" update is a consequence of the exclusion, it would be confusing to show it as the "reason" for "Excluded".

liw avatar Jun 05 '25 23:06 liw

By the way, in the scenario of my test, I think the end result of "Excluded" with no "Reason" makes sense. Since the "reason" that comes with the "Errored" update is a consequence of the exclusion, it would be confusing to show it as the "reason" for "Excluded".

OK, let's leave it like that then.

tanabarr avatar Jun 06 '25 09:06 tanabarr

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16475/1/execution/node/1399/log

daosbuild3 avatar Jun 09 '25 18:06 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16475/1/execution/node/1413/log

daosbuild3 avatar Jun 09 '25 21:06 daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16475/1/testReport/

daosbuild3 avatar Jun 09 '25 22:06 daosbuild3

Test stage Build RPM on EL 8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16475/3/execution/node/364/log

daosbuild3 avatar Jul 17 '25 19:07 daosbuild3

Test stage Build RPM on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16475/3/execution/node/367/log

daosbuild3 avatar Jul 17 '25 19:07 daosbuild3

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16475/3/execution/node/326/log

daosbuild3 avatar Jul 17 '25 19:07 daosbuild3

Test stage Build RPM on EL 8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16475/5/execution/node/339/log

daosbuild3 avatar Jul 17 '25 23:07 daosbuild3

Test stage Build RPM on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16475/5/execution/node/338/log

daosbuild3 avatar Jul 17 '25 23:07 daosbuild3

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16475/5/execution/node/352/log

daosbuild3 avatar Jul 17 '25 23:07 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16475/6/execution/node/1623/log

daosbuild3 avatar Jul 23 '25 02:07 daosbuild3

There is only one test failure, and it is a known issue: https://daosio.atlassian.net/browse/DAOS-17657

Requesting forced landing.

kjacque avatar Jul 23 '25 18:07 kjacque