DAOS-17111 swim: Parse untrustable updates about self

Open liw opened this issue 7 months ago • 7 comments

When an engine starts up, there is a window after it gets its rank but before it gets the latest system map. During that window the engine skips parsing all incoming swim updates, because the current swim code considers every other member untrustable. This makes the engine unusually vulnerable to transient suspicions, since it never learns that another member suspects it:

swim_dump_updates() 2 <= 1: {2 S 2241541897092071424} {1 A 2241415852910968832}
swim_updates_parse() 2: skip untrustable update from 1, rc = -1005

Hence, this patch lets swim parse the "untrustable" updates: it reacts to SUSPECT and DEAD updates about the engine itself, but still skips updates about other members.
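
A minimal sketch of that filtering rule, written in Python purely for illustration. It is not the DAOS C code: `Status`, `Update`, and `filter_untrusted_updates` are hypothetical names, and the three-field update layout (member id, status letter, incarnation) is only inferred from the log line above.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    ALIVE = "A"
    SUSPECT = "S"
    DEAD = "D"


@dataclass(frozen=True)
class Update:
    member_id: int    # member the update is about
    status: Status    # claimed state of that member
    incarnation: int  # incarnation number carried by the update


def filter_untrusted_updates(self_id, updates):
    """Return the updates worth acting on when the sender is untrustable.

    Only SUSPECT and DEAD updates about the receiving member itself are
    kept; everything about other members is skipped, as is an ALIVE
    update about self.
    """
    return [
        u
        for u in updates
        if u.member_id == self_id and u.status in (Status.SUSPECT, Status.DEAD)
    ]


# The updates from the log above: rank 2 receives two updates from rank 1.
received = [
    Update(2, Status.SUSPECT, 2241541897092071424),
    Update(1, Status.ALIVE, 2241415852910968832),
]
kept = filter_untrusted_updates(2, received)
```

With the logged input, `kept` holds only the SUSPECT update about rank 2; the ALIVE update about rank 1 is dropped because rank 2 cannot yet trust the sender's view of other members. How the engine then reacts to the kept update is not modeled here.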

Steps for the author:

  • [ ] Commit message follows the guidelines.
  • [ ] Appropriate Features or Test-tag pragmas were used.
  • [ ] Appropriate Functional Test Stages were run.
  • [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).

liw avatar Jun 17 '25 08:06 liw

Ticket title is '[SWIM] Zombie Node Messes Up SWIM' Status is 'Open' Labels: 'google-cloud-daos' https://daosio.atlassian.net/browse/DAOS-17111

github-actions[bot] avatar Jun 17 '25 08:06 github-actions[bot]

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16519/1/testReport/

daosbuild3 avatar Jun 17 '25 09:06 daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16519/1/testReport/

daosbuild3 avatar Jun 19 '25 15:06 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16519/1/testReport/

daosbuild3 avatar Jun 20 '25 03:06 daosbuild3

Pinging reviewers...

liw avatar Jul 13 '25 23:07 liw

Pinging reviewers...

liw avatar Aug 28 '25 00:08 liw

@frostedcmos, @liuxuezhao, @jgmoore-or, ping...

liw avatar Sep 03 '25 23:09 liw