
DAOS-16170 control: Ignore EngineDied event for old incarnation

Open kjacque opened this issue 7 months ago • 8 comments

An EngineDied event can be forwarded late, after the engine has already re-joined. Handling the stale event incorrectly re-marks the rank as Errored.

  • Include incarnation in engine-related events.
  • Print incarnation in logs if provided.
  • Do not update the member if the EngineDied event is for an old incarnation.
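
The check described in the bullets above can be sketched in Go roughly as follows. This is a minimal illustration, not the code from this PR: the `Member` and `EngineDiedEvent` types, the `HandleEngineDied` function, and the convention that an incarnation of 0 means "not provided" are all hypothetical names and assumptions made for the example. It also assumes that a newer engine incarnation always has a larger number than an older one.

```go
package main

import "fmt"

// MemberState is a hypothetical, simplified stand-in for a member's state.
type MemberState int

const (
	StateJoined MemberState = iota
	StateErrored
)

// Member is a hypothetical record of a rank and its current incarnation.
type Member struct {
	Rank        uint32
	Incarnation uint64
	State       MemberState
}

// EngineDiedEvent is a hypothetical event type. An Incarnation of 0 is
// assumed here to mean that the sender did not provide one.
type EngineDiedEvent struct {
	Rank        uint32
	Incarnation uint64
}

// String includes the incarnation in the log text only if it was provided.
func (e EngineDiedEvent) String() string {
	if e.Incarnation == 0 {
		return fmt.Sprintf("engine died: rank %d", e.Rank)
	}
	return fmt.Sprintf("engine died: rank %d, incarnation %d", e.Rank, e.Incarnation)
}

// HandleEngineDied marks the member as errored unless the event refers to
// an older incarnation than the one currently recorded. It reports whether
// the member was updated.
func HandleEngineDied(m *Member, evt EngineDiedEvent) bool {
	if evt.Rank != m.Rank {
		return false // event is for a different rank
	}
	if evt.Incarnation != 0 && evt.Incarnation < m.Incarnation {
		return false // stale event: the engine has re-joined since it was raised
	}
	m.State = StateErrored
	return true
}

func main() {
	m := &Member{Rank: 3, Incarnation: 7, State: StateJoined}

	// A late event from the previous incarnation is ignored.
	stale := EngineDiedEvent{Rank: 3, Incarnation: 6}
	fmt.Println(stale, "applied:", HandleEngineDied(m, stale))

	// An event for the current incarnation still updates the member.
	current := EngineDiedEvent{Rank: 3, Incarnation: 7}
	fmt.Println(current, "applied:", HandleEngineDied(m, current))
}
```

In this sketch, an event without an incarnation is still applied, so senders that do not include the field keep the previous behavior. Whether the actual change treats missing incarnations this way is not stated in the description.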

Features: control

Steps for the author:

  • [x] Commit message follows the guidelines.
  • [x] Appropriate Features or Test-tag pragmas were used.
  • [ ] Appropriate Functional Test Stages were run.
  • [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).

kjacque avatar Jun 14 '25 00:06 kjacque

Ticket title is 'recovery/cat_recov_core.py:CatRecovCoreTest.test_daos_cat_recov_core - server was not found in its expected state - 17 TEST(S) FAILED'
Status is 'In Review'
Labels: 'ci-taskforce,ci_2.6_daily,ci_master_daily,daily_test,scrubbed_2.8'
Job should run at elevated priority (1)
https://daosio.atlassian.net/browse/DAOS-16170

github-actions[bot] avatar Jun 14 '25 00:06 github-actions[bot]

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16511/1/testReport/

daosbuild3 avatar Jun 15 '25 17:06 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16511/1/execution/node/1427/log

daosbuild3 avatar Jun 15 '25 17:06 daosbuild3

@kjacque this PR needs to run with the recovery tag/feature.

phender avatar Jun 16 '25 20:06 phender

> @kjacque this PR needs to run with the recovery tag/feature.

Good catch. I rebased on master and amended the commit pragma.

kjacque avatar Jun 16 '25 23:06 kjacque

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16511/3/execution/node/627/log

daosbuild3 avatar Jun 18 '25 08:06 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16511/3/execution/node/672/log

daosbuild3 avatar Jun 18 '25 09:06 daosbuild3

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16511/5/execution/node/1338/log

daosbuild3 avatar Jun 26 '25 14:06 daosbuild3

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16511/7/execution/node/1368/log

daosbuild3 avatar Jul 04 '25 00:07 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16511/7/execution/node/1323/log

daosbuild3 avatar Jul 04 '25 04:07 daosbuild3

Test stage Functional on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16511/9/testReport/

daosbuild3 avatar Jul 26 '25 18:07 daosbuild3

> probably want to mention superblock related updates somewhere in the commit message

Good idea, I added a line to the description so it can be used as the commit message.

kjacque avatar Jul 28 '25 15:07 kjacque

Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16511/11/execution/node/1268/log

daosbuild3 avatar Aug 07 '25 11:08 daosbuild3

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16511/12/execution/node/793/log

daosbuild3 avatar Aug 08 '25 00:08 daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16511/12/execution/node/804/log

daosbuild3 avatar Aug 08 '25 04:08 daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16511/13/execution/node/1498/log

daosbuild3 avatar Aug 09 '25 08:08 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16511/13/execution/node/1509/log

daosbuild3 avatar Aug 09 '25 08:08 daosbuild3

Test stage Functional Hardware Large MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16511/13/testReport/

daosbuild3 avatar Aug 12 '25 05:08 daosbuild3

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16511/14/execution/node/1464/log

daosbuild3 avatar Aug 14 '25 10:08 daosbuild3

Test failures are known issues:

  • https://daosio.atlassian.net/browse/DAOS-17888
  • https://daosio.atlassian.net/browse/DAOS-17751

I know we can't land anything until CI is up next week, but whenever things re-open, this one is ready.

kjacque avatar Aug 14 '25 20:08 kjacque