DAOS-17666 pool: Fix property self_heal.exclude

Open liw opened this issue 7 months ago • 8 comments

The "exclude" flag of the "self_heal" pool property currently has no effect, as shown by the experiment described in the Jira ticket. This patch checks the flag before excluding a rank from a pool, and restarts exclusion when "self_heal" is changed to include "exclude".
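The two behaviors described above can be illustrated with a minimal, self-contained sketch. This is not the code in the patch: the struct, the flag macros, and both function names are hypothetical and exist only to show the idea of gating exclusion on the flag and revisiting failed ranks once the flag is turned on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits for the "self_heal" pool property. */
#define SELF_HEAL_EXCLUDE (1u << 0)
#define SELF_HEAL_REBUILD (1u << 1)

#define MAX_RANKS 8

/* Hypothetical, heavily simplified pool state (illustration only). */
struct pool_sketch {
	uint32_t self_heal;           /* current self_heal flag bits */
	bool     dead[MAX_RANKS];     /* ranks reported as failed */
	bool     excluded[MAX_RANKS]; /* ranks already excluded */
};

/*
 * Exclude a failed rank only when the "exclude" flag is set.
 * Returns true if this call excluded the rank.
 */
static bool
pool_try_exclude(struct pool_sketch *pool, size_t rank)
{
	assert(pool != NULL && rank < MAX_RANKS);

	if (!(pool->self_heal & SELF_HEAL_EXCLUDE))
		return false; /* flag is off: leave the rank in the pool */
	if (!pool->dead[rank] || pool->excluded[rank])
		return false; /* nothing to do for this rank */

	pool->excluded[rank] = true;
	return true;
}

/*
 * Update self_heal. If "exclude" goes from off to on, revisit the
 * failed ranks that were skipped while the flag was off.
 * Returns the number of ranks excluded by this call.
 */
static int
pool_set_self_heal(struct pool_sketch *pool, uint32_t new_flags)
{
	bool was_on;
	int  nexcluded = 0;

	assert(pool != NULL);

	was_on = (pool->self_heal & SELF_HEAL_EXCLUDE) != 0;
	pool->self_heal = new_flags;

	if (was_on || !(new_flags & SELF_HEAL_EXCLUDE))
		return 0; /* no off-to-on transition, nothing to restart */

	for (size_t rank = 0; rank < MAX_RANKS; rank++) {
		if (pool_try_exclude(pool, rank))
			nexcluded++;
	}
	return nexcluded;
}
```

In this sketch, a rank that fails while "exclude" is off stays in the pool, and a later call to pool_set_self_heal() that turns the flag on excludes it. The actual patch may structure this differently.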

Steps for the author:

  • [ ] Commit message follows the guidelines.
  • [ ] Appropriate Features or Test-tag pragmas were used.
  • [ ] Appropriate Functional Test Stages were run.
  • [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).

liw avatar Jun 10 '25 07:06 liw

Ticket title is 'Flag exclude of pool property self_heal has no effect' Status is 'In Progress' https://daosio.atlassian.net/browse/DAOS-17666

github-actions[bot] avatar Jun 10 '25 07:06 github-actions[bot]

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16495/2/testReport/

daosbuild3 avatar Jun 12 '25 04:06 daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16495/2/testReport/

daosbuild3 avatar Jun 12 '25 20:06 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16495/2/execution/node/1474/log

daosbuild3 avatar Jun 12 '25 21:06 daosbuild3

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16495/3/testReport/

daosbuild3 avatar Jun 16 '25 15:06 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16495/3/execution/node/1399/log

daosbuild3 avatar Jun 16 '25 16:06 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16495/5/execution/node/1369/log

daosbuild3 avatar Jun 23 '25 16:06 daosbuild3

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16495/6/testReport/

daosbuild3 avatar Jun 25 '25 02:06 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16495/6/execution/node/1460/log

daosbuild3 avatar Jul 03 '25 01:07 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16495/7/execution/node/1518/log

daosbuild3 avatar Jul 09 '25 05:07 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16495/8/execution/node/1540/log

daosbuild3 avatar Jul 10 '25 23:07 daosbuild3

The daos_rebuild_ec REBUILD48 fix depends on #16503. Hence, I've canceled the build job.

liw avatar Jul 11 '25 06:07 liw

> The daos_rebuild_ec REBUILD48 fix depends on #16503. Hence, I've canceled the build job.

The dependency has landed; merging and resuming testing...

liw avatar Aug 02 '25 08:08 liw

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16495/10/execution/node/1509/log

[liw] DAOS-16762, DAOS-17867

daosbuild3 avatar Aug 03 '25 12:08 daosbuild3

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16495/12/display/redirect

daosbuild3 avatar Aug 05 '25 03:08 daosbuild3

Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16495/12/display/redirect

daosbuild3 avatar Aug 05 '25 03:08 daosbuild3

Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16495/13/execution/node/1222/log

daosbuild3 avatar Aug 06 '25 08:08 daosbuild3

Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16495/14/execution/node/1268/log

daosbuild3 avatar Aug 08 '25 07:08 daosbuild3

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16495/16/testReport/

daosbuild3 avatar Aug 22 '25 04:08 daosbuild3

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16495/17/execution/node/1451/log

[liw] erasurecode/online_rebuild_mdtest: DAOS-17751

daosbuild3 avatar Aug 22 '25 19:08 daosbuild3

There's one known failure, erasurecode/online_rebuild_mdtest DAOS-17751.

liw avatar Aug 25 '25 00:08 liw

Sigh, okay, I forgot to carry over the Features: rebuild from https://github.com/daos-stack/daos/pull/16495/commits/248bf01320808c9fbd8463bf2212c311a1757854 to the later merge commits. The landing process is simply too long.

liw avatar Aug 27 '25 23:08 liw

Merged master to pick up https://github.com/daos-stack/daos/pull/16777 (thanks, @kccain, for the pointer). The merge commit has

Features: test_rebuild_33 test_rebuild_34

liw avatar Sep 02 '25 02:09 liw