DAOS-16895 control: Show pool in degraded state only when rebuild busy

Open tanabarr opened this issue 7 months ago • 10 comments

The pool state of “Degraded” is easily misinterpreted as meaning “not perfect data protection/rebuild not completed”, which is its typical meaning in storage environments. What it really means here (at least in the most typical scenario) is that some targets are excluded. Rebuild state is tracked in a separate property.

As an immediate step, fix this by setting/displaying the pool state as “Degraded” only when rebuild is active.
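To make the intended behaviour concrete, here is a minimal Go sketch of the rule described above. It is not the actual control-plane code: `poolInfo`, `rebuildState` and `displayState` are hypothetical names, and the "Ready" fallback string is an assumption made for illustration.

```go
package main

import "fmt"

// rebuildState is a hypothetical stand-in for the rebuild status that is
// tracked separately from the pool state.
type rebuildState int

const (
	rebuildIdle rebuildState = iota
	rebuildBusy
	rebuildDone
)

// poolInfo is a hypothetical, simplified view of a pool query result.
type poolInfo struct {
	DisabledTargets uint32
	Rebuild         rebuildState
}

// displayState reports "Degraded" only while a rebuild is busy. Excluded
// targets alone no longer produce "Degraded"; the "Ready" fallback is an
// assumption for this sketch.
func displayState(pi poolInfo) string {
	if pi.Rebuild == rebuildBusy {
		return "Degraded"
	}
	return "Ready"
}

func main() {
	// Targets excluded, rebuild idle.
	fmt.Println(displayState(poolInfo{DisabledTargets: 2, Rebuild: rebuildIdle}))
	// Targets excluded, rebuild in progress.
	fmt.Println(displayState(poolInfo{DisabledTargets: 2, Rebuild: rebuildBusy}))
}
```

The point of the sketch is that the number of excluded targets does not feed into the displayed state at all; only the rebuild status does.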

Features: control

Steps for the author:

  • [ ] Commit message follows the guidelines.
  • [ ] Appropriate Features or Test-tag pragmas were used.
  • [ ] Appropriate Functional Test Stages were run.
  • [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).

tanabarr avatar Jun 10 '25 16:06 tanabarr

Ticket title is 'Rename pool state "Degraded" to "TargetsExcluded"' Status is 'In Progress' Labels: 'LRZ' https://daosio.atlassian.net/browse/DAOS-16895

github-actions[bot] avatar Jun 10 '25 16:06 github-actions[bot]

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16497/2/testReport/

daosbuild3 avatar Jun 19 '25 10:06 daosbuild3

Updated PR to change the state from Degraded to TargetsExcluded to avoid conveying misinformation.

tanabarr avatar Jun 19 '25 11:06 tanabarr

@daltonbohning @phender Regarding test references to "Degraded": I think I've updated the relevant string references, but do you think we need to change any of the test names as part of this change, e.g. DAOS_Degraded_Mode or daos_degraded.c?

tanabarr avatar Jun 19 '25 11:06 tanabarr

> @daltonbohning @phender regarding test references to "Degraded" I think I've updated the relevant string references but do you think we need to change any of the test names with regard to this change? e.g. DAOS_Degraded_Mode or daos_degraded.c

I think that depends on whether Degraded still makes sense in general to the devs writing those tests.

daltonbohning avatar Jun 23 '25 16:06 daltonbohning

Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16497/3/execution/node/1123/log

daosbuild3 avatar Jun 25 '25 09:06 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16497/4/execution/node/754/log

daosbuild3 avatar Jun 25 '25 19:06 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16497/5/execution/node/441/log

daosbuild3 avatar Jun 26 '25 08:06 daosbuild3

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16497/6/execution/node/441/log

daosbuild3 avatar Jun 26 '25 15:06 daosbuild3

CI runs passed apart from 5 unrelated EC failures which are presumably intermittent issues.

tanabarr avatar Jun 26 '25 22:06 tanabarr

CI run nr 8 passed so requesting landing...

tanabarr avatar Jul 10 '25 13:07 tanabarr