
DAOS-18192 rebuild: global resource control for rebuild

Open gnailzenh opened this issue 1 month ago • 5 comments

Previously, the resource controls within the rebuild system, such as the limits on ULT (user-level thread) counts and DMA buffer usage, were scoped per pool.

This meant that when rebuild or migration operations occurred across multiple pools, each pool operated within its own local resource boundaries. As a result, simultaneous rebuilds in several pools could lead to excessive system-wide consumption of threads and memory, potentially impacting performance and system stability.

This patch reworks the rebuild migration resource management to introduce global limits on the number of ULTs and DMA buffers the rebuild system can use across all pools. A centralized migration resource manager is established per target to coordinate these resources across all active pools, preventing overallocation and minimizing resource contention.
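To illustrate the idea, here is a minimal sketch of a per-target manager that enforces one shared ULT budget and one shared DMA budget across all pools. It is not the code in this patch: the names `mig_res_mgr`, `mig_res_init`, `mig_res_try_acquire`, and `mig_res_release` are hypothetical, and the sketch assumes all callers on a target run cooperatively on a single execution stream, so no locking is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical per-target migration resource manager (illustration only).
 * One instance is shared by every pool rebuilding on the target, so the
 * limits apply to the sum of all pools rather than to each pool alone.
 */
struct mig_res_mgr {
	uint32_t ult_limit; /* max in-flight migration ULTs, all pools combined */
	uint32_t ult_inuse; /* ULT slots currently reserved */
	uint64_t dma_limit; /* max DMA buffer bytes, all pools combined */
	uint64_t dma_inuse; /* DMA bytes currently reserved */
};

void
mig_res_init(struct mig_res_mgr *mgr, uint32_t ult_limit, uint64_t dma_limit)
{
	mgr->ult_limit = ult_limit;
	mgr->ult_inuse = 0;
	mgr->dma_limit = dma_limit;
	mgr->dma_inuse = 0;
}

/*
 * Reserve one ULT slot plus dma_bytes of DMA buffer, all or nothing.
 * Returns false when either global limit would be exceeded; a real caller
 * would then wait or yield until another migration releases its share.
 */
bool
mig_res_try_acquire(struct mig_res_mgr *mgr, uint64_t dma_bytes)
{
	if (mgr->ult_inuse >= mgr->ult_limit)
		return false;
	if (dma_bytes > mgr->dma_limit - mgr->dma_inuse)
		return false;
	mgr->ult_inuse++;
	mgr->dma_inuse += dma_bytes;
	return true;
}

/* Return one ULT slot and dma_bytes to the shared budget. */
void
mig_res_release(struct mig_res_mgr *mgr, uint64_t dma_bytes)
{
	assert(mgr->ult_inuse > 0 && mgr->dma_inuse >= dma_bytes);
	mgr->ult_inuse--;
	mgr->dma_inuse -= dma_bytes;
}
```

With per-pool limits, N pools rebuilding at once could consume up to N times the configured budget on a target. With a shared manager like the one sketched here, a request from any pool is refused once the target-wide budget is exhausted, regardless of which pool holds the reservations.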

Steps for the author:

  • [ ] Commit message follows the guidelines.
  • [ ] Appropriate Features or Test-tag pragmas were used.
  • [ ] Appropriate Functional Test Stages were run.
  • [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).

gnailzenh, Dec 12 '25 13:12

Ticket title is 'Aurora daos_user: Failed to rebuild daos_user after single server crash.' Status is 'In Review'. Labels: 'ALCF,scrubbed_2.8'. https://daosio.atlassian.net/browse/DAOS-18192

github-actions[bot], Dec 12 '25 13:12

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17267/2/testReport/

daosbuild3, Dec 12 '25 14:12

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17267/3/testReport/

daosbuild3, Dec 12 '25 15:12

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17267/3/execution/node/1355/log

daosbuild3, Dec 12 '25 23:12

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17267/4/testReport/

daosbuild3, Dec 13 '25 10:12

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17267/7/testReport/

daosbuild3, Dec 18 '25 09:12