DAOS-18192 rebuild: global resource control for rebuild
Previously, resource controls in the rebuild system, such as limits on the ULT (user-level thread) count and DMA buffer usage, were scoped per pool.
When rebuild or migration ran in multiple pools at once, each pool enforced only its own local limits. Simultaneous rebuilds in several pools could therefore consume excessive threads and memory system-wide, potentially degrading performance and system stability.
This patch reworks rebuild migration resource management to enforce global limits on the number of ULTs and the amount of DMA buffer the rebuild system can use across all pools. A centralized migration resource manager is established per target to coordinate these resources across all active pools, preventing over-allocation and reducing resource contention.
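For reviewers, the idea can be illustrated with a minimal sketch. It is not the code in this patch: the names `mig_res_mgr`, `mig_res_init`, `mig_res_try_acquire`, and `mig_res_release` are hypothetical, and the sketch assumes all callers run on the same target execution stream, so no locking is shown. It also returns `false` instead of waiting, whereas a real implementation would more likely park the requesting ULT until resources are released.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical per-target manager. One instance is shared by every pool
 * rebuilding on the target, so the limits apply across pools rather than
 * per pool.
 */
struct mig_res_mgr {
	uint32_t ult_limit; /* max in-flight migration ULTs on this target */
	uint32_t ult_inuse; /* ULT slots currently reserved */
	uint64_t dma_limit; /* max bytes of DMA buffer held by migration */
	uint64_t dma_inuse; /* bytes of DMA buffer currently reserved */
};

void
mig_res_init(struct mig_res_mgr *mgr, uint32_t ult_limit, uint64_t dma_limit)
{
	mgr->ult_limit = ult_limit;
	mgr->ult_inuse = 0;
	mgr->dma_limit = dma_limit;
	mgr->dma_inuse = 0;
}

/*
 * Reserve one ULT slot plus dma_bytes of buffer. The reservation is
 * all-or-nothing: if either limit would be exceeded, nothing is taken.
 */
bool
mig_res_try_acquire(struct mig_res_mgr *mgr, uint64_t dma_bytes)
{
	if (mgr->ult_inuse >= mgr->ult_limit)
		return false;
	/* Compare against the remaining budget to avoid overflow. */
	if (dma_bytes > mgr->dma_limit - mgr->dma_inuse)
		return false;

	mgr->ult_inuse++;
	mgr->dma_inuse += dma_bytes;
	return true;
}

/* Return one ULT slot and dma_bytes of buffer to the shared budget. */
void
mig_res_release(struct mig_res_mgr *mgr, uint64_t dma_bytes)
{
	assert(mgr->ult_inuse > 0 && mgr->dma_inuse >= dma_bytes);
	mgr->ult_inuse--;
	mgr->dma_inuse -= dma_bytes;
}
```

Because every pool draws from the same `mig_res_mgr`, a second pool that starts rebuilding while the first holds the whole budget is refused until the first releases something, which is the behavior per-pool limits could not provide.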
Steps for the author:
- [ ] Commit message follows the guidelines.
- [ ] Appropriate Features or Test-tag pragmas were used.
- [ ] Appropriate Functional Test Stages were run.
- [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
- [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.
After all prior steps are complete:
- [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).
Ticket title is 'Aurora daos_user: Failed to rebuild daos_user after single server crash.' Status is 'In Review'. Labels: 'ALCF,scrubbed_2.8'. https://daosio.atlassian.net/browse/DAOS-18192
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17267/2/testReport/
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17267/3/testReport/
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17267/3/execution/node/1355/log
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17267/4/testReport/
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-17267/7/testReport/