DAOS-17727 pool: add a new option to pool reintegration
During system maintenance windows, some ranks might be excluded due to switch reboots or network instability (while the number of failed ranks remains below the redundancy factor).
Although rebuild is triggered in such cases, data is at risk if additional ranks fail (e.g., due to SSD failures) while the rebuild is still in progress.
To mitigate this risk and avoid complexity for data recovery, we propose adding a new reintegration option (--no-migration) that skips data migration and directly brings the previously down ranks back online. Crucially, the rebuild reclaim phase must still be performed to free space from prior incomplete rebuild operations. This approach is safe only if we can confirm there is no inflight I/O during maintenance windows.
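The decision described above can be sketched as a small phase planner. This is only an illustrative model, not code from this PR: the names `reint_phase` and `reint_plan` are hypothetical, and two points are assumptions. First, that a regular reintegration runs migration followed by reclaim. Second, that a no-migration request is refused when inflight I/O cannot be ruled out; the description only states that the option is unsafe in that case.

```c
#include <stdbool.h>

/* Hypothetical phase flags for illustration; these are not DAOS identifiers. */
enum reint_phase {
	REINT_PHASE_MIGRATE = 1 << 0, /* copy data back to the returning ranks */
	REINT_PHASE_RECLAIM = 1 << 1, /* free space left by incomplete rebuilds */
};

/*
 * Return the set of phases to run for a reintegration request,
 * or -1 if the request should be refused.
 */
int
reint_plan(bool no_migration, bool inflight_io)
{
	/* Assumed default path: migrate data, then reclaim stale space. */
	if (!no_migration)
		return REINT_PHASE_MIGRATE | REINT_PHASE_RECLAIM;

	/* Assumption: refuse when inflight I/O cannot be ruled out. */
	if (inflight_io)
		return -1;

	/* Migration is skipped, but reclaim must still run. */
	return REINT_PHASE_RECLAIM;
}
```

For example, `reint_plan(true, false)` yields only the reclaim phase, which matches the requirement that reclaim is still performed when migration is skipped.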
Steps for the author:
- [ ] Commit message follows the guidelines.
- [ ] Appropriate Features or Test-tag pragmas were used.
- [ ] Appropriate Functional Test Stages were run.
- [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
- [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.
After all prior steps are complete:
- [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).
Ticket title is 'Proposal for Safe Rank Reintegration During Maintenance Windows'. Status is 'In Review'. https://daosio.atlassian.net/browse/DAOS-17727
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16538/1/testReport/
@liuxuezhao @kccain
I tested the PR with 5 ranks: I created a pool with rd_fac:2 (DAOS_POOL_RF=2), then terminated two ranks. After confirming via 'dmg pool query -t' that rebuild had started (dead_ranks/disabled_ranks appeared as expected), I terminated a third rank. After restarting the first two ranks, I executed "dmg pool reintegration --no-migrations --ranks <>" and observed that the rebuild completed successfully, with data consistency verified.

Regarding the code behavior: when rebuild completes, update_one_tgt() handles MAP_FINISH_REBUILD as follows:

    case MAP_FINISH_REBUILD:
        switch (target->ta_comp.co_status) {
        case PO_COMP_ST_UPIN:
        case PO_COMP_ST_DOWNOUT:
        case PO_COMP_ST_NEW:
            /* Nothing to do */
            D_INFO(DF_MAP ": Skip FINISH_REBUILD " DF_TARGET "\n",
                   DP_MAP(pool_uuid, map), DP_TARGET(target));
            break;

Since reintegration sets targets to UPIN status, skipping the update is correct. The subsequent rebuild_task_complete_schedule() properly initiates RECLAIM.
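To make the skip condition explicit, here is a minimal standalone model of the quoted branch. It is a sketch, not the DAOS source: `tgt_status` and `finish_rebuild_is_noop` are hypothetical stand-ins, and the enum only mirrors the PO_COMP_ST_* names in simplified form. The model covers only which states are skipped and deliberately says nothing about what happens in the other states, since the quoted excerpt does not show them.

```c
#include <stdbool.h>

/* Simplified stand-ins for the PO_COMP_ST_* target states (hypothetical). */
enum tgt_status {
	TGT_ST_UP,
	TGT_ST_UPIN,
	TGT_ST_DOWN,
	TGT_ST_DOWNOUT,
	TGT_ST_NEW,
	TGT_ST_DRAIN,
};

/*
 * Return true when a FINISH_REBUILD update leaves the target untouched,
 * following the three states listed in the quoted switch.
 */
bool
finish_rebuild_is_noop(enum tgt_status status)
{
	switch (status) {
	case TGT_ST_UPIN:
	case TGT_ST_DOWNOUT:
	case TGT_ST_NEW:
		return true; /* "Nothing to do" in the quoted code */
	default:
		return false; /* handled by branches not shown in the excerpt */
	}
}
```

Under this model, a target that reintegration has already moved to UPIN reports `true`, which matches the observation that FINISH_REBUILD is skipped for it.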
Should we return an error here, or maintain the current behavior?
Please hold off on review; the PR has not passed testing yet.
Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16538/10/execution/node/1338/log
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16538/11/testReport/