DAOS-17727 pool: add a new option to pool reintegration

Open wangshilong opened this issue 6 months ago • 2 comments

During system maintenance windows, some ranks might be excluded due to switch reboots or network instability (while the number of failed ranks remains below the redundancy factor).

Although rebuild is triggered in such cases, data safety is at risk if additional ranks fail (e.g., due to SSD failures) while the rebuild is still in progress.

To mitigate this risk and avoid complex data recovery, we propose adding a new reintegration option (--no-migration) that skips data migration and brings the previously down ranks directly back online. Crucially, the rebuild reclaim phase must still be performed to free the space left by prior incomplete rebuild operations. This approach is safe only if we can confirm that there was no in-flight I/O during the maintenance window.
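The intended control flow can be sketched as a small phase planner. This is only an illustration of the proposal, not code from the PR: `reint_phase`, `reint_request`, `reint_plan()` and the `io_quiesced` field are hypothetical names, and refusing the request when I/O was not confirmed quiesced is an assumed policy derived from the safety condition above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical phases of a reintegration; these are not DAOS identifiers. */
enum reint_phase {
	REINT_PHASE_MIGRATE = 1 << 0, /* move data back onto the returning ranks */
	REINT_PHASE_RECLAIM = 1 << 1, /* free space left by incomplete rebuilds */
};

/* Hypothetical request; no_migration mirrors the proposed option. */
struct reint_request {
	bool no_migration; /* skip the data migration phase */
	bool io_quiesced;  /* operator confirmed there was no in-flight I/O */
};

/*
 * Decide which phases to run. Returns 0 and fills *phases on success.
 * Returns -1 (assumed policy) when migration would be skipped without
 * confirmation that no I/O was in flight; *phases is left untouched.
 */
int
reint_plan(const struct reint_request *req, unsigned int *phases)
{
	assert(req != NULL && phases != NULL);

	if (req->no_migration && !req->io_quiesced)
		return -1;

	/* Reclaim always runs, with or without migration. */
	*phases = REINT_PHASE_RECLAIM;
	if (!req->no_migration)
		*phases |= REINT_PHASE_MIGRATE;
	return 0;
}
```

With `no_migration` and `io_quiesced` both set, the plan contains only the reclaim phase; a default request yields both migration and reclaim.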

Steps for the author:

  • [ ] Commit message follows the guidelines.
  • [ ] Appropriate Features or Test-tag pragmas were used.
  • [ ] Appropriate Functional Test Stages were run.
  • [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).

wangshilong, Jun 25 '25 04:06

Ticket title is 'Proposal for Safe Rank Reintegration During Maintenance Windows' Status is 'In Review' https://daosio.atlassian.net/browse/DAOS-17727

github-actions[bot], Jun 25 '25 04:06

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16538/1/testReport/

daosbuild3, Jun 25 '25 04:06

@liuxuezhao @kccain

I tested the PR with 5 ranks: I created a pool with rd_fac:2 (DAOS_POOL_RF=2), then terminated two ranks. After confirming that rebuild had started via 'dmg pool query -t' (dead_ranks/disabled_ranks appeared as expected), I terminated a third rank. After restarting the first two ranks, I ran "dmg pool reintegration --no-migrations --ranks <>" and observed the rebuild complete successfully, with data consistency verified.

Regarding the code behavior: when rebuild completes, update_one_tgt() handles the MAP_FINISH_REBUILD case as follows:

    case MAP_FINISH_REBUILD:
        switch (target->ta_comp.co_status) {
        case PO_COMP_ST_UPIN:
        case PO_COMP_ST_DOWNOUT:
        case PO_COMP_ST_NEW:
            /* Nothing to do */
            D_INFO(DF_MAP ": Skip FINISH_REBUILD " DF_TARGET "\n",
                   DP_MAP(pool_uuid, map), DP_TARGET(target));
            break;

Since reintegration sets targets to UPIN status, skipping the update is correct. The subsequent rebuild_task_complete_schedule() properly initiates RECLAIM.

Should we return an error here, or maintain the current behavior?
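To make the two options in this question concrete, here is a toy model of the decision. It is a sketch, not the DAOS implementation: the `TGT_*` values only echo the `PO_COMP_ST_*` names, and `finish_rebuild_check()`, `finish_result` and the `strict` flag are hypothetical. It models only which statuses are skipped, as in the excerpt above, and not what update the other statuses receive.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for target statuses (hypothetical, not the DAOS enum). */
enum tgt_status { TGT_UP, TGT_UPIN, TGT_DOWN, TGT_DOWNOUT, TGT_NEW, TGT_DRAIN };

enum finish_result {
	FINISH_APPLIED, /* status still needs a FINISH_REBUILD update */
	FINISH_SKIPPED, /* nothing to do: the current behavior */
	FINISH_ERROR,   /* the alternative: reject instead of skipping */
};

/*
 * Classify a FINISH_REBUILD request for one target.
 * strict == false models the current "log and skip" behavior;
 * strict == true models returning an error for the same statuses.
 */
enum finish_result
finish_rebuild_check(enum tgt_status status, bool strict)
{
	switch (status) {
	case TGT_UPIN:
	case TGT_DOWNOUT:
	case TGT_NEW:
		return strict ? FINISH_ERROR : FINISH_SKIPPED;
	default:
		return FINISH_APPLIED;
	}
}
```

Under this model, a target that reintegration has already set to UPIN is skipped in the non-strict case, so a strict variant would turn the --no-migrations path into an error unless it were special-cased.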

wangshilong, Jul 07 '25 13:07

Please hold off on review; the PR has not passed testing yet.

wangshilong, Jul 08 '25 11:07

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16538/10/execution/node/1338/log

daosbuild3, Jul 09 '25 05:07

Test stage NLT on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16538/11/testReport/

daosbuild3, Jul 09 '25 11:07