DAOS-17843 rebuild: fix potential migrate ULT counter leak

Open liuxuezhao opened this issue 3 months ago • 3 comments

After the ULT counter is incremented, the ULT may not yet have been scheduled when the rebuild is aborted. The abort causes migrate_fini_one_ult to destroy migrate_pool_tls, so migrate_obj_ult/migrate_one_ult can no longer drop the ULT counter, and the rebuild can never be treated as complete because total_ult_cnt stays non-zero. This PR fixes that by passing the ULT counter pointer to the migrate ULT, so dropping the counter no longer depends on a migrate_pool_tls lookup.
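The idea can be illustrated with a minimal, single-threaded sketch. It is not the actual DAOS code: `struct ult_counter`, `struct pool_tls`, `struct migrate_arg`, and the three functions are hypothetical stand-ins, and the sketch assumes the counter lives in an object that outlives the per-pool TLS. The only name taken from the description above is the `total_ult_cnt` field.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical counter, assumed to be owned by something that outlives the TLS. */
struct ult_counter {
	int total_ult_cnt;
};

/* Hypothetical stand-in for migrate_pool_tls. */
struct pool_tls {
	struct ult_counter *counter;
};

/* Hypothetical argument handed to the migrate ULT when it is created. */
struct migrate_arg {
	struct ult_counter *counter; /* captured up front, so no TLS lookup later */
};

/* Increment the counter and capture its address in the ULT argument. */
static struct migrate_arg *
migrate_ult_prepare(struct pool_tls *tls)
{
	struct migrate_arg *arg = malloc(sizeof(*arg));

	if (arg == NULL)
		return NULL;
	tls->counter->total_ult_cnt++;
	arg->counter = tls->counter;
	return arg;
}

/* ULT body: drops the counter through the captured pointer only. */
static void
migrate_ult_body(struct migrate_arg *arg)
{
	arg->counter->total_ult_cnt--;
	free(arg);
}

/*
 * Simulate an abort that destroys the TLS after the counter was
 * incremented but before the ULT ran. Returns the final counter value,
 * or -1 on allocation failure.
 */
static int
simulate_abort_before_schedule(void)
{
	struct ult_counter counter = {0};
	struct pool_tls   *tls = malloc(sizeof(*tls));
	struct migrate_arg *arg;

	if (tls == NULL)
		return -1;
	tls->counter = &counter;

	arg = migrate_ult_prepare(tls);
	if (arg == NULL) {
		free(tls);
		return -1;
	}

	free(tls);             /* abort: TLS is gone before the ULT is scheduled */
	migrate_ult_body(arg); /* the ULT can still drop the counter */
	return counter.total_ult_cnt;
}
```

In this sketch the ULT never touches `tls` after creation, so destroying the TLS early cannot leave the counter stuck above zero. If the ULT instead had to look up the TLS to find the counter, the decrement would be skipped once the TLS was destroyed.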

Steps for the author:

  • [ ] Commit message follows the guidelines.
  • [ ] Appropriate Features or Test-tag pragmas were used.
  • [ ] Appropriate Functional Test Stages were run.
  • [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).

liuxuezhao, Nov 19 '25 03:11

Ticket title is 'Pool rebuild stuck in pulling for 5+ hours' Status is 'In Progress' Labels: 'ALCF,hpe_cluster' https://daosio.atlassian.net/browse/DAOS-17843

github-actions[bot], Nov 19 '25 03:11

Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17146/2/display/redirect

daosbuild3, Nov 19 '25 04:11

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17146/3/display/redirect

daosbuild3, Nov 19 '25 05:11