DAOS-17843 rebuild: fix potential migrate ULT counter leak

Open liuxuezhao opened this issue 3 months ago • 13 comments

After the counter is incremented, the ULT may not yet have been scheduled when the rebuild is aborted. The abort causes migrate_pool_tls to be destroyed by migrate_fini_one_ult, so migrate_obj_ult/migrate_one_ult can no longer drop the ULT counter, and the rebuild can never be treated as complete because total_ult_cnt stays non-zero. This PR fixes that by passing the ULT counter pointer to the migrate ULT, so dropping the counter no longer depends on a migrate_pool_tls lookup.
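
The leak pattern described above can be sketched in isolation. This is a minimal model, not DAOS code: `rebuild_state`, `pool_tls`, `tls_lookup`, `tls_destroy`, `ult_drop_by_lookup`, `ult_drop_by_pointer`, and `run_abort_scenario` are all hypothetical names, and the sketch assumes the counter lives in storage that outlives the TLS object. It only illustrates why a lookup-based drop can be skipped after an abort while a pointer-based drop cannot; a later comment in this thread states that the PR does not fix the reported problem.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: the completion check reads this counter. */
struct rebuild_state {
	unsigned int total_ult_cnt;
};

/* Hypothetical stand-in for migrate_pool_tls. */
struct pool_tls {
	struct rebuild_state *state;
};

/* Simulated single-slot lookup table; NULL once the TLS is destroyed. */
static struct pool_tls *tls_slot;

static struct pool_tls *tls_lookup(void)
{
	return tls_slot;
}

/* Abort path: the TLS goes away before the queued ULT has run. */
static void tls_destroy(void)
{
	tls_slot = NULL;
}

/* Leaky pattern: the drop depends on the lookup succeeding. */
static void ult_drop_by_lookup(void)
{
	struct pool_tls *tls = tls_lookup();

	if (tls == NULL)
		return; /* the counter is never dropped */
	tls->state->total_ult_cnt--;
}

/* Pointer-passing pattern: the ULT drops the counter it was handed. */
static void ult_drop_by_pointer(unsigned int *ult_cnt)
{
	assert(*ult_cnt > 0);
	(*ult_cnt)--;
}

/*
 * Replays the sequence: count the ULT at creation, abort before it is
 * scheduled, then let the ULT body run. Returns the remaining count.
 */
static unsigned int run_abort_scenario(int pass_pointer)
{
	struct rebuild_state state = { .total_ult_cnt = 0 };
	struct pool_tls tls = { .state = &state };
	unsigned int *cnt = &state.total_ult_cnt;

	tls_slot = &tls;
	state.total_ult_cnt++; /* counter added when the ULT is created */
	tls_destroy();         /* rebuild aborted before the ULT is scheduled */

	if (pass_pointer)
		ult_drop_by_pointer(cnt);
	else
		ult_drop_by_lookup();

	return state.total_ult_cnt;
}
```

Calling `run_abort_scenario(0)` leaves the count at 1 (the leak), while `run_abort_scenario(1)` brings it back to 0. The pointer-based variant is only safe if the counter's storage is guaranteed to outlive every ULT that holds the pointer, which this sketch assumes rather than enforces.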

Steps for the author:

  • [ ] Commit message follows the guidelines.
  • [ ] Appropriate Features or Test-tag pragmas were used.
  • [ ] Appropriate Functional Test Stages were run.
  • [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).

liuxuezhao, Nov 19 '25 04:11

Ticket title is 'Pool rebuild stuck in pulling for 5+ hours' Status is 'In Progress' Labels: 'ALCF,hpe_cluster' https://daosio.atlassian.net/browse/DAOS-17843

github-actions[bot], Nov 19 '25 04:11

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17149/1/execution/node/301/log

daosbuild3, Nov 19 '25 04:11

Test stage Build on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17149/1/execution/node/317/log

daosbuild3, Nov 19 '25 04:11

Test stage Build on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17149/1/execution/node/309/log

daosbuild3, Nov 19 '25 04:11

Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17149/2/execution/node/302/log

daosbuild3, Nov 19 '25 04:11

Test stage Build on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17149/2/execution/node/318/log

daosbuild3, Nov 19 '25 04:11

Test stage Build on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17149/2/execution/node/310/log

daosbuild3, Nov 19 '25 04:11

Test stage Build on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-17149/2/execution/node/406/log

daosbuild3, Nov 19 '25 04:11

Test stage Unit Test bdev with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17149/3/display/redirect

daosbuild3, Nov 19 '25 04:11

Test stage NLT on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17149/3/display/redirect

daosbuild3, Nov 19 '25 04:11

Test stage Unit Test bdev on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17149/3/display/redirect

daosbuild3, Nov 19 '25 04:11

Test stage Unit Test with memcheck on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-17149/3/display/redirect

daosbuild3, Nov 19 '25 04:11

The changes to decrement the counts look good. However, I don't fully follow where in the code a rebuild with nonzero ULT count(s) would be treated as not complete. Does that also mean it would hang?

This PR cannot fix the problem, so it is not going to land.

liuxuezhao, Nov 25 '25 02:11