
DAOS-17661 control: Maintain hugepage allocations with nvme-rebind

Open tanabarr opened this issue 8 months ago • 5 comments

The dmg storage nvme-rebind command can be used during non-VMD hotplug when a "new" SSD is hot-plugged into a slot that previously held a faulty SSD. Errors creating a new SPDK I/O channel when running dmg storage replace nvme have been attributed to the inadvertent shrinking of the kernel hugepage allocation used by SPDK during the nvme-rebind call. This change addresses the problem by maintaining the number of allocated hugepages across nvme-rebind.
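
As a rough illustration of the idea described above (not the code in this PR), a guard could record HugePages_Total before the rebind, hand that count to the rebind step as its target, and check it again afterwards. The function names and the rebind callback are hypothetical; the only real interface assumed is the HugePages_Total field in /proc/meminfo.

```python
import re
from typing import Callable


def parse_hugepages_total(meminfo_text: str) -> int:
    """Return HugePages_Total from /proc/meminfo-style text."""
    match = re.search(r"^HugePages_Total:\s+(\d+)", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("HugePages_Total not found in meminfo text")
    return int(match.group(1))


def rebind_with_hugepage_guard(
    read_meminfo: Callable[[], str],
    rebind: Callable[[int], None],
) -> int:
    """Run a rebind step without letting the hugepage allocation shrink.

    read_meminfo: returns the current /proc/meminfo contents.
    rebind: hypothetical callback that performs the rebind; it receives the
            pre-existing hugepage count so it can request the same number
            instead of a smaller default.
    Returns the hugepage count observed after the rebind.
    """
    before = parse_hugepages_total(read_meminfo())
    rebind(before)
    after = parse_hugepages_total(read_meminfo())
    if after < before:
        raise RuntimeError(
            f"hugepage allocation shrank during rebind: {before} -> {after}"
        )
    return after
```

On a real host, read_meminfo could simply read /proc/meminfo. How the rebind step actually receives and applies the count is the part this PR implements, and it is not reproduced here.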

Features: control

Steps for the author:

  • [ ] Commit message follows the guidelines.
  • [ ] Appropriate Features or Test-tag pragmas were used.
  • [ ] Appropriate Functional Test Stages were run.
  • [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).

tanabarr Jun 09 '25 20:06

Ticket title is 'Command to rebind NVMe SSD to userspace driver shrinks hugepage allocation'
Status is 'In Review'
Labels: 'SPDK,hotplug'
https://daosio.atlassian.net/browse/DAOS-17661

github-actions[bot] Jun 09 '25 20:06

Test stage Functional Hardware Medium Verbs Provider MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16493/2/execution/node/1382/log

daosbuild3 Jun 15 '25 10:06

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16493/2/execution/node/1337/log

daosbuild3 Jun 15 '25 11:06

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16493/2/execution/node/1427/log

daosbuild3 Jun 15 '25 12:06

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16493/4/testReport/

daosbuild3 Jun 25 '25 14:06

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16493/4/execution/node/1485/log

daosbuild3 Jul 03 '25 09:07

Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos//view/change-requests/job/PR-16493/5/testReport/

daosbuild3 Jul 14 '25 18:07

Apologies for the force-push; I couldn't get the child PRs in the stack to merge cleanly. No changes, as the rebase applied cleanly with no conflicts. TIA

tanabarr Jul 15 '25 16:07

This PR is needed for non-VMD hotplug (CP req).

tanabarr Jul 15 '25 16:07

Test stage Build RPM on EL 8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16493/13/execution/node/307/log

daosbuild3 Jul 18 '25 10:07

Test stage Build RPM on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16493/13/execution/node/306/log

daosbuild3 Jul 18 '25 11:07

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16493/13/execution/node/322/log

daosbuild3 Jul 18 '25 11:07

https://jenkins.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16493/15/ passed all CI test stages

tanabarr Jul 24 '25 13:07

reviews please

tanabarr Jul 26 '25 20:07

CI run 16 passed all test stages.

tanabarr Jul 31 '25 12:07