
DAOS-17468 control: Prevent start if transparent hugepages are enabled

Open tanabarr opened this issue 8 months ago • 25 comments

When the transparent hugepages (THP) feature is enabled on Linux platforms, SPDK-related hugepage management in DAOS performs sub-optimally, causing problems with memory accounting and fragmentation. To remedy this, refuse to start daos_server if THP is enabled on the platform and recommend disabling THP via kernel command-line parameters, which take effect on reboot.
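For illustration only, here is a minimal Go sketch of the kind of startup check described above. It is not the code in this PR. It assumes the standard Linux sysfs file `/sys/kernel/mm/transparent_hugepage/enabled`, whose content looks like `always madvise [never]` with the active mode in brackets. The names `parseTHPMode`, `checkTHP` and `errTHPEnabled` are hypothetical, and treating every mode other than `never` as "enabled" is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// thpEnabledPath is the standard Linux sysfs file reporting the THP mode.
const thpEnabledPath = "/sys/kernel/mm/transparent_hugepage/enabled"

// errTHPEnabled is a hypothetical error, not taken from the DAOS source.
var errTHPEnabled = errors.New("transparent hugepages are enabled; " +
	"add transparent_hugepage=never to the kernel command line and reboot")

// parseTHPMode returns the active mode from sysfs content such as
// "always madvise [never]", where brackets mark the active choice.
func parseTHPMode(content string) (string, error) {
	for _, field := range strings.Fields(content) {
		if strings.HasPrefix(field, "[") && strings.HasSuffix(field, "]") {
			return strings.Trim(field, "[]"), nil
		}
	}
	return "", fmt.Errorf("no active THP mode found in %q", content)
}

// checkTHP reads the sysfs file at path and fails unless the mode is "never".
// Assumption: a missing file means THP is not available, so the check passes.
func checkTHP(path string) error {
	data, err := os.ReadFile(path)
	if errors.Is(err, os.ErrNotExist) {
		return nil
	}
	if err != nil {
		return fmt.Errorf("reading %s: %w", path, err)
	}
	mode, err := parseTHPMode(string(data))
	if err != nil {
		return err
	}
	if mode != "never" {
		return fmt.Errorf("%w (current mode: %s)", errTHPEnabled, mode)
	}
	return nil
}
```

A server entry point could call `checkTHP(thpEnabledPath)` before any hugepage setup and exit with the returned error. Keeping the parser separate from the file read makes the bracket handling easy to unit test without touching sysfs.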

Features: control

Steps for the author:

  • [x] Commit message follows the guidelines.
  • [x] Appropriate Features or Test-tag pragmas were used.
  • [x] Appropriate Functional Test Stages were run.
  • [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
  • [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

After all prior steps are complete:

  • [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).

tanabarr avatar Apr 25 '25 11:04 tanabarr

Ticket title is 'Prevent start if transparent hugepages are enabled'. Status is 'Blocked'. https://daosio.atlassian.net/browse/DAOS-17468

github-actions[bot] avatar Apr 25 '25 11:04 github-actions[bot]

@ryon-jensen @JohnMalmberg can we please ensure that the transparent hugepages feature is disabled on all CI test runners? If not, it will create problems with DAOS and this PR will cause failures. TIA

tanabarr avatar May 16 '25 12:05 tanabarr

Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/7/execution/node/1095/log

daosbuild3 avatar Jun 02 '25 17:06 daosbuild3

Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/7/execution/node/1086/log

daosbuild3 avatar Jun 02 '25 18:06 daosbuild3

Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/8/execution/node/1081/log

daosbuild3 avatar Jun 13 '25 12:06 daosbuild3

Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/8/execution/node/1095/log

daosbuild3 avatar Jun 13 '25 14:06 daosbuild3

@ryon-jensen functional tests are failing, presumably because THP is enabled on the test runner: https://jenkins.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16313/8/#showFailuresLink I wonder whether THP needs to be enabled on the runner? If we find situations where THP needs to be enabled, e.g. VMs, then we can add an override flag to skip the check.
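As a rough sketch of what such an override could look like, the Go snippet below parses a hypothetical `--allow-thp` option. The flag name and the function `parseAllowTHP` are assumptions made for illustration, and the standard-library `flag` package is used only to keep the example self-contained; it may not match how daos_server actually defines its options.

```go
package main

import (
	"flag"
	"io"
)

// parseAllowTHP reports whether the hypothetical --allow-thp override was
// passed. The flag name is an assumption and does not come from this PR.
func parseAllowTHP(args []string) (bool, error) {
	fs := flag.NewFlagSet("daos_server", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage text off stderr in this sketch
	allow := fs.Bool("allow-thp", false,
		"skip the transparent hugepage check (hypothetical)")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *allow, nil
}
```

With something like this in place, the THP check would run only when the override is absent, so VM-based runners that cannot disable THP could still start the server.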

tanabarr avatar Jun 15 '25 21:06 tanabarr

Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/9/execution/node/1056/log

daosbuild3 avatar Jun 24 '25 17:06 daosbuild3

Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/9/execution/node/1113/log

daosbuild3 avatar Jun 24 '25 22:06 daosbuild3

Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/10/execution/node/1199/log

daosbuild3 avatar Jul 11 '25 13:07 daosbuild3

Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/10/execution/node/1213/log

daosbuild3 avatar Jul 11 '25 15:07 daosbuild3

Test stage Build RPM on EL 8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/11/execution/node/348/log

daosbuild3 avatar Jul 17 '25 17:07 daosbuild3

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/11/execution/node/340/log

daosbuild3 avatar Jul 17 '25 18:07 daosbuild3

Test stage Build RPM on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/11/execution/node/393/log

daosbuild3 avatar Jul 17 '25 18:07 daosbuild3

Test stage Build RPM on EL 8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/12/execution/node/296/log

daosbuild3 avatar Jul 18 '25 10:07 daosbuild3

Test stage Build RPM on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/12/execution/node/297/log

daosbuild3 avatar Jul 18 '25 10:07 daosbuild3

Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/12/execution/node/310/log

daosbuild3 avatar Jul 18 '25 10:07 daosbuild3

Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16313/13/display/redirect

daosbuild3 avatar Jul 21 '25 11:07 daosbuild3

Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/13/execution/node/1316/log

daosbuild3 avatar Jul 21 '25 13:07 daosbuild3

@ryon-jensen @JohnMalmberg this PR is still failing because the CI node running the functional test stage has THP enabled: https://jenkins.daos.hpc.amslabs.hpecorp.net/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-16313/13/pipeline

tanabarr avatar Jul 21 '25 17:07 tanabarr

We currently do not have VM images built with THP disabled, and we don't have any reliable way to disable it given the way the VM images are constructed. I do not know what the ETA will be for having THP disabled for VMs.

JohnMalmberg avatar Jul 22 '25 17:07 JohnMalmberg