DAOS-17468 control: Prevent start if transparent hugepages are enabled
When the THP feature is enabled on Linux platforms, SPDK-related hugepage management in DAOS performs sub-optimally. The resulting problems involve memory accounting and fragmentation. To remedy this, refuse to start daos_server if THP is enabled on the platform, and recommend disabling THP by applying kernel command-line parameters, which take effect on reboot.
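A minimal sketch of such a start-up check follows. It is not the code in this PR. It assumes the check reads the standard Linux sysfs file `/sys/kernel/mm/transparent_hugepage/enabled`, which lists every mode with the active one in brackets (e.g. `always [madvise] never`). It also assumes that any mode other than `never` counts as "enabled". The function names `selectedTHPMode` and `checkTHPDisabled` are hypothetical. `transparent_hugepage=never` is the standard kernel command-line parameter for disabling THP.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// thpEnabledPath is the standard Linux sysfs file exposing the system-wide THP mode.
const thpEnabledPath = "/sys/kernel/mm/transparent_hugepage/enabled"

// selectedTHPMode (hypothetical helper) extracts the active mode from the sysfs
// contents, where the active mode is the bracketed entry, e.g. "always [madvise] never".
func selectedTHPMode(contents string) (string, error) {
	for _, field := range strings.Fields(contents) {
		if strings.HasPrefix(field, "[") && strings.HasSuffix(field, "]") {
			return strings.Trim(field, "[]"), nil
		}
	}
	return "", fmt.Errorf("no selected THP mode found in %q", contents)
}

// checkTHPDisabled (hypothetical helper) returns an error unless the active mode is
// "never". Treating "madvise" as enabled is an assumption made for this sketch.
func checkTHPDisabled(contents string) error {
	mode, err := selectedTHPMode(contents)
	if err != nil {
		return err
	}
	if mode != "never" {
		return fmt.Errorf("transparent hugepages are enabled (mode %q): add "+
			"transparent_hugepage=never to the kernel command line and reboot", mode)
	}
	return nil
}

func main() {
	data, err := os.ReadFile(thpEnabledPath)
	if err != nil {
		// The file is absent when the kernel lacks THP support; this sketch only reports it.
		fmt.Fprintln(os.Stderr, "could not read THP setting:", err)
		return
	}
	if err := checkTHPDisabled(string(data)); err != nil {
		// Refuse to start, as daos_server would.
		fmt.Fprintln(os.Stderr, "refusing to start:", err)
		os.Exit(1)
	}
	fmt.Println("THP disabled, start-up may proceed")
}
```

Parsing the bracketed entry, rather than matching the whole line, keeps the check independent of how many modes a given kernel version lists.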
Features: control
Steps for the author:
- [x] Commit message follows the guidelines.
- [x] Appropriate Features or Test-tag pragmas were used.
- [x] Appropriate Functional Test Stages were run.
- [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
- [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.
After all prior steps are complete:
- [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).
Ticket title is 'Prevent start if transparent hugepages are enabled'. Status is 'Blocked'. https://daosio.atlassian.net/browse/DAOS-17468
@ryon-jensen @JohnMalmberg can we please ensure that the transparent hugepages feature is disabled on all CI test runners? If not, it will create problems with DAOS and this PR will cause failures. TIA
Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/7/execution/node/1095/log
Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/7/execution/node/1086/log
Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/8/execution/node/1081/log
Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/8/execution/node/1095/log
@ryon-jensen functional tests are failing, presumably because THP is enabled on the test runner: https://jenkins.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16313/8/#showFailuresLink Does THP need to be enabled on the runner? If we find situations where THP must be enabled (e.g. VMs), we can add an override flag to skip the check.
Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/9/execution/node/1056/log
Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/9/execution/node/1113/log
Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/10/execution/node/1199/log
Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/10/execution/node/1213/log
Test stage Build RPM on EL 8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/11/execution/node/348/log
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/11/execution/node/340/log
Test stage Build RPM on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/11/execution/node/393/log
Test stage Build RPM on EL 8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/12/execution/node/296/log
Test stage Build RPM on EL 9 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/12/execution/node/297/log
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/12/execution/node/310/log
Test stage Test RPMs on EL 8.6 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net/job/daos-stack/job/daos/job/PR-16313/13/display/redirect
Test stage Functional on EL 8.8 completed with status FAILURE. https://jenkins-3.daos.hpc.amslabs.hpecorp.net//job/daos-stack/job/daos/view/change-requests/job/PR-16313/13/execution/node/1316/log
@ryon-jensen @JohnMalmberg this PR is still failing because the CI node running the functional test stage has THP enabled: https://jenkins.daos.hpc.amslabs.hpecorp.net/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-16313/13/pipeline
We currently do not have VM images built with THP disabled, and we have no reliable way to disable it given how the VM images are constructed. I do not know the ETA for having THP disabled on VMs.