DAOS-16979 control: Reduce frequency of hugepage allocation at runtime
Reduce the frequency of hugepage allocation change requests made to the kernel. During daos_server start-up, check the total number of hugepages and only request a change from the kernel if the recommended number, calculated from the server config file content, is greater than the existing system total. Most importantly, never shrink an existing allocation.
The result of this change should be to reduce the chance of hugepage memory fragmentation by reducing the frequency of kernel hugepage allocations. This should in turn reduce the chances of DMA buffer allocations failing due to memory fragmentation.
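Roughly, the start-up decision amounts to a grow-only check like the following (a minimal sketch with illustrative names, not the actual function signatures):

```go
package hugeutil

// UpdateNrHugepages decides whether a hugepage allocation change should be
// requested from the kernel at daos_server start-up. recommended is the
// count calculated from the server config file content; systemTotal is the
// current system-wide total. An existing allocation is never shrunk.
// (Illustrative sketch only; the real logic lives in SetNrHugepages and
// related functions.)
func UpdateNrHugepages(recommended, systemTotal int) (target int, changeNeeded bool) {
	if systemTotal >= recommended {
		// Enough hugepages are already allocated: keep the existing
		// total and make no request to the kernel.
		return systemTotal, false
	}
	// Grow the allocation to the recommended number; never shrink it.
	return recommended, true
}
```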
The SPDK setup script (in v22.01) doesn't have an option to skip hugepage allocation while performing NVMe driver rebinding. To work around this, per-NUMA meminfo is used to request the current values when calling into the script in the case that no changes are required. This essentially results in a no-op rather than shrinking or growing the allocation.
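For illustration, the current per-NUMA values handed back to the setup script can be read from sysfs along these lines (hypothetical helper; the standard Linux sysfs path for 2MiB pages is assumed):

```go
package hugeutil

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// NrHugepagesOnNode reads the current 2MiB hugepage count for one NUMA node
// from sysfs. Feeding these unchanged values back to the setup script makes
// its allocation step effectively a no-op. (Hypothetical helper.)
func NrHugepagesOnNode(node int) (int, error) {
	path := fmt.Sprintf(
		"/sys/devices/system/node/node%d/hugepages/hugepages-2048kB/nr_hugepages",
		node)
	b, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.Atoi(strings.TrimSpace(string(b)))
}
```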
Main changes are in SetNrHugepages, SetHugeNodes and setEngineMemSize.
Before requesting gatekeeper:
- [x] Two review approvals and any prior change requests have been resolved.
- [x] Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
- [x] Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
- [x] Commit messages follow the guidelines outlined here.
- [x] Any tests skipped by the ticket being addressed have been run and passed in the PR.
Gatekeeper:
- [ ] You are the appropriate gatekeeper to be landing the patch.
- [ ] The PR has 2 reviews by people familiar with the code, including appropriate owners.
- [ ] Githooks were used. If not, request that user install them and check copyright dates.
- [ ] Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
- [ ] All builds have passed. Check non-required builds for any new compiler warnings.
- [ ] Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
- [ ] If applicable, the PR has addressed any potential version compatibility issues.
- [ ] Check the target branch. If it is the master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
- [ ] Extra checks if forced landing is requested
- [ ] Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
- [ ] No new NLT or valgrind warnings. Check the classic view.
- [ ] Quick-build or Quick-functional is not used.
- [ ] Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.
Ticket title is 'Mitigation against hugepage memory fragmentation' Status is 'In Review' Labels: 'SPDK' https://daosio.atlassian.net/browse/DAOS-16979
@phender as discussed, in order to try to reproduce the DMA grow failure (DAOS-16979) related to hugepage fragmentation, I ran this PR with Test-tag-hw-medium: pr daily_regression to try to trigger the failure. Unfortunately https://build.hpdd.intel.com/blue/organizations/jenkins/daos-stack%2Fdaos/detail/PR-15848/1/pipeline/ didn't hit the issue. Any other ideas on how to get a baseline to prove a fix? Or any other approaches, like just landing a fix and seeing if it has the desired result? I'm going to try to generate a local reproducer as well.
If I understand the ticket and the PR changes correctly, this patch aims to reduce CI instability by reserving a hard-coded maximum amount of hugepages at startup, regardless of the number of bdevs in the configuration.
If that is correct, then I'm not sure I agree with the approach. Changing the product code to work around quirks of the CI testing process seems wrong to me. Instead, it seems like it would be best to ensure that the first configuration of the server has the maximum number of hugepages allocated for the rest of the run. This is a CI problem rather than a product problem, and therefore the solution should be implemented in the test harness, IMO.
I don't think this issue is restricted just to CI; IIRC this has been seen outside of our test infrastructure. @NiuYawei requested this change, so maybe it's appropriate that he responds to your objection.
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/2/execution/node/1420/log
FWIW I've seen similar on Aurora after a fresh reboot: https://daosio.atlassian.net/browse/DAOS-16921?focusedCommentId=135440 And I've only seen that with master, not 2.6.
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/2/execution/node/1565/log
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/2/execution/node/1518/log
I don't doubt that there is a problem... My concern is more that it seems like the actual problem is not yet understood, and the proposed approach in this PR is a potential solution for a very specific set of scenarios. Adding a hard-coded configuration for hugepages kind of defeats the purpose of having a configuration mechanism, and it seems likely to cause unintended problems for configurations that are outside of what's being hard-coded in this PR.
Yes, there could be other unknown issues to be solved (as @daltonbohning mentioned, the allocation failure was seen after a fresh reboot, when the memory isn't supposed to be fragmented), but allocating hugepages at run time (setting nr_hugepages) is believed to be a likely source of fragmentation.
I think our goal is to avoid allocating hugepages at run time when possible, whether on a production or a testing system.
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/3/execution/node/1220/log
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/3/execution/node/1429/log
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/3/execution/node/1522/log
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/3/execution/node/1569/log
I think tests have probably finished, but Jenkins seems to be inaccessible because of a certificate issue.
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/4/execution/node/1572/log
So if I'm understanding the motivation correctly, the problem is that people sometimes change the storage/engine config in a way that affects hugepages after booting up the system, and then restart the daos_server? Is that correct, or am I missing something?
Yes, memory can become fragmented, causing DMA buffer allocations to fail, if frequent reallocation of hugepages happens. Reallocation occurs on daos_server start-up (currently every time), and the advice from the kernel documentation is to allocate once to reduce the chance of fragmentation: https://docs.kernel.org/admin-guide/mm/hugetlbpage.html
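For reference, the system-wide total that a start-up check compares against is what the kernel reports as HugePages_Total in /proc/meminfo; a minimal sketch of reading it (hypothetical helper, not the actual control-plane code):

```go
package hugeutil

import (
	"bufio"
	"os"
	"strconv"
	"strings"
)

// HugePagesTotal returns the system-wide hugepage count the kernel reports
// in /proc/meminfo, i.e. the existing total a start-up check would compare
// the recommended number against. (Hypothetical helper.)
func HugePagesTotal() (int, error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0, err
	}
	defer f.Close()

	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) >= 2 && fields[0] == "HugePages_Total:" {
			return strconv.Atoi(fields[1])
		}
	}
	return 0, s.Err()
}
```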
Always allocating a large number of hugepages immediately after reboot feels a bit weird to me. And it seems peculiar for a user to be twiddling those configuration knobs beyond the initial system setup/tuning--is this something that really happens?
I guess the counterpoint to this question is that this may well be what the user does for one reason or another, and if so then we want to try to prevent memory fragmentation from causing the engine to fail. One way to do this is as implemented here, where we allocate a large amount on first start in order to prevent reallocation requests in the majority of cases. I can't see any major downsides to doing this vs. what we have currently.
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/6/execution/node/1176/log
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/6/execution/node/1430/log
Sure, I get that part. Is the fragmentation potentially happening even if the user didn't change their config? I guess that's what I'm trying to understand, what user behavior (if any) triggers this issue.
The main concern I have is that a lot of memory is set aside for potentially-unnecessary hugepages. A hard-coded 34GB is a lot to reserve if the instance of DAOS actually needs, say, 16GB, or even less. Once allocated, those pages aren't available for system memory use anymore, right? It just seems excessive as a default.
FWIW I've seen the DMA allocation failures on Aurora on a fresh cluster, fresh reboot of the nodes. And on master but not 2.6
That implies to me that there's a bug unrelated to the issue this patch addresses. Also odd that it only happens in master, when 2.6 is also handling hugepages the old way.
Maybe there are multiple issues causing similar symptoms. I think the basic idea of this patch (only allocate hugepages if they haven't already been allocated by a previous DAOS run) makes sense regardless. We don't want to reallocate hugepages if daos_server needs to be restarted. I just don't like the high default. In odd cases where the number of targets could change (testing/tuning/experimenting), I think the user should probably set nr_hugepages to give themselves wiggle room.
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/6/execution/node/1570/log
Yeah, I agree there could be some other issues besides the potential fragmentation caused by the unnecessary hugepage re-allocation. One wild guess is that the kernel might allocate hugepages for the two NUMA nodes in a sort of interleaved mode, so that the allocated pages for a given NUMA node could be fragmented? @tanabarr, any thoughts? I remember that we switched to NUMA-aware hugepage allocation in setup.sh (the engine was also changed to use the NUMA-aware DMA alloc API) in an earlier version (probably 2.4? Or 2.6?); I'm wondering if that's something we need to investigate further.
The reason it happens on master only is that I bumped the initial DMA allocation from a fixed number (24 chunks per xstream) to 50% of the total DMA buffer (64 chunks per xstream by default) on master (it's not backported to 2.6). I made this change because we observed run-time DMA allocation failures on both Aurora and CI testing; the failure won't interrupt the application, but it does hurt performance under high workloads that require more DMA buffers, so I decided to pre-allocate more on engine start.
@tanabarr and I discussed this offline; his proposal is that we just prevent allocation if the total system number is more than either the calculated value or what is specified in the config, and we ask the admin (or change the CI test/image) to allocate enough hugepages on boot to avoid fragmentation. @kjacque Does that sound reasonable to you?
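Under one reading of that proposal (assuming the config's nr_hugepages, when set, takes precedence over the calculated value), the start-up decision would look something like this (illustrative names only, not the actual implementation):

```go
package hugeutil

// hugepageTarget picks the number of hugepages daos_server wants at start-up:
// the explicit nr_hugepages value from the server config file if set (> 0),
// otherwise the count calculated from the engine configuration.
// (Illustrative only.)
func hugepageTarget(calculated, configured int) int {
	if configured > 0 {
		return configured
	}
	return calculated
}

// needsAllocation reports whether a kernel request is still required: only
// when the existing system total falls short of the target. If the admin
// (or the CI image) allocates enough pages on boot, this returns false and
// no run-time allocation happens.
func needsAllocation(systemTotal, calculated, configured int) bool {
	return systemTotal < hugepageTarget(calculated, configured)
}
```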
Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/6/execution/node/1554/log
This makes sense to me.
We should update the admin guide and any quickstart documentation, and maybe the server config file as well, to make sure users are aware of the potential issue with hugepage fragmentation and our recommendations to mitigate it.
I guess we have to see whether we can allocate across NUMA nodes on boot; I imagine we can. Let's see the results of the PR first.
What I'm seeing in https://daosio.atlassian.net/browse/DAOS-16921 is, as previously stated, that it only occurs on master, but it also only occurs when running the tests on the wolf-[52,122-125] cluster (Label wolf-52_nvme5). Due to its hardware configuration this cluster runs with the ofi+tcp provider instead of verbs, and the same 4 tests always fail when run on this cluster.
This can be targeted in CI by setting:
- Test-tag: PoolCreateAllHwTests PoolCreateCapacityTests RbldNoCapacity RbldWithIO
- FUNCTIONAL_HARDWARE_MEDIUM_LABEL = wolf-52_nvme5
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/7/execution/node/267/log
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/7/execution/node/243/log
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15848/8/execution/node/1129/log