DAOS-4183 engine: reduce virtual memory and swap footprint

Open bfaccini opened this issue 2 years ago • 31 comments

To limit the virtual memory and swap footprint, only mmap() the exact requested stack size (which the kernel rounds up to the page size), manage the different stack sizes with a b-tree, and use the MAP_NORESERVE flag.

Required-githooks: true

Signed-off-by: Bruno Faccini [email protected]
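
For readers unfamiliar with the technique, here is a minimal sketch of mapping a stack of exactly the requested size with MAP_NORESERVE. It is not the code from this PR: the helper names `stack_round_up`, `stack_map` and `stack_unmap` are hypothetical, the `MAP_PRIVATE | MAP_ANONYMOUS` flags are an assumption, and the b-tree that manages pools of different sizes is not shown.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: round a requested size up to a whole number of
 * pages, which is what the kernel does to the mmap() length anyway. */
static size_t stack_round_up(size_t size)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);

	return (size + page - 1) / page * page;
}

/* Hypothetical helper: map only the requested stack size.
 * MAP_NORESERVE asks the kernel not to reserve swap space for the
 * mapping. MAP_PRIVATE | MAP_ANONYMOUS are assumed here and may differ
 * from what the PR uses. Returns NULL on failure (errno is set). */
static void *stack_map(size_t size)
{
	void *addr;

	assert(size > 0);
	addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	return addr == MAP_FAILED ? NULL : addr;
}

/* Hypothetical helper: release a stack obtained from stack_map().
 * The size must match the one used for the mapping. */
static int stack_unmap(void *addr, size_t size)
{
	return munmap(addr, size);
}
```

A caller would typically pass `stack_round_up(requested)` to both `stack_map()` and `stack_unmap()`, so that the length it tracks matches what the kernel actually mapped.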

Before requesting gatekeeper:

  • [ ] Two review approvals and any prior change requests have been resolved.
  • [ ] Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • [ ] Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • [ ] Commit messages follow the guidelines outlined here.
  • [ ] Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • [ ] You are the appropriate gatekeeper to be landing the patch.
  • [ ] The PR has 2 reviews by people familiar with the code, including appropriate watchers.
  • [ ] Githooks were used. If not, request that user install them and check copyright dates.
  • [ ] Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • [ ] All builds have passed. Check non-required builds for any new compiler warnings.
  • [ ] Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • [ ] If applicable, the PR has addressed any potential version compatibility issues.
  • [ ] Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • [ ] Extra checks if forced landing is requested
    • [ ] Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • [ ] No new NLT or valgrind warnings. Check the classic view.
    • [ ] Quick-build or Quick-functional is not used.
  • [ ] Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

bfaccini avatar Nov 08 '22 22:11 bfaccini

Bug-tracker data:
Ticket title is 'io-server segfaults when pmdk built with ndctl'
Status is 'In Review'
Labels: 'q4_fix,triaged'
https://daosio.atlassian.net/browse/DAOS-4183

github-actions[bot] avatar Nov 08 '22 22:11 github-actions[bot]

Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10808/1/execution/node/145/log

daosbuild1 avatar Nov 08 '22 22:11 daosbuild1

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10808/1/execution/node/1083/log

daosbuild1 avatar Nov 10 '22 09:11 daosbuild1

Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10808/3/execution/node/146/log

daosbuild1 avatar Nov 12 '22 19:11 daosbuild1

Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10808/4/execution/node/146/log

daosbuild1 avatar Nov 15 '22 16:11 daosbuild1

!!! This PR has allowed runs with "dedup:memcmp" properties to succeed on Frontera, instead of failing with ENOMEM as before.

bfaccini avatar Nov 15 '22 16:11 bfaccini

Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10808/4/execution/node/343/log

daosbuild1 avatar Nov 15 '22 16:11 daosbuild1

Test stage Build RPM on Leap 15 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10808/4/execution/node/299/log

daosbuild1 avatar Nov 15 '22 16:11 daosbuild1

Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10808/4/execution/node/338/log

daosbuild1 avatar Nov 15 '22 16:11 daosbuild1

Test stage Build on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10808/4/execution/node/439/log

daosbuild1 avatar Nov 15 '22 16:11 daosbuild1

Test stage Build on Leap 15 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10808/4/execution/node/478/log

daosbuild1 avatar Nov 15 '22 16:11 daosbuild1

Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10808/5/execution/node/145/log

daosbuild1 avatar Nov 15 '22 18:11 daosbuild1

Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10808/6/execution/node/146/log

daosbuild1 avatar Nov 15 '22 22:11 daosbuild1

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10808/6/execution/node/1084/log

daosbuild1 avatar Nov 17 '22 10:11 daosbuild1

There were only 2 tests in error during the last CI session:

Existing failures - 2

Test Hardware / Functional Hardware Medium / 3-./osa/online_extend.py:OSAOnlineExtend.test_osa_online_extend_oclass;run-aggregation-checksum-container-daos_racer-extra_servers-hosts-ior-client_processes-iorflags-job_manager-loop_test-mdtest-wr_size-32K-pool-rebuild-server_config-engines-0-storage-0-1-setup-test_obj_class-test_ranks-dc35 – FTEST_osa.OSAOnlineExtend -->> Time-out already being addressed by DAOS-12054

Test Hardware / Functional Hardware Medium / 1-./scrubber/target_auto_eviction.py:TestWithScrubberTargetEviction.test_scrubber_ssd_auto_eviction;run-agent_config-transport_config-container-dmg-faults-hosts-ior-client_processes-pool-server_config-engines-0-storage-0-1-setup-9bf0 – FTEST_scrubber.TestWithScrubberTargetEviction -->> Time-out already being addressed by DAOS-11950

Is it ok not to rerun CI (and thus not add more to the Jenkins job queue...)?

bfaccini avatar Nov 17 '22 14:11 bfaccini

@johannlombardi @NiuYawei can you review when you have some time ?! thx in advance ;-)

bfaccini avatar Nov 17 '22 17:11 bfaccini

One comment, it looks like this enables the feature by default too. Should that be in the message?

jolivier23 avatar Dec 15 '22 23:12 jolivier23

> One comment, it looks like this enables the feature by default too. Should that be in the message?

Oops, right @jolivier23 , I forgot that I had enabled it to expose it to the full CI testing... Should I remove the specific change that enables by default (is everybody ok to enable the mmap()ing of ULTs stacks by default ?) ? Or just change the main PR msg to indicate it as you have suggested ?

bfaccini avatar Dec 16 '22 16:12 bfaccini

> > One comment, it looks like this enables the feature by default too. Should that be in the message?
>
> Oops, right @jolivier23 , I forgot that I had enabled it to expose it to the full CI testing... Should I remove the specific change that enables by default (is everybody ok to enable the mmap()ing of ULTs stacks by default ?) ? Or just change the main PR msg to indicate it as you have suggested ?

I'd be ok with just changing the description but it's probably a question for @johannlombardi whether we should enable the feature by default.

jolivier23 avatar Dec 16 '22 21:12 jolivier23

> > > One comment, it looks like this enables the feature by default too. Should that be in the message?
> >
> > Oops, right @jolivier23 , I forgot that I had enabled it to expose it to the full CI testing... Should I remove the specific change that enables by default (is everybody ok to enable the mmap()ing of ULTs stacks by default ?) ? Or just change the main PR msg to indicate it as you have suggested ?
>
> I'd be ok with just changing the description but it's probably a question for @johannlombardi whether we should enable the feature by default.

@johannlombardi I know you previously reviewed and approved but just wanted to make sure you were ok with the change of default in particular for this feature.

jolivier23 avatar Jan 17 '23 20:01 jolivier23

I am ok with the patch and to eventually change it. The issue is that we still got a perf impact on frontera for IOPS benchmarks IIRC. If so, we should address this before enabling it by default.

johannlombardi avatar Jan 24 '23 07:01 johannlombardi

> I am ok with the patch and to eventually change it. The issue is that we still got a perf impact on frontera for IOPS benchmarks IIRC. If so, we should address this before enabling it by default.

Ah, you got new+bad perf numbers from Dalton ? Will push a new commit to remove default enabling...

bfaccini avatar Jan 25 '23 14:01 bfaccini

Test stage checkpatch completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10808/7/execution/node/174/log

daosbuild1 avatar Jan 25 '23 14:01 daosbuild1

Test stage Scan Leap 15.4 RPMs completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10808/7/execution/node/890/log

daosbuild1 avatar Jan 25 '23 15:01 daosbuild1

One more review round, sorry guys... To be honest, I don't remember the last time I forgot to ask you to review again :-(

bfaccini avatar Feb 09 '23 22:02 bfaccini

Even if the mmap()'ed ULT stacks feature seems to introduce some penalty, I would like this PR to land, since the feature can be used at least for debugging purposes. I would like some feedback: what do all my reviewers think about this?

bfaccini avatar Aug 07 '23 17:08 bfaccini

> Even if the mmap()'ed ULT stacks feature seems to introduce some penalty, I would like this PR to land, since the feature can be used at least for debugging purposes. I would like some feedback: what do all my reviewers think about this?

I think if it's disabled by default, or at least for release builds, maybe it's okay to land?

daltonbohning avatar Aug 07 '23 18:08 daltonbohning

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10808/10/execution/node/1267/log

daosbuild1 avatar Sep 25 '23 20:09 daosbuild1

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-10808/10/execution/node/1313/log

daosbuild1 avatar Sep 25 '23 21:09 daosbuild1

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-10808/10/display/redirect

daosbuild1 avatar Sep 27 '23 08:09 daosbuild1