daos icon indicating copy to clipboard operation
daos copied to clipboard

DAOS-14550 control: Prevent SPDK calls if disable_hugepages true in c…

Open tanabarr opened this issue 1 year ago • 6 comments

…fg (#13605)

Avoid calling into SPDK if disable_hugepages is set in the server config file. This change prevents access during StorageScan, StorageFormat, StorageNvmeRebind and StorageNvmeAddDevice server ControlService handlers and when cleaning hugepages during start-up.

Also:

  • Improve fault information messages
  • Improve skip related messaging during format
  • Cover missing config in storage command handler unit tests
  • Fix quirks in storage command display output
  • Improve unit test coverage of storage pretty printers

Required-githooks: true

Before requesting gatekeeper:

  • [ ] Two review approvals and any prior change requests have been resolved.
  • [ ] Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • [x] Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • [x] Commit messages follows the guidelines outlined here.
  • [x] Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • [ ] You are the appropriate gatekeeper to be landing the patch.
  • [ ] The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • [ ] Githooks were used. If not, request that user install them and check copyright dates.
  • [ ] Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • [ ] All builds have passed. Check non-required builds for any new compiler warnings.
  • [ ] Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • [ ] If applicable, the PR has addressed any potential version compatibility issues.
  • [ ] Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
  • [ ] Extra checks if forced landing is requested
    • [ ] Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • [ ] No new NLT or valgrind warnings. Check the classic view.
    • [ ] Quick-build or Quick-functional is not used.
  • [ ] Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

tanabarr avatar Feb 06 '24 15:02 tanabarr

Bug-tracker data: Ticket title is 'Remove all SPDK activity calls if hugepages disabled in config' Status is 'Awaiting backport' Labels: 'scrubbed,triaged' Job should run at elevated priority (1) https://daosio.atlassian.net/browse/DAOS-14550

github-actions[bot] avatar Feb 06 '24 15:02 github-actions[bot]

Please, could you tell me which part of the code has been changed from the cherry-pick.

knard38 avatar Feb 07 '24 08:02 knard38

Please, could you tell me which part of the code has been changed from the cherry-pick.

There were quite a few changes due to the difference in the 2 branches wrt bdev scan.

╰─➤  diff -u <(git show --stat 8116293bdde5a67acdda6e175efb252722df6245) <(git show --stat 2d71a9c93271a5ada9bdab81f167cd0115359599)
--- /proc/self/fd/11    2024-02-06 23:17:36.751794148 +0000
+++ /proc/self/fd/13    2024-02-06 23:17:36.751794148 +0000
@@ -1,6 +1,6 @@
-commit 8116293bdde5a67acdda6e175efb252722df6245
+commit 2d71a9c93271a5ada9bdab81f167cd0115359599
 Author: Tom Nabarro <[email protected]>
-Date:   Wed Jan 31 16:53:50 2024 +0000
+Date:   Tue Feb 6 15:09:05 2024 +0000

     DAOS-14550 control: Prevent SPDK calls if disable_hugepages true in cfg (#13605)

@@ -16,20 +16,21 @@
       * Fix quirks in storage command display output
       * Improve unit test coverage of storage pretty printers

+    Features: control
+    Required-githooks: true
+
     Signed-off-by: Tom Nabarro <[email protected]>

- src/control/cmd/dmg/pretty/storage.go           |   5 +-
- src/control/cmd/dmg/pretty/storage_nvme.go      |  22 +--
- src/control/cmd/dmg/pretty/storage_scm.go       |  13 +-
- src/control/cmd/dmg/pretty/storage_test.go      | 111 +++++++++++--
- src/control/fault/code/codes.go                 |   5 +-
- src/control/server/config/faults.go             |   8 +-
- src/control/server/config/server.go             |   2 +-
- src/control/server/config/server_test.go        |   4 +-
- src/control/server/ctl_storage_rpc.go           | 107 +++++++++---
- src/control/server/ctl_storage_rpc_test.go      | 209 ++++++++++++++++++++++--
- src/control/server/faults.go                    |   7 +-
- src/control/server/instance_storage_rpc.go      |   2 +-
- src/control/server/instance_storage_rpc_test.go |  17 +-
- src/control/server/server_utils.go              |  10 +-
- 14 files changed, 422 insertions(+), 100 deletions(-)
+ src/control/cmd/dmg/pretty/storage.go      |   5 +-
+ src/control/cmd/dmg/pretty/storage_nvme.go |  23 ++--
+ src/control/cmd/dmg/pretty/storage_scm.go  |  13 +-
+ src/control/cmd/dmg/pretty/storage_test.go | 111 +++++++++++++++--
+ src/control/fault/code/codes.go            |   5 +-
+ src/control/server/config/faults.go        |   8 +-
+ src/control/server/config/server.go        |   2 +-
+ src/control/server/config/server_test.go   |   4 +-
+ src/control/server/ctl_storage_rpc.go      | 113 ++++++++++++++----
+ src/control/server/ctl_storage_rpc_test.go | 186 ++++++++++++++++++++++++++---
+ src/control/server/faults.go               |   7 +-
+ src/control/server/server_utils.go         |  10 +-
+ 12 files changed, 403 insertions(+), 84 deletions(-)

Shows that the main differences between the 2 commits are in ctl_storage_rpc{,_test}.go. Does that help?

tanabarr avatar Feb 07 '24 10:02 tanabarr

Is this patch necessary for 2.4? Given its size and complexity, it seems somewhat risky. If I understand correctly, the main goal is to correctly avoid failures in bdev scan if the configuration has disabled hugepage allocation. Unless this was specifically requested as a backport for the 2.4.x releases, I don't see a problem with just noting that this won't work correctly and that it will be fixed in 2.6+.

mjmac avatar Feb 07 '24 19:02 mjmac

Is this patch necessary for 2.4? Given its size and complexity, it seems somewhat risky. If I understand correctly, the main goal is to correctly avoid failures in bdev scan if the configuration has disabled hugepage allocation. Unless this was specifically requested as a backport for the 2.4.x releases, I don't see a problem with just noting that this won't work correctly and that it will be fixed in 2.6+.

fine by me, looks like it's missed 2.4.2 anyway

tanabarr avatar Feb 07 '24 21:02 tanabarr

Is this patch necessary for 2.4? Given its size and complexity, it seems somewhat risky. If I understand correctly, the main goal is to correctly avoid failures in bdev scan if the configuration has disabled hugepage allocation. Unless this was specifically requested as a backport for the 2.4.x releases, I don't see a problem with just noting that this won't work correctly and that it will be fixed in 2.6+.

fine by me, looks like it's missed 2.4.2 anyway

@tanabarr @mjmac Do we want to close this then?

daltonbohning avatar Mar 15 '24 17:03 daltonbohning

Closing as per @mjmac 's suggestion

tanabarr avatar Mar 21 '24 11:03 tanabarr