sriov-network-operator icon indicating copy to clipboard operation
sriov-network-operator copied to clipboard

feat: implement MlxResetFW to reset the FW on VF changes

Open tobiasgiese opened this issue 1 year ago • 30 comments
trafficstars

Todos:

  • [x] implement new method to mstfwreset
  • [x] add feature flag to opt-in/opt-out
  • [x] add unit test
  • [x] test it manually or via e2e test job (preferred)

Manual test shows that the implementation is working:

sriov-network-config-daemon 2024-07-11T14:39:36.793959572Z    INFO    mellanox/mellanox_plugin.go:190 mellanox-plugin: resetFW()      {"cmd-args": ["-d", "0000:03:00.0", "-l", "3", "-y", "reset"]}
foo@bar:~$ sudo mlxconfig -d /dev/mst/mt4119_pciconf0 q | grep -e SRIOV_EN -e NUM_OF_VFS
        NUM_OF_VFS                                  4
        SRIOV_EN                                    True(1)

tobiasgiese avatar Jul 11 '24 14:07 tobiasgiese

Thanks for your PR, To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor. Best regards.

github-actions[bot] avatar Jul 11 '24 14:07 github-actions[bot]

Thanks for your PR, To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor. Best regards.

github-actions[bot] avatar Jul 12 '24 11:07 github-actions[bot]

Thanks for your PR, To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor. Best regards.

github-actions[bot] avatar Jul 12 '24 15:07 github-actions[bot]

Thanks for your PR, To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor. Best regards.

github-actions[bot] avatar Jul 15 '24 14:07 github-actions[bot]

Thanks for your PR, To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor. Best regards.

github-actions[bot] avatar Jul 15 '24 14:07 github-actions[bot]

Pull Request Test Coverage Report for Build 10179580977

Details

  • 1 of 63 (1.59%) changed or added relevant lines in 6 files are covered.
  • 6 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.2%) to 43.816%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/daemon/daemon.go 1 7 14.29%
pkg/helper/mock/mock_helper.go 0 10 0.0%
pkg/plugins/mellanox/mellanox_plugin.go 0 10 0.0%
pkg/vendors/mellanox/mock/mock_mellanox.go 0 10 0.0%
cmd/sriov-network-config-daemon/start.go 0 13 0.0%
pkg/vendors/mellanox/mellanox.go 0 13 0.0%
<!-- Total: 1 63
Files with Coverage Reduction New Missed Lines %
controllers/drain_controller.go 1 68.06%
pkg/plugins/mellanox/mellanox_plugin.go 5 7.64%
<!-- Total: 6
Totals Coverage Status
Change from base Build 10178846427: -0.2%
Covered Lines: 6536
Relevant Lines: 14917

💛 - Coveralls

coveralls avatar Jul 15 '24 14:07 coveralls

Thanks for your PR, To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor. Best regards.

github-actions[bot] avatar Jul 15 '24 16:07 github-actions[bot]

/test-all

tobiasgiese avatar Jul 15 '24 16:07 tobiasgiese

Thanks for your PR, To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor. Best regards.

github-actions[bot] avatar Jul 16 '24 07:07 github-actions[bot]

Thanks for your PR, To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor. Best regards.

github-actions[bot] avatar Jul 16 '24 07:07 github-actions[bot]

Thanks for your PR, To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor. Best regards.

github-actions[bot] avatar Jul 16 '24 08:07 github-actions[bot]

/test-all

tobiasgiese avatar Jul 16 '24 09:07 tobiasgiese

Thanks for your PR, To run vendors CIs use one of:

  • /test-all: To run all tests for all vendors.
  • /test-e2e-all: To run all E2E tests for all vendors.
  • /test-e2e-nvidia-all: To run all E2E tests for NVIDIA vendor.

To skip the vendors CIs use one of:

  • /skip-all: To skip all tests for all vendors.
  • /skip-e2e-all: To skip all E2E tests for all vendors.
  • /skip-e2e-nvidia-all: To skip all E2E tests for NVIDIA vendor. Best regards.

github-actions[bot] avatar Jul 16 '24 10:07 github-actions[bot]

@zeeke @adrianchiris the GH Action ci-triggers change seems to work. After a rebase and push no new comment has been added. But after closing and reopening there is one. 👍🏻

tobiasgiese avatar Jul 16 '24 10:07 tobiasgiese

/test-all

tobiasgiese avatar Jul 16 '24 11:07 tobiasgiese

/test-all

adrianchiris avatar Jul 16 '24 11:07 adrianchiris

/test-all

adrianchiris avatar Jul 17 '24 11:07 adrianchiris

/test-e2e-nvidia-all

adrianchiris avatar Jul 17 '24 12:07 adrianchiris

/test-e2e-nvidia-all

adrianchiris avatar Jul 17 '24 13:07 adrianchiris

/test-e2e-nvidia-all

tobiasgiese avatar Jul 18 '24 15:07 tobiasgiese

/test-e2e-nvidia-all

tobiasgiese avatar Jul 19 '24 11:07 tobiasgiese

Nvidia e2e test depends on https://github.com/k8snetworkplumbingwg/sriov-network-operator/pull/742

tobiasgiese avatar Jul 19 '24 15:07 tobiasgiese

/test-all

tobiasgiese avatar Jul 22 '24 11:07 tobiasgiese

/test-e2e-nvidia-all

tobiasgiese avatar Jul 23 '24 08:07 tobiasgiese

/test-all

tobiasgiese avatar Jul 24 '24 13:07 tobiasgiese

and if we reset the primary nic of the system we will not have an API connection to reconcile and we can leave the system in a bad state...

Usually the reset of the fw should work. Also the code of the mstfwrest binary is tested. Additionally, we have a feature gate top opt-in, which means this is not enabled for the default usage. IMHO we should rely on the robustness of the reset. Or have you already had experience with the reset and that's why concerns? WDYT @adrianchiris

tobiasgiese avatar Jul 29 '24 08:07 tobiasgiese

Hi can you please rebase this one

SchSeba avatar Aug 04 '24 07:08 SchSeba

LGTM from my sided. just needs rebase

adrianchiris avatar Aug 04 '24 08:08 adrianchiris

Rebased, thanks for the hint. Didn't noticed the wrong commits

tobiasgiese avatar Aug 05 '24 05:08 tobiasgiese

/test-all

tobiasgiese avatar Aug 05 '24 05:08 tobiasgiese