harvester icon indicating copy to clipboard operation
harvester copied to clipboard

[BUG] Can't Enable SR-IOV GPU in UI

Open noahgildersleeve opened this issue 2 years ago • 2 comments

Describe the bug

You can't enable SR-IOV GPU in the UI

To Reproduce Steps to reproduce the behavior:

  1. Enable both pcidevices-controller and nvidia-driver-toolkit addon on a system with a GPU installed
  2. Navigate to SR-IOV GPU Devices in the left sidebar
  3. Enable one of the GPUS by clicking the three dots on the right and selecting enable Expected behavior

The GPU should enable and be available as an option in vGPU devices

Support bundle

supportbundle_b580a14b-9c0c-465c-ad49-dcdde8cfbb84_2024-02-06T01-37-41Z.zip

Environment

  • Harvester ISO version: v1.3.0-rc1
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): 2 node DL360 bare metal

Additional context

Found while testing #2764 . Greenshot 2024-02-05 17 59 31

It's worth noting that on this cluster there are 2 nodes. Both have GPUs, but one them shouldn't support this feature. The issue might be related to the feature check on the chipset.

noahgildersleeve avatar Feb 06 '24 02:02 noahgildersleeve

cc @ibrokethecloud

khushboo-rancher avatar Feb 06 '24 23:02 khushboo-rancher

There are 2 issues:

  1. nvidia driver toolkit addon needs a UI enhancement to allow users to specify an override image and location for GPU driver. Currently enabling the addon without editing the yaml results in an invalid endpoint definition as a result of which the driver is never installed. @torchiaf need your assistance on this.
  2. the 2nd node in the cluster has a NVIDIA GPU which only supports MIG mode vGPU. These GPU's should be ignored by the pcidevices controller and the PR: https://github.com/harvester/pcidevices/pull/65 addresses the same.

ibrokethecloud avatar Feb 07 '24 04:02 ibrokethecloud

Pre Ready-For-Testing Checklist

  • [ ] If labeled: require/HEP Has the Harvester Enhancement Proposal PR submitted? The HEP PR is at:

  • [ ] Where is the reproduce steps/test steps documented? The reproduce steps/test steps are at:

  • [ ] Is there a workaround for the issue? If so, where is it documented? The workaround is at:

  • [ ] Have the backend code been merged (harvester, harvester-installer, etc) (including backport-needed/*)? The PR is at:

    • [ ] Does the PR include the explanation for the fix or the feature?

    • [ ] Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart? The PR for the YAML change is at: The PR for the chart change is at:

  • [ ] If labeled: area/ui Has the UI issue filed or ready to be merged? The UI issue/PR is at:

  • [ ] If labeled: require/doc, require/knowledge-base Has the necessary document PR submitted or merged? The documentation/KB PR is at:

  • [ ] If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue?

    • The automation skeleton PR is at:
    • The automation test case PR is at:
  • [ ] If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility? The compatibility issue is filed at:

Automation e2e test issue: harvester/tests#1159

Validated in master-a2c98e96-head (build right after v1.3.0-rc4). Closing as fixed.

noahgildersleeve avatar Mar 13 '24 00:03 noahgildersleeve