harvester icon indicating copy to clipboard operation
harvester copied to clipboard

[BUG] When you add a vGPU to a VM it won't boot

Open noahgildersleeve opened this issue 2 years ago • 3 comments

Describe the bug

When you add a vGPU to a VM it won't boot

To Reproduce Steps to reproduce the behavior:

  1. Enable both pcidevices-controller and nvidia-driver-toolkit addon on a system with a GPU installed
  2. Navigate to SR-IOV GPU Devices in the left sidebar
  3. Enable one of the GPUs and select a vGPU profile Greenshot 2024-02-06 12 47 15 Greenshot 2024-02-06 14 25 41
  4. Add the vGPU profile to a VM (I used a ubuntu focal VM) Greenshot 2024-02-06 14 25 57
  5. Save and restart the VM

Expected behavior

The vGPU should attach and the VM should boot

Support bundle

supportbundle_b580a14b-9c0c-465c-ad49-dcdde8cfbb84_2024-02-06T23-44-11Z.zip

Environment

  • Harvester ISO version: v1.3.0-rc1
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): 2 node bare metal DL360

Additional context Add any other context about the problem here. Greenshot 2024-02-06 14 26 41

noahgildersleeve avatar Feb 06 '24 23:02 noahgildersleeve

This might be fixed with this.

noahgildersleeve avatar Feb 07 '24 00:02 noahgildersleeve

the mutating webhook fixes are now available in rc2

ibrokethecloud avatar Feb 07 '24 23:02 ibrokethecloud

Tested in v1.3.0-rc2 in 2 node bare metal setup. Following the same steps as above the VM wouldn't start. It's worth noting that for this VM I'm using a custom storage class with 2 replicas on NVME with it being selected by disk tag.

supportbundle_22b5e31c-5e80-4018-8e22-7e031e31b980_2024-02-09T00-49-37Z.zip

noahgildersleeve avatar Feb 09 '24 01:02 noahgildersleeve

Pre Ready-For-Testing Checklist

  • [ ] If labeled: require/HEP Has the Harvester Enhancement Proposal PR submitted? The HEP PR is at:

  • [x] Where is the reproduce steps/test steps documented? The reproduce steps/test steps are at:

  • Follow the reproduce steps in the issue description. A VM with vGPU should boot.
  • [ ] Is there a workaround for the issue? If so, where is it documented? The workaround is at:

  • [ ] Have the backend code been merged (harvester, harvester-installer, etc) (including backport-needed/*)? The PR is at:

    • [ ] Does the PR include the explanation for the fix or the feature?

    • [ ] Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart? The PR for the YAML change is at: The PR for the chart change is at:

  • [ ] If labeled: area/ui Has the UI issue filed or ready to be merged? The UI issue/PR is at:

  • [ ] If labeled: require/doc, require/knowledge-base Has the necessary document PR submitted or merged? The documentation/KB PR is at:

  • [ ] If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue?

    • The automation skeleton PR is at:
    • The automation test case PR is at:
  • [ ] If the fix introduces the code for backward compatibility Has a separate issue been filed with the label release/obsolete-compatibility? The compatibility issue is filed at:

Automation e2e test issue: harvester/tests#1110

Tested in v1.3.0-rc3. Verified as fixed.

noahgildersleeve avatar Mar 05 '24 01:03 noahgildersleeve