[BUG] When you add a vGPU to a VM it won't boot
Describe the bug
When you add a vGPU to a VM it won't boot
To Reproduce Steps to reproduce the behavior:
- Enable both pcidevices-controller and nvidia-driver-toolkit addon on a system with a GPU installed
- Navigate to SR-IOV GPU Devices in the left sidebar
- Enable one of the GPUs and select a vGPU profile
- Add the vGPU profile to a VM (I used a ubuntu focal VM)
- Save and restart the VM
Expected behavior
The vGPU should attach and the VM should boot
Support bundle
supportbundle_b580a14b-9c0c-465c-ad49-dcdde8cfbb84_2024-02-06T23-44-11Z.zip
Environment
- Harvester ISO version: v1.3.0-rc1
- Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630): 2 node bare metal DL360
Additional context
Add any other context about the problem here.
This might be fixed with this.
the mutating webhook fixes are now available in rc2
Tested in v1.3.0-rc2 in 2 node bare metal setup. Following the same steps as above the VM wouldn't start. It's worth noting that for this VM I'm using a custom storage class with 2 replicas on NVME with it being selected by disk tag.
supportbundle_22b5e31c-5e80-4018-8e22-7e031e31b980_2024-02-09T00-49-37Z.zip
Pre Ready-For-Testing Checklist
-
[ ] If labeled: require/HEP Has the Harvester Enhancement Proposal PR submitted? The HEP PR is at:
-
[x] Where is the reproduce steps/test steps documented? The reproduce steps/test steps are at:
- Follow the reproduce steps in the issue description. A VM with vGPU should boot.
-
[ ] Is there a workaround for the issue? If so, where is it documented? The workaround is at:
-
[ ] Have the backend code been merged (harvester, harvester-installer, etc) (including
backport-needed/*)? The PR is at:-
[ ] Does the PR include the explanation for the fix or the feature?
-
[ ] Does the PR include deployment change (YAML/Chart)? If so, where are the PRs for both YAML file and Chart? The PR for the YAML change is at: The PR for the chart change is at:
-
-
[ ] If labeled: area/ui Has the UI issue filed or ready to be merged? The UI issue/PR is at:
-
[ ] If labeled: require/doc, require/knowledge-base Has the necessary document PR submitted or merged? The documentation/KB PR is at:
-
[ ] If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue?
- The automation skeleton PR is at:
- The automation test case PR is at:
-
[ ] If the fix introduces the code for backward compatibility Has a separate issue been filed with the label
release/obsolete-compatibility? The compatibility issue is filed at:
Automation e2e test issue: harvester/tests#1110
Tested in v1.3.0-rc3. Verified as fixed.