[BUG] Unable to enable the GPUs via the GUI
Describe the bug
Unable to enable the GPUs installed in the server from the UI.
Two GPUs are listed (2x NVIDIA L40), but selecting one and clicking Enable does nothing.
To Reproduce
Go to SR-IOV GPU Devices, select a listed GPU, and click the three dots to enable it.
Expected behavior
The GPU should be enabled.
Support bundle
Please reach out to request one securely.
Environment
- Harvester ISO version: 1.3.0
- Underlying Infrastructure: DELL R760XA Baremetal with Dual NVIDIA L40 GPUs
Are you able to confirm that the nvidia driver addon is enabled?
Hi, yes, I've enabled the driver addon and it's showing as working.
If I look in the PCI device list I can also see the two GPUs, and I have not enabled them for passthrough.
Are the pcidevices-controllers crashing? I had to increase the limits on them and was able to enable.
Hi, I'm not sure how I can check this. How did you increase the limits?
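One way to check for crashing controllers is to look at pod restart counts. The sketch below is a small filter over `kubectl get pods` output. The function name `restarting_pods`, the `harvester-system` namespace, and the `pcidevices` name filter are assumptions for illustration, not details confirmed in this thread.

```shell
# Print pods whose RESTARTS count is above zero.
# Reads `kubectl get pods` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE).
restarting_pods() {
  awk 'NF >= 4 && $1 != "NAME" && $4 + 0 > 0 { print $1, $3, $4 }'
}

# Example usage on a node with kubectl access. The namespace and the
# name filter are assumptions and may differ on your cluster:
#   kubectl get pods -n harvester-system | grep pcidevices | restarting_pods
```

A pod that shows up here with a growing restart count is worth inspecting further, for example with `kubectl describe pod` or `kubectl logs --previous`.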
The nvidia-driver-toolkit needs the driver location, which is an HTTP endpoint where the NVIDIA KVM driver is located.
From that screenshot it looks like this has not been edited, so no real driver has been installed on the underlying hosts.
Oh OK, that makes sense then.
The host is not currently internet connected (it sits behind a proxy and I am trying to locate all the URLs to whitelist in our proxy).
Will internet access resolve this, or do I still need to find a location?
The HTTP endpoint is supposed to be an internal HTTP server where you can host the drivers. You will need access to the NVIDIA portal to download the NVIDIA KVM drivers. These are different from the open-source drivers.
Please refer to the docs: https://docs.harvesterhci.io/v1.3/advanced/addons/nvidiadrivertoolkit
OK, I downloaded the latest KVM drivers, put them on an internal web server, and updated the driver location, but still no luck.
I tested that I can hit the URL from my PC and the file starts downloading. I can also ping the web server host from the Harvester host.
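A download that works from a PC, plus a successful ping, does not by itself show that the Harvester nodes can fetch the file over HTTP. A minimal sketch of an HTTP-level check follows, assuming `curl` is available on the node. The function name `check_driver_url` and the example URL are hypothetical.

```shell
# Succeed only if a headers-only request to the given URL succeeds.
# -f: fail on HTTP error codes, -sS: quiet but still print errors,
# -I: request headers only, so the driver file is not downloaded.
check_driver_url() {
  curl -fsSI --max-time 10 "$1" > /dev/null
}

# Example usage, run over SSH on each Harvester node (placeholder URL):
#   check_driver_url "http://webserver.example.internal/nvidia-kvm-driver.run" \
#     && echo "reachable" || echo "not reachable"
```

This only approximates what the driver toolkit pod does, since the pod may see different proxy or DNS settings than the node's shell.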
Any chance I could get a support bundle to figure out what is going on? There should be messages in the nvidia driver toolkit container / pcidevices logs that would provide insight into what is happening.
Sure, what is the best way to provide it to you securely?
please email the bundle to [email protected]
The nvidia-driver-runtime image cannot be pulled by your nodes:
nvidia-driver-runtime-5vvwn 0/1 ImagePullBackOff 0 21h
This image is not shipped in the ISO and needs to be mirrored to your private registry if your nodes do not have access to Docker Hub.
Once the image is available, please update the image details in the addon to point to your private registry.
This is mentioned in the docs: https://docs.harvesterhci.io/v1.3/advanced/addons/nvidiadrivertoolkit
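To spot pods stuck in a state such as ImagePullBackOff, like the one quoted above, a small filter over `kubectl get pods` output can help. This is a sketch only: the function name `unhealthy_pods` is hypothetical, and the `harvester-system` namespace in the usage comment is an assumption.

```shell
# Print pods whose STATUS is neither Running nor Completed.
# Reads `kubectl get pods` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE).
unhealthy_pods() {
  awk 'NF >= 3 && $1 != "NAME" && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Example usage on a node with kubectl access (namespace is an assumption):
#   kubectl get pods -n harvester-system | unhealthy_pods
```

For any pod it lists, the Events section of `kubectl describe pod <name>` usually names the image and the registry error behind the failed pull.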
Oh OK, I didn't realise that. I thought I just had to download the driver and host it on a web server, which I have done.
Do I need to set up this private registry as well? Do I just deploy SUSE MicroOS and set it up as a private registry?
Thanks
The private registry is a container registry. I do not think microOS contains a registry of its own. You could use something like goharbor to get started with a private registry
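Once a private registry exists, mirroring an image into it typically comes down to a pull, a tag, and a push from a machine that can reach both registries. The sketch below prints those commands by default and only runs them when `DRY_RUN=0` is set. The function name `mirror_image` and the registry host `registry.example.internal` are hypothetical, and the image reference is the one that appears elsewhere in this thread.

```shell
# Mirror an image into a private registry with docker pull/tag/push.
# $1: private registry host (placeholder), $2: source image reference.
# By default the commands are only printed; set DRY_RUN=0 to run them.
mirror_image() (
  registry=$1
  src=$2
  dst="$registry/$src"
  for cmd in "docker pull $src" "docker tag $src $dst" "docker push $dst"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "$cmd"
    else
      # Intentionally unquoted so the command string splits into words.
      $cmd || exit 1
    fi
  done
)

# Example usage (placeholder registry host):
#   mirror_image registry.example.internal \
#     rancher/harvester-nvidia-driver-toolkit:v1.3-20240307
```

After the push, the image details in the addon would need to point at the mirrored reference, as described earlier in the thread.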
Thanks for that, we will look into it.
Interestingly, I whitelisted all the domains the system was trying to reach in our proxy and then configured the proxy in the Harvester UI, but still no luck. It should be able to get out now.
Are you able to SSH to all your nodes and just run:
docker pull rancher/harvester-nvidia-driver-toolkit:v1.3-20240307
If the nodes can pull this image then the addon should work.
No such luck