[BUG] Unable to enable the GPUs via the GUI

Open IASN-CCC opened this issue 9 months ago • 18 comments

Describe the bug

Unable to enable the GPUs installed in the server from the UI. Two GPUs are listed (2x NVIDIA L40), but selecting one and clicking Enable does nothing.

To Reproduce

Go to SR-IOV GPU Devices, select a listed GPU, and click the three dots to enable it. [image]

Expected behavior

The GPU should be enabled.

Support bundle

Please reach out to request one securely.

Environment

  • Harvester ISO version: 1.3.0
  • Underlying Infrastructure: DELL R760XA Baremetal with Dual NVIDIA L40 GPUs

IASN-CCC avatar May 13 '24 10:05 IASN-CCC

Could you please confirm that the nvidia driver addon is enabled?

ibrokethecloud avatar May 13 '24 11:05 ibrokethecloud

Hi, yes, I've enabled the driver addon and it's showing as working. [image]

If I look in the PCI device list I can also see the two GPUs, and I have not enabled them for passthrough.

IASN-CCC avatar May 13 '24 11:05 IASN-CCC

Are the pcidevices-controllers crashing? I had to increase the limits on them and was able to enable.

bathomas avatar May 13 '24 16:05 bathomas

Hi, I'm not sure how to check this. How did you increase the limits?

IASN-CCC avatar May 14 '24 01:05 IASN-CCC
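
For anyone wondering how to check for crashing controller pods, here is a minimal sketch. It assumes kubectl access to the cluster and that the pcidevices controller pods run in the harvester-system namespace with "pcidevices" in their names; both are assumptions, not confirmed in this thread. The helper name suspect_pods is made up for illustration and only filters the standard kubectl get pods table.

```shell
#!/bin/sh
# Reads `kubectl get pods` output on stdin and prints name, status and restart
# count for every pod whose name contains the given substring and which is
# either not Running or has restarted at least once.
suspect_pods() {
  awk -v pat="$1" '
    NR > 1 && index($1, pat) && ($3 != "Running" || $4 + 0 > 0) {
      print $1, $3, $4
    }'
}

# Usage (namespace is an assumption):
#   kubectl get pods -n harvester-system | suspect_pods pcidevices
```

A non-zero restart count or a status such as CrashLoopBackOff would suggest the controller is being killed, which matches the symptom described above. How the limits were raised is not stated in the thread.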

The nvidia-driver-toolkit needs the driver location, which is the HTTP endpoint where the NVIDIA KVM driver is hosted.

From that screenshot it looks like this has not been edited, so no real driver has been installed on the underlying hosts.

ibrokethecloud avatar May 14 '24 02:05 ibrokethecloud

Oh OK, that makes sense then.

The host is not currently connected to the internet (it sits behind a proxy, and I am trying to locate all the URLs to whitelist in our proxy).

Will internet access resolve this, or do I still need to find a location for the driver?

IASN-CCC avatar May 14 '24 02:05 IASN-CCC

The HTTP endpoint is supposed to be an internal HTTP server where you host the drivers. You will need access to the NVIDIA portal to download the NVIDIA KVM drivers. These are different from the open-source drivers.

Please refer to the docs: https://docs.harvesterhci.io/v1.3/advanced/addons/nvidiadrivertoolkit

ibrokethecloud avatar May 14 '24 02:05 ibrokethecloud

OK, I downloaded the latest KVM drivers, put them on an internal web server, and updated the addon with the URL, but still no luck. [image]

I tested that I can hit the URL from my PC and the file starts downloading. I can also ping the web server from the Harvester host.

IASN-CCC avatar May 14 '24 08:05 IASN-CCC
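
A download from a PC plus a ping from the host does not show that the Harvester node itself can fetch the file over HTTP. A minimal sketch of that check, assuming curl is available on the node and that the web server answers HEAD requests; the function name and the URL in the usage comment are hypothetical:

```shell
#!/bin/sh
# Returns 0 only if an HTTP HEAD request for the URL succeeds within 10 seconds.
# --fail turns HTTP errors (4xx/5xx) into a non-zero exit status.
driver_url_reachable() {
  curl --fail --silent --show-error --head --max-time 10 "$1" > /dev/null
}

# Usage, run on a Harvester node (hypothetical URL):
#   driver_url_reachable "http://fileserver.example.internal/nvidia-kvm-driver.run" && echo reachable
```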

Any chance I could have a support bundle to figure out what is going on? There should be messages in the nvidia driver toolkit container / pcidevices logs that would give some insight.

ibrokethecloud avatar May 14 '24 23:05 ibrokethecloud

Sure. What is the best way to provide it to you securely?

IASN-CCC avatar May 14 '24 23:05 IASN-CCC

please email the bundle to [email protected]

ibrokethecloud avatar May 15 '24 00:05 ibrokethecloud

Sent. Thank you

IASN-CCC avatar May 15 '24 00:05 IASN-CCC

The nvidia-driver-runtime image cannot be pulled by your nodes

nvidia-driver-runtime-5vvwn                             0/1     ImagePullBackOff   0              21h

This image is not shipped in the ISO and needs to be pushed to your private registry if your nodes do not have access to Docker Hub.

Once the image is available, please update the image details in the addon to point to your private registry.

This is mentioned in the docs: https://docs.harvesterhci.io/v1.3/advanced/addons/nvidiadrivertoolkit

ibrokethecloud avatar May 15 '24 06:05 ibrokethecloud

Oh OK, I didn't realise that. I thought I just had to download the driver and host it on a web server, which I have done.

Do I need to set up this private registry as well? Do I just deploy a SUSE MicroOS and set it up as a private registry?

Thanks

IASN-CCC avatar May 15 '24 08:05 IASN-CCC

The private registry is a container registry. I do not think MicroOS contains a registry of its own. You could use something like Harbor (goharbor) to get started with a private registry.

ibrokethecloud avatar May 15 '24 23:05 ibrokethecloud
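
Once a registry exists, the image still has to be copied into it from a machine that can reach Docker Hub. A minimal sketch, assuming docker is installed on that machine; the helper name mirror_plan and the registry host are hypothetical. It only prints the commands, so they can be reviewed before being run:

```shell
#!/bin/sh
# Prints the docker commands that would copy an image from Docker Hub into a
# private registry. Nothing is executed here; pipe the output to `sh` on a
# machine that has docker and can reach both registries.
mirror_plan() {
  src="$1"                    # source image, e.g. repo/name:tag
  registry="$2"               # private registry host (hypothetical)
  dst="${registry%/}/${src}"  # drop a trailing slash, then prefix the image
  printf 'docker pull %s\n' "$src"
  printf 'docker tag %s %s\n' "$src" "$dst"
  printf 'docker push %s\n' "$dst"
}

# Usage (registry host is a placeholder):
#   mirror_plan rancher/harvester-nvidia-driver-toolkit:v1.3-20240307 registry.example.internal | sh
```

The addon's image details would then need to point at the new reference, as described in the docs linked above.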

Thanks for that, we will look into that.

Interestingly, I whitelisted in our proxy all the domains the system was trying to reach and then configured the proxy in the Harvester UI, but still no luck. It should be able to get out now.

IASN-CCC avatar May 16 '24 03:05 IASN-CCC

Are you able to SSH to all your nodes and run docker pull rancher/harvester-nvidia-driver-toolkit:v1.3-20240307?

If the nodes can pull this image then the addon should work.

ibrokethecloud avatar May 16 '24 04:05 ibrokethecloud

No such luck. [image]

IASN-CCC avatar May 16 '24 08:05 IASN-CCC