NVIDIA A10 GPUs - are these drivers in the NVIDIA / k8s-device-plugin

Open jeffreydahan opened this issue 3 years ago • 9 comments

1. Issue or feature description

Hi, I work at Microsoft and we are getting ready to go live with the A10 VMs (https://docs.microsoft.com/en-us/azure/virtual-machines/nva10v5-series). Ahead of this go-live, I am trying to determine whether the A10 drivers are already included in https://github.com/NVIDIA/k8s-device-plugin for Kubernetes.

As part of the Azure Kubernetes Service deployments for GPU-enabled nodes, we use your plugin to enable the drivers/GPUs: https://docs.microsoft.com/en-us/azure/aks/gpu-cluster#manually-install-the-nvidia-device-plugin

Just not sure if A10 is included. I tried calling NVIDIA support and they said to open a case here. Thanks!

jeffreydahan avatar Jul 06 '22 12:07 jeffreydahan

@jeffreydahan the GPU Device Plugin does not install or manage the driver. The expectation is that the user install this directly on the host or that this is installed by something like the driver container.

Note that there are no device-specifics in the device plugin and as such if the correct driver is installed for a device (e.g. the A10) this should function in the same way as other NVIDIA devices.

Do you have any information as to which driver version is installed and how this is deployed?
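One way to answer the driver-version question is to run `nvidia-smi` on the GPU host itself and read back the device name and driver version. The following is only a minimal sketch: it assumes `nvidia-smi` is on the PATH of the machine where it runs, and the helper names `parse_gpu_query` and `query_gpus` are made up for illustration. If no driver is installed, `nvidia-smi` will normally be missing or fail, and the sketch then returns an empty list.

```python
import csv
import io
import subprocess

# Ask nvidia-smi for just the GPU name and the driver version, as CSV without a header row.
QUERY = ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"]


def parse_gpu_query(output):
    """Parse 'name, driver_version' CSV rows into a list of dictionaries."""
    rows = csv.reader(io.StringIO(output), skipinitialspace=True)
    return [
        {"name": row[0].strip(), "driver_version": row[1].strip()}
        for row in rows
        if len(row) >= 2  # skip blank or malformed lines
    ]


def query_gpus():
    """Run nvidia-smi on the current host; return [] if it is missing or fails."""
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_query(result.stdout)
```

Calling `query_gpus()` on the node (or from a privileged pod that can see the host driver) and getting an empty list would suggest the driver is not installed, which is a host-level problem rather than a device plugin problem.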

elezar avatar Jul 06 '22 12:07 elezar

It sounds like I will need to engage our engineering team who supports the VM images. I don't yet have access to the VM in question. Thanks!

jeffreydahan avatar Jul 06 '22 13:07 jeffreydahan

@jeffreydahan do you think this question is related to Azure/AKS issue #3104 (https://github.com/Azure/AKS/issues/3104)? When I created my cluster some time ago, I used the YAML manifest that enables the drivers from the link you posted.

How am I supposed to install the drivers directly when using AKS? Is this documented somewhere in the Azure documentation?

rogelioamancisidor avatar Aug 01 '22 18:08 rogelioamancisidor

Which vm sku are you trying to use? The new A10?

jeffreydahan avatar Aug 01 '22 18:08 jeffreydahan

Yes, the new A10. I have tried to add anything from 1/3 to 1 A10 GPU and always get the same error message.

rogelioamancisidor avatar Aug 01 '22 18:08 rogelioamancisidor

I am waiting for the product group to resolve this. They are updating the driver install process. I am hoping for a fix within a few weeks, but there is no commitment on the timeline.

jeffreydahan avatar Aug 01 '22 18:08 jeffreydahan

@jeffreydahan how will I know when the drivers are updated? There is no information on this in the AKS web pages, at least not that I have seen.

rogelioamancisidor avatar Aug 22 '22 11:08 rogelioamancisidor

I am watching this for now. As of last week it is still a work in progress. There may be an option to self-install within a week. I am getting the details on the DaemonSet to use to enable the self-install.

jeffreydahan avatar Aug 22 '22 11:08 jeffreydahan

Is this issue documented somewhere, so we can follow it? All the Azure documentation suggests that A10 GPUs are now available for AKS, but at least in our cluster we keep getting errors when we try to add a node pool with an A10 GPU.

UPDATE: I was finally able to add a node pool with the A10 GPU. Either the fix is ready, or creating the DaemonSet again solved my problem.

UPDATE2: The SKU that I thought was using an A10 GPU was running the code extremely slowly. So I connected to the VM and checked whether a GPU was available. There is no GPU (wandb also didn't recognize any GPU). Maybe this is the workaround that @jeffreydahan mentioned? Now that I can start an AKS instance, I could connect to the VM and self-install the drivers.
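A quick way to confirm from the cluster side whether any GPU is visible is to look at each node's allocatable `nvidia.com/gpu` resource, which is the resource name the NVIDIA device plugin advertises. The sketch below is an assumption-laden illustration, not an official check: it expects the JSON text produced by `kubectl get nodes -o json`, and the function name `gpus_per_node` and the node names in the usage example are hypothetical.

```python
import json

# Extended resource name advertised by the NVIDIA device plugin.
GPU_RESOURCE = "nvidia.com/gpu"


def gpus_per_node(node_list_json):
    """Map node name -> allocatable GPU count from `kubectl get nodes -o json` output."""
    doc = json.loads(node_list_json)
    counts = {}
    for node in doc.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # A node without the resource key is treated as having zero GPUs.
        counts[name] = int(allocatable.get(GPU_RESOURCE, "0"))
    return counts
```

For example, saving the output with `kubectl get nodes -o json > nodes.json` and passing the file contents to `gpus_per_node` gives one entry per node. A count of 0 on a node that is supposed to have an A10 is consistent with what is described above: the node pool exists, but no GPU is being exposed, which would point back to the driver not being installed on the host.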

rogelioamancisidor avatar Sep 01 '22 11:09 rogelioamancisidor