NVIDIA A10 GPUs - are these drivers in the NVIDIA / k8s-device-plugin
1. Issue or feature description
Hi, I work at Microsoft and we are getting ready to go live with the A10 VMs (https://docs.microsoft.com/en-us/azure/virtual-machines/nva10v5-series). Ahead of this go-live, I am trying to determine if the drivers are already located in: https://github.com/NVIDIA/k8s-device-plugin for Kubernetes
As part of the Azure Kubernetes Service deployments for GPU-enabled nodes, we use your plugin to enable the drivers/GPUs. https://docs.microsoft.com/en-us/azure/aks/gpu-cluster#manually-install-the-nvidia-device-plugin
Just not sure if A10 is included. I tried calling NVIDIA support and they said to open a case here. Thanks!
@jeffreydahan the GPU Device Plugin does not install or manage the driver. The expectation is that the user install this directly on the host or that this is installed by something like the driver container.
Note that there are no device-specifics in the device plugin and as such if the correct driver is installed for a device (e.g. the A10) this should function in the same way as other NVIDIA devices.
Do you have any information as to which driver version is installed and how this is deployed?
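One way to answer the driver-version question on a node is to query `nvidia-smi` directly. The following is only a minimal sketch, assuming `nvidia-smi` is on the host's PATH whenever the driver is installed; the helper names `parse_gpu_lines` and `installed_gpus` are hypothetical and not part of the device plugin.

```python
from __future__ import annotations

import subprocess

# nvidia-smi query for GPU name and driver version, one CSV line per GPU.
QUERY = ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"]


def parse_gpu_lines(output: str) -> list[tuple[str, str]]:
    """Turn 'name, driver_version' CSV lines into (name, driver_version) pairs."""
    gpus = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last comma, since the version field contains none.
        name, _, version = line.rpartition(",")
        gpus.append((name.strip(), version.strip()))
    return gpus


def installed_gpus() -> list[tuple[str, str]]:
    """Run nvidia-smi; return [] if the tool is missing or exits with an error."""
    try:
        result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_lines(result.stdout)


if __name__ == "__main__":
    gpus = installed_gpus()
    if not gpus:
        print("No NVIDIA driver/GPU detected on this host")
    for name, version in gpus:
        print(f"{name}: driver {version}")
```

An empty result suggests the driver is not installed (or not working) on that host, in which case the device plugin would have nothing to advertise.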
It sounds like I will need to engage our engineering team who supports the VM images. I don't yet have access to the VM in question. Thanks!
From: Evan Lezar @.> Sent: Wednesday, July 6, 2022 3:59:23 PM To: NVIDIA/k8s-device-plugin @.> Cc: suburbancoder @.>; Mention @.> Subject: Re: [NVIDIA/k8s-device-plugin] NVIDIA A10 GPUs - are these drivers in the NVIDIA / k8s-device-plugin (Issue #317)
@jeffreydahan do you think this question is related to issue #3104 (https://github.com/Azure/AKS/issues/3104)? When I created my cluster some time ago, I used the YAML manifest that enables the drivers in the link you posted.
How am I going to install the drivers directly when using AKS? Is this documented somewhere in the Azure documentation?
Which VM SKU are you trying to use? The new A10?
From: Rogelio A Mancisidor @.> Sent: Monday, August 1, 2022 9:13:02 PM To: NVIDIA/k8s-device-plugin @.> Cc: suburbancoder @.>; Mention @.> Subject: Re: [NVIDIA/k8s-device-plugin] NVIDIA A10 GPUs - are these drivers in the NVIDIA / k8s-device-plugin (Issue #317)
Yes, the new A10. I have tried to add anything from 1/3 to 1 A10 GPU and always get the same error message.
I am waiting for the product group to resolve this. They are updating the driver install process. I am hoping for a few weeks, but there is no commitment on the timeline.
@jeffreydahan how do I know when the drivers are updated? There is no info on this in the AKS webpages, at least not that I have seen.
I am watching this for now. As of last week it is still a work in progress. There may be an option to self-install within a week. I am getting the details on the DaemonSet to use to enable the self-install.
Is this issue documented somewhere, so we can follow it? All Azure documentation suggests that A10 GPUs are now available for AKS. But at least in our cluster, we keep getting errors when we try to add a nodepool with an A10 GPU.
UPDATE: I finally could add a nodepool with the A10 GPU. Either the fix is ready, or creating the DaemonSet again solved my problem.
UPDATE2: The SKU that I thought was using an A10 GPU was running the code extremely slowly. So I connected to the VM and checked whether a GPU was available. There is no GPU (wandb also didn't recognize any GPU). Maybe this is the workaround that @jeffreydahan mentioned? Now that I can start an AKS instance, I could connect to the VM and self-install the drivers.
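The "is there actually a GPU?" check can also be made from the cluster side, since the NVIDIA device plugin advertises GPUs as the `nvidia.com/gpu` resource on each node. Below is a rough sketch, assuming `kubectl` is on the PATH and already pointed at the AKS cluster; the function names `gpus_per_node` and `fetch_nodes` are hypothetical.

```python
import json
import subprocess

# Resource name advertised by the NVIDIA device plugin on GPU nodes.
GPU_RESOURCE = "nvidia.com/gpu"


def gpus_per_node(node_list: dict) -> dict:
    """Map node name -> allocatable nvidia.com/gpu count (0 if not advertised)."""
    counts = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        counts[name] = int(allocatable.get(GPU_RESOURCE, "0"))
    return counts


def fetch_nodes() -> dict:
    """Read the node list as JSON via kubectl (assumed to be configured)."""
    out = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return json.loads(out)


if __name__ == "__main__":
    try:
        for node, count in gpus_per_node(fetch_nodes()).items():
            print(f"{node}: {count} x {GPU_RESOURCE}")
    except (FileNotFoundError, subprocess.CalledProcessError) as exc:
        print(f"Could not query the cluster: {exc}")
```

A node that reports 0 here is not exposing a GPU to Kubernetes, which would be consistent with the driver being missing on that node even though the nodepool was created.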