k8s-device-plugin
Support sharing GPUs
It would be useful to allow containers/pods to share GPUs (similar to a shared workstation) when desired.
I have a fork of this device plugin that implements the above functionality. One has to label nodes as granting either exclusive or shared access to their GPUs. For shared access, you must specify the number of replicas of each GPU to create. For example, if you have 4 physical GPUs on a node and want to allow each GPU to be allocated twice, you would set the replicas to 2 so there are effectively 8 GPUs for k8s to schedule.
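Roughly, the node-level opt-in looks something like this (the label keys below are placeholders for illustration, not the exact names used in the fork):

```yaml
# Hypothetical node labels marking this node's 4 physical GPUs as shared with
# 2 replicas each, so the scheduler sees 8 allocatable GPU devices.
apiVersion: v1
kind: Node
metadata:
  name: gpu-node-01
  labels:
    example.com/gpu-access: "shared"    # placeholder key: "shared" or "exclusive"
    example.com/gpu-replicas: "2"       # placeholder key: logical copies per physical GPU
```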
Is this something that would be of interest to this project as a pull request?
I think you'd want some quotas and maybe QoS to use this feature in a production environment. Otherwise it seems like a poorly behaved application could hog GPU memory and starve other pods; the affected pods would likely start up but fail to run to completion.
Without support from the graphics driver for sharing gracefully, I'm a bit worried that the potential limitations won't be immediately obvious to end users, and that the real use case is pretty narrow. (Maybe I'm wrong and you've dealt with quotas or have GPU memory pools?)
I'd love to take a look at your work or notes, or discuss it on the Kubernetes Slack.
It does not implement QoS or memory pools. It is more akin to how users use a shared server with GPUs, but it does limit the degree of oversubscription. So there is a risk that you will clobber someone else; however, that is unlikely when users only use the GPU sporadically (thanks to statistical multiplexing), for example when multiple pods share a GPU for model serving with infrequent requests. Besides, users opt in to shared GPUs by requesting nvidia.com/sharedgpu=1 (instead of nvidia.com/gpu, which would give them exclusive access).
When the number of replicas for the GPUs is set to 1, the approach is equivalent to what you have currently (i.e. exclusive access). Supporting both nvidia.com/gpu (exclusive) and nvidia.com/sharedgpu (shared) on the same cluster is what I am doing right now.
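To make the opt-in concrete, here is a minimal pod sketch requesting the shared resource name (pod and image names are placeholders):

```yaml
# Pod that opts in to a shared GPU; a pod wanting exclusive access would
# request nvidia.com/gpu instead, exactly as with the stock plugin.
apiVersion: v1
kind: Pod
metadata:
  name: model-server
spec:
  containers:
  - name: server
    image: registry.example.com/model-server:latest  # placeholder image
    resources:
      limits:
        nvidia.com/sharedgpu: 1
```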
Is this related to: https://github.com/awslabs/aws-virtual-gpu-device-plugin
The tech blog from Nvidia featuring MIG in Ampere mentioned:
a new resource type in Kubernetes via the NVIDIA Device Plugin
Could someone tell me what the new resource type is and where I can find the related code?
This is unrelated to the current issue (the current issue is about GPU sharing on pre-A100 GPUs).
However, we (NVIDIA) will be releasing details of MIG support on Kubernetes soon. We have a POC of K8s working with MIG, but we want to involve the community for feedback before settling on a final design. I will share some documents (as well as code for the POC) on this early next week.
Hi @klueska, any update on this?
We have a customized k8s scheduler extender for topology-aware GPU scheduling and are curious about the new interface/design for MIG on A100 in k8s. The current device plugin sets environment variables for the NVIDIA container runtime, which is what actually mounts the GPUs into containers, so I assume some changes will be needed to support MIG on k8s?
We were planning on waiting until the CUDA 11 release came out to share these documents (because nothing is actually runnable for MIG without CUDA 11). However, we decided to make them public early so that people can get a head start on looking at them and giving feedback.
Here they are:
- Supporting Multi-Instance GPUs (MIG) in the NVIDIA Container Toolkit
- Challenges Supporting Multi-Instance GPUs (MIG) in Kubernetes
- Supporting Multi-Instance GPUs (MIG) in Kubernetes (Proof of Concept)
- Steps to Enable MIG Support in Kubernetes (Proof of Concept)
Any and all feedback is welcome.
I finally got some time to finish reading @klueska's plan in the links above. The plan is to support only A100 GPUs with CUDA 11, exposing pre-determined fixed slices (memory and compute) of GPUs. The different-sized slices would be allocated as different resource types in k8s. This sounds like a good plan for fixed slices where memory and compute are not shared.
The modifications I made to the NVIDIA device plugin solve a slightly different problem. There are cases when you do not need guaranteed GPU availability. For example, when you are interactively experimenting in ML (think JupyterHub), you actually want access to all the GPU memory and all the compute resources, but you might only need them for a few seconds or minutes before you free them up (stop computing and free the memory). Another user on the same system can then kick off their test and all is good. We can even have multiple users running tests at the same time, so long as they do not exceed the total amount of memory on the GPU.
I think there are strong use cases for both, and in fact both solutions can co-exist. For example, you could have a parameter that creates replicas of the GPU slices to allow sharing at the slice level if desired.
A beta release of the plugin containing MIG support has now been released: https://github.com/NVIDIA/k8s-device-plugin/tree/v0.7.0-rc.1
As part of this, we added support for deploying the plugin via helm, including the ability to set the MIG strategy you wish to use in your deployment. Details about the various MIG strategies and how they work can be found at the link below:
Supporting Multi-Instance GPUs (MIG) in Kubernetes
A beta version of MIG support for gpu-feature-discovery should be available very soon.
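For orientation, a minimal helm values sketch for choosing a MIG strategy with that chart (see the linked document for what each strategy does; everything beyond migStrategy is omitted here):

```yaml
# values.yaml snippet for the v0.7.0-rc.1 chart: select one of the MIG
# strategies described in "Supporting Multi-Instance GPUs (MIG) in Kubernetes".
migStrategy: single   # one of: none, single, mixed
```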
@ktarplee
I agree that what you propose could be useful in a testing environment, or in an environment where users know exactly what they are getting into when requesting access to a sharedgpu. With the recent reorganization of the plugin and the ability to deploy it via helm charts, I am actually more open to adding specialized flags such as this now. Could you write up your proposal in more detail / point me at your existing fork where you have this implemented already?
@klueska The patch was approved for public release by the U.S. Air Force so we should be able to release it soon (attach it to this issue). Would you prefer I make a pull request targeting the branch (v0.7.0-rc.1) or master?
@klueska @ktarplee I don't know if you are aware of this, but I wanted to share something Alibaba has implemented, in case we go down this route:
https://www.alibabacloud.com/blog/gpu-sharing-scheduler-extender-now-supports-fine-grained-kubernetes-clusters_594926
https://github.com/AliyunContainerService/gpushare-device-plugin
https://github.com/AliyunContainerService/gpushare-scheduler-extender
Attached is the patch (to be applied to master) that was approved for public release. It was developed by ACT3.
@zvonkok Thanks for sharing the Alibaba approach. After reading the docs, it seems your approach requires extending the scheduler and a custom device plugin. Comparing your approach to the MIG approach, it seems yours only schedules and limits GPU memory but not CUDA cores. Is that correct? MIG partitions GPU memory and CUDA cores into chunks determined at deployment time.
@zvonkok Does your approach allow a GPU to be partitioned arbitrarily at scheduling time (not deployment time)? It appears that the ALIYUN_COM_GPU_MEM_POD env var (passed to the container) is used to set the GPU memory limit. I presume that is ignored by the standard NVIDIA runtime (via nvidia-docker2). Do you require a custom NVIDIA runtime for your GPU memory limit to be enforced?
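For comparison, my understanding of how a pod requests a memory slice with the Aliyun plugin (resource name and units are taken from their README as I read it, so please double-check against their docs):

```yaml
# Pod requesting a GPU memory slice via the Aliyun gpushare scheduler extender
# and device plugin; aliyun.com/gpu-mem is their extended resource, and the
# plugin passes the granted amount to the container via ALIYUN_COM_GPU_MEM_POD.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-mem-slice-demo
spec:
  containers:
  - name: app
    image: registry.example.com/cuda-app:latest  # placeholder image
    resources:
      limits:
        aliyun.com/gpu-mem: 3   # memory units per their docs, if I read them correctly
```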
Just FYI, I became aware of a similar approach to sharing GPUs (developed independently of my implementation).
The kind of GPU sharing @ktarplee describes above is what we are looking for as well. I just saw a talk at KubeCon Europe by Samed Güner from SAP about an attempt at this; it involved over-advertising the GPU as well. He also mentioned the following work in this area:
https://github.com/Deepomatic/shared-gpu-nvidia-k8s-device-plugin
https://github.com/tkestack/gpu-manager
https://github.com/NTHU-LSALAB/KubeShare
In our case we will not have the budget to buy A100s or newer (the only place MIG will be supported, I believe) and need a solution for older cards (like the Tesla T4s that we have).
We also don't have the budget for A100s or even any Tesla cards... we have a bunch of GTX and RTX cards that need to be shared on our cluster.
Hi, any update on this?
@M-A-N-I-S-H-K At ACT3 we have a private fork of this project that adds GPU sharing. We just updated it to also support MIG (by pulling in the upstream changes from this project). It can now share whole GPUs or parts of GPUs (MIG slices of a GPU) with up to a maximum number of pods (the replication factor). It can also rename the devices. For example, we use nvidia.com/gpu for whole GPUs and nvidia.com/sharedgpu for shared GPUs. With MIG you get resource names such as nvidia.com/mig-3g.20gb. In some cases you want the MIG name mapped to something else, such as nvidia.com/gpu-small, and you can then also map nodes with, say, K80s to the same extended resource, nvidia.com/gpu-small.
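As a concrete illustration of the renaming (resource names as described above; the mapping itself is configured in our fork, not in the upstream plugin):

```yaml
# Pod requesting the renamed resource; on an A100 node this could be backed by
# a MIG slice (e.g. nvidia.com/mig-3g.20gb) and on a K80 node by a whole GPU.
apiVersion: v1
kind: Pod
metadata:
  name: small-gpu-job
spec:
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest  # placeholder image
    resources:
      limits:
        nvidia.com/gpu-small: 1
```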
We intend to publicly release that code soon (it has already been approved for public release in patch form above, nvidia.diff). We are currently adding an allocation policy so that the least-used physical device is allocated to a new pod instead of a random one. This helps avoid sharing until it is actually necessary (i.e., more pods request GPUs than there are physical GPUs).
I should mention that I am happy to make a pull request from our work as well to get it back into this project.
Recently I also became aware of another possible way to share GPUs by literally replicating the devices with symlinks in the /dev directory. Here is the link. I have not tried this yet.
I had no luck with this tactic when using the official Nvidia k8s-device-plugin (https://github.com/NVIDIA/k8s-device-plugin). Obviously I had to modify the gpu-sharing-daemonset.yaml file to suit my bare-metal installation. I am able to see the 16 devices created on the GPU node but the k8s-device-plugin must have a different way of recognizing the GPUs because only one shows up in kubectl.
I just submitted pull request #239 to add GPU sharing. We have been using this approach without issues for 9+ months at my organization. We have some nodes that share GPUs (this does not require A100 GPUs) and others that only allow exclusive access to GPUs (for heavy GPU workloads).
I just submitted the GitLab MR for this feature.
Hi @ktarplee
It looks like NVIDIA will not take ownership of this code.
I was thinking maybe it could be reworked as an independent device plugin layered on top of the official one?
I.e., after installing the official k8s-device-plugin, we would then install a shared-device-plugin that is just a thin wrapper over NVIDIA's official code; pods would still use nvidia.com/sharedgpu, and the implementation would simply delegate to the official plugin.
Do you think it can be done?
@rjanovski a few weeks ago we realized that we could do exactly what you described. Essentially the kubernetes sidecar adapter pattern. We do plan on implementing this approach shortly. The benefits are:
- no modification to the Nvidia device plugin source code
- Can be agnostic to the underlying plugin that we are adapting, allowing sharing and renaming of AMD, Intel, and NVIDIA GPUs. We might also be able to share non-GPU resources such as NICs or FPGAs.
Cool! So a sidecar was a better fit than the device-plugin framework? Simpler to use? A generic solution seems fine, although I'd settle for just NVIDIA GPUs first :) Let me know when it's ready, maybe I can help test.
@ktarplee @rjanovski - probably a naive/beginner question. What if I want each GPU to count as two? I don't care about scheduling/load balancing etc. I know what I'm doing (CUDA/GPU wise), and just want the plugin to see each GPU as two GPUs. Is there a simple solution for that? I was under the impression that I could just play with the .go files here and manage to do this, however I still couldn't get two pods to run on the same GPU. Any ideas/suggestions?
Thanks!
@eyalhir74 This vGPU plugin seemed the most promising in my testing; just make sure to configure it with virtual GPU memory as well to avoid memory issues. Scheduling, however, is not well supported: it may schedule 2 pods on a single GPU even if vacant GPUs are available. To overcome this you may need to also provide a custom scheduler to Kubernetes (see the Aliyun solution for an example).
@eyalhir74 This does exactly what you describe: it lets Kubernetes assign a GPU to two pods, and it actually tries to schedule pods on different GPUs when possible. If you ask for two shared GPUs, you will get two physical GPUs if possible. We are working on an improvement to this approach that does not require any modifications to the device plugin, since NVIDIA does not want to accept the merge request.
We have provided customers with GPU sharing in production through our self-developed, open-source GPU framework Nano GPU, which offers shared scheduling and allocation of GPU cards under Kubernetes. However, it requires installing an additional scheduler extender and device plugin. BTW, it is difficult to implement container-level GPU sharing while relying entirely on native nvidia-docker.