
Support sharing GPUs

Open ktarplee opened this issue 4 years ago • 37 comments

It would be useful to allow containers/pods to share GPUs (similar to a shared workstation) when desired.

I have a fork of this device plugin that implements the above functionality. One has to label nodes as either exclusive access or shared access to the GPUs. For shared access you must specify the number of replicas of the GPUs to create. For example, if you have 4 physical GPUs on a node and want to allow each GPU to be allocated twice, one would set the replicas to 2 so there are effectively 8 GPUs for k8s to schedule.
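To make the arithmetic concrete, here is a minimal sketch of what the node would advertise in that case, assuming a shared resource name along the lines of the one described later in this thread (the actual name is determined by the fork's configuration):

```yaml
# Hypothetical node status fragment once the plugin advertises replicated devices
status:
  capacity:
    nvidia.com/sharedgpu: "8"    # 4 physical GPUs x 2 replicas each
  allocatable:
    nvidia.com/sharedgpu: "8"
```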

Is this something that would be of interest to this project as a pull request?

ktarplee avatar May 12 '20 16:05 ktarplee

I think you'd want some quotas and maybe QoS to use this feature in a production environment. Otherwise it seems like a poorly behaved application could hog GPU memory and starve other pods; those pods would likely start up but fail to run to completion.

Without support from the graphics driver for sharing gracefully, I'm a bit worried that the potential limitations won't be immediately obvious to end users, and that the real use case is pretty narrow. (Maybe I'm wrong and you've dealt with quotas or have GPU memory pools?)

I'd love to take a look at your work or notes, or discuss it on the Kubernetes Slack.

nvjmayo avatar May 12 '20 22:05 nvjmayo

It does not implement QoS or memory pools. It is more akin to how users share a server with GPUs, but it does limit the oversubscription. So there is a risk that you will clobber someone else; however, that is unlikely when users only use the GPU sporadically (due to statistical multiplexing), for example when multiple pods share a GPU for model serving with infrequent requests. Besides, users opt in to the shared GPUs by requesting nvidia.com/sharedgpu: 1 (instead of nvidia.com/gpu, where they would get exclusive access).

When the number of replicas for the GPUs is set to 1, the approach is equivalent to what you have currently (i.e. exclusive access). Supporting both nvidia.com/gpu (exclusive) and nvidia.com/sharedgpu (shared) on the same cluster is what I am doing right now.
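For illustration only, a minimal sketch of what requesting the two resource types might look like; the resource names follow the description above, and the images are placeholders:

```yaml
# Hypothetical pod opting in to a shared GPU (may be co-scheduled with other pods on the same physical GPU)
apiVersion: v1
kind: Pod
metadata:
  name: model-server
spec:
  containers:
  - name: server
    image: my-inference-image    # placeholder image
    resources:
      limits:
        nvidia.com/sharedgpu: 1  # shared access
---
# Hypothetical pod requesting exclusive access to a whole GPU
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  containers:
  - name: train
    image: my-training-image     # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1        # exclusive access
```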

ktarplee avatar May 13 '20 09:05 ktarplee

Is this related to: https://github.com/awslabs/aws-virtual-gpu-device-plugin

klueska avatar May 13 '20 16:05 klueska

The tech blog from Nvidia featuring MIG in Ampere mentioned:

a new resource type in Kubernetes via the NVIDIA Device Plugin

Could someone tell me what the new resource type is and where to find the related code?

zw0610 avatar May 15 '20 01:05 zw0610

Could someone tell me what the new resource type is and where to find the related code?

This is unrelated to the current issue (as the current issue relates to GPU sharing on pre-A100 GPUs).

However, we (NVIDIA) will be releasing details of MIG support on Kubernetes soon. We have a POC of K8s working with MIG, but we want to involve the community for feedback before settling on a final design. I will share some documents (as well as code for the POC) on this early next week.

klueska avatar May 15 '20 14:05 klueska

However, we (NVIDIA) will be releasing details of MIG support on Kubernetes soon. We have a POC of K8s working with MIG, but we want to involve the community for feedback before settling on a final design. I will share some documents (as well as code for the POC) on this early next week.

Hi @klueska, any update on this?

We have a customized k8s scheduler extender for topology-aware GPU scheduling and are curious about the new interface/design for k8s MIG on A100. The current device plugin sets environment variables for the NVIDIA container runtime, which is what actually mounts the GPU, so I assume some changes will be needed to support MIG on k8s?
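For context, a rough sketch of the handoff being described: the device plugin's Allocate response injects an environment variable that the NVIDIA container runtime reads to decide which devices to expose to the container (the device identifier below is a placeholder; how MIG devices would be identified is exactly the open question here):

```yaml
# Illustrative container environment as set via the plugin's Allocate response;
# the NVIDIA container runtime uses it to mount the selected device(s).
env:
- name: NVIDIA_VISIBLE_DEVICES
  value: "GPU-7a3d1c5e-0000-0000-0000-000000000000"  # placeholder GPU UUID
```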

abuccts avatar May 22 '20 05:05 abuccts

We were planning on waiting until the CUDA 11 release came out to share these documents (because nothing is actually runnable for MIG without CUDA 11). However, we decided to make them public early so that people can get a head start on looking at them and giving feedback.

Here they are:

  • Supporting Multi-Instance GPUs (MIG) in the NVIDIA Container Toolkit
  • Challenges Supporting Multi-Instance GPUs (MIG) in Kubernetes
  • Supporting Multi-Instance GPUs (MIG) in Kubernetes (Proof of Concept)
  • Steps to Enable MIG Support in Kubernetes (Proof of Concept)

Any and all feedback is welcome.

klueska avatar Jun 02 '20 15:06 klueska

I finally got some time to finish reading @klueska's plan in the links above. The plan is to support only A100 GPUs with CUDA 11, exposing pre-determined fixed slices (memory and compute) of GPUs. The different sized slices would be allocated as different resource types in k8s. This sounds like a good plan for fixed slices without sharing memory and compute.
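As a concrete illustration of "different sized slices as different resource types", a request might look like the sketch below (the resource name follows the naming used later in this thread and in the design documents; treat it as an example, not the final interface):

```yaml
# Sketch: requesting one specific MIG slice as its own extended resource type
resources:
  limits:
    nvidia.com/mig-3g.20gb: 1   # a 3-compute-slice / 20 GB instance of an A100
```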

The modifications I made to the nvidia device plugin solve a slightly different problem. There are cases when you do not need guaranteed GPU availability. For example, when you are interactively experimenting in ML (think Jupyter Hub) you actually want access to all the GPU memory and all the compute resources but you might only need it for a few seconds or a few minutes and then you free them up (stop computing and free the memory). Another user on the same system can kick off their test and all is good. We can even have multiple users running their tests at the same time so long as they do not exceed the total amount of memory on the GPU.

I think there are strong use cases for both and in fact both solutions can co-exist with each other. For example you can have a parameter to create replicas of the GPU slices to allow sharing at the GPU slice level if desired.

ktarplee avatar Jun 25 '20 09:06 ktarplee

A beta release of the plugin containing MIG support has now been released: https://github.com/NVIDIA/k8s-device-plugin/tree/v0.7.0-rc.1

As part of this, we added support for deploying the plugin via helm, including the ability to set the MIG strategy you wish to use in your deployment. Details about the various MIG strategies and how they work can be found in the document Supporting Multi-Instance GPUs (MIG) in Kubernetes (linked above).
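As a rough sketch of what selecting a strategy at deploy time might look like (the value name here is an assumption based on the chart at the time; check the chart's documented values before relying on it):

```yaml
# values.yaml fragment for the nvidia-device-plugin helm chart (sketch only)
migStrategy: "mixed"   # typically one of: none, single, mixed
```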

A beta version of MIG Support for gpu-feature-discovery should be available very soon.

klueska avatar Jun 25 '20 10:06 klueska

@ktarplee I agree that what you propose could be useful in a testing environment or in an environment where users know exactly what they are getting into when requesting access to a sharedgpu. With the recent reorganization of the plugin and the ability to deploy it via helm charts, I am actually more open to adding specialized flags such as this now. Could you write up your proposal in more detail / point me at your existing fork where you have this implemented already?

klueska avatar Jun 25 '20 10:06 klueska

@klueska The patch was approved for public release by the U.S. Air Force so we should be able to release it soon (attach it to this issue). Would you prefer I make a pull request targeting the branch (v0.7.0-rc.1) or master?

ktarplee avatar Jul 12 '20 01:07 ktarplee

@klueska @ktarplee I don't know if you are aware of this, but I wanted to share something Alibaba has implemented if we are going down this route:

https://www.alibabacloud.com/blog/gpu-sharing-scheduler-extender-now-supports-fine-grained-kubernetes-clusters_594926

https://github.com/AliyunContainerService/gpushare-device-plugin

https://github.com/AliyunContainerService/gpushare-scheduler-extender

zvonkok avatar Jul 21 '20 16:07 zvonkok

Attached is the patch (to be applied to master) that was approved for public release. It was developed by ACT3.

nvidia.diff

ktarplee avatar Jul 22 '20 11:07 ktarplee

@zvonkok Thanks for sharing the Alibaba approach. After reading the docs, it seems your approach requires extending the scheduler and a custom device plugin. Comparing your approach to the MIG approach, it seems yours only schedules and limits GPU memory but not GPU CUDA cores. Is that correct? MIG partitions GPU memory and CUDA cores into chunks determined at deployment time.

@zvonkok Does your approach allow a GPU to be partitioned arbitrarily at scheduling time (not deployment time)? It appears that the ALIYUN_COM_GPU_MEM_POD env var (passed to the container) is used to set the GPU memory limit. I presume that is ignored by the standard NVIDIA runtime (via nvidia-docker2). Do you require a custom NVIDIA runtime for your GPU memory limit to be enforced?
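For readers comparing the two approaches: the Alibaba scheme expresses requests in GPU memory rather than whole devices, so a container request looks roughly like the sketch below (resource name taken from the documentation linked above; the amount is illustrative):

```yaml
# Sketch of a container request under the Alibaba gpushare scheme; the scheduler
# extender places the pod on a GPU with enough free memory.
resources:
  limits:
    aliyun.com/gpu-mem: 3   # ~3 GiB of GPU memory (illustrative amount)
```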

ktarplee avatar Jul 22 '20 12:07 ktarplee

Just FYI, I just became aware of a similar approach to sharing GPUs (developed independently of my implementation).

ktarplee avatar Aug 18 '20 13:08 ktarplee

I finally got some time to finish reading @klueska's plan in the links above. The plan is to support only A100 GPUs with CUDA 11, exposing pre-determined fixed slices (memory and compute) of GPUs. The different sized slices would be allocated as different resource types in k8s. This sounds like a good plan for fixed slices without sharing memory and compute.

The modifications I made to the nvidia device plugin solve a slightly different problem. There are cases when you do not need guaranteed GPU availability. For example, when you are interactively experimenting in ML (think Jupyter Hub) you actually want access to all the GPU memory and all the compute resources but you might only need it for a few seconds or a few minutes and then you free them up (stop computing and free the memory). Another user on the same system can kick off their test and all is good. We can even have multiple users running their tests at the same time so long as they do not exceed the total amount of memory on the GPU.

I think there are strong use cases for both and in fact both solutions can co-exist with each other. For example you can have a parameter to create replicas of the GPU slices to allow sharing at the GPU slice level if desired.

This is what we are looking for as well. I just saw a talk at KubeCon Europe by Samed Güner from SAP about an attempt at this. It involved over-advertising the GPU as well. He also mentioned the following work in this area:

https://github.com/Deepomatic/shared-gpu-nvidia-k8s-device-plugin

https://github.com/tkestack/gpu-manager

https://github.com/NTHU-LSALAB/KubeShare

In our case we will not have the budget to buy A100s or newer (the only place MIG will be supported, I believe) and need a solution for older cards (like the Tesla T4 that we have).

RDarrylR avatar Aug 18 '20 16:08 RDarrylR

We also don't have the budget for A100s or even anything Tesla-class; we have a bunch of GTX and RTX cards that need to be shared on our cluster.

optimuspaul avatar Feb 03 '21 20:02 optimuspaul

Hi, any update on this?

pen-pal avatar Mar 03 '21 08:03 pen-pal

@M-A-N-I-S-H-K At ACT3 we have a private fork of this project that adds GPU sharing. We just updated it to also support MIG (by pulling in the upstream changes from this project). It can now share whole GPUs or parts of GPUs (MIG slices of a GPU) with up to a maximum number of pods (the replication factor). It can also rename the devices. For example, we use nvidia.com/gpu for whole GPUs and nvidia.com/sharedgpu for shared GPUs. In the case of MIG you get resource names such as nvidia.com/mig-3g.20gb. In some cases you want the MIG name to be mapped to something else, such as nvidia.com/gpu-small, and you can then also map your nodes with, say, K80s to the same extended resource, nvidia.com/gpu-small.
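For illustration, under that renaming scheme a workload could request the generic name and land on either a MIG slice of an A100 or a whole smaller GPU; the resource name below follows the example above and is specific to the fork's configuration:

```yaml
# Hypothetical pod that can be satisfied by a 3g.20gb MIG slice or by a whole K80,
# because both are mapped to the same extended resource name in the fork's config.
apiVersion: v1
kind: Pod
metadata:
  name: small-gpu-job
spec:
  containers:
  - name: work
    image: my-image              # placeholder image
    resources:
      limits:
        nvidia.com/gpu-small: 1
```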

We intend to publicly release that code soon (it has already been approved for public release in the patch form above, nvidia.diff). We are currently adding an allocation policy so that the least-used raw device is allocated to a new pod instead of just a random one. This will help prevent sharing until it is necessary (i.e., you have more pods requesting GPUs than you have actual GPUs).

I should mention that I am happy to make a pull request from our work as well to get it back into this project.

ktarplee avatar Mar 03 '21 11:03 ktarplee

Recently I also became aware of another possible way to share GPUs by literally replicating the devices with symlinks in the /dev directory. Here is the link. I have not tried this yet.

ktarplee avatar Mar 03 '21 11:03 ktarplee

Recently I also became aware of another possible way to share GPUs by literally replicating the devices with symlinks in the /dev directory. Here is the link. I have not tried this yet.

I had no luck with this tactic when using the official Nvidia k8s-device-plugin (https://github.com/NVIDIA/k8s-device-plugin). Obviously I had to modify the gpu-sharing-daemonset.yaml file to suit my bare-metal installation. I am able to see the 16 devices created on the GPU node but the k8s-device-plugin must have a different way of recognizing the GPUs because only one shows up in kubectl.

bryanjonas avatar Mar 26 '21 17:03 bryanjonas

I just submitted a pull request #239 to add GPU sharing. We have been using this approach without issues for 9+ months at my organization. We have some nodes that share GPUs (this does not require A100 GPUs) and others that only allow exclusive access to GPUs (for heavy GPU workloads).

ktarplee avatar Apr 09 '21 12:04 ktarplee

I just submitted the GitLab MR for this feature.

ktarplee avatar Apr 15 '21 12:04 ktarplee

Hi @ktarplee

It looks like NVIDIA will not take ownership of this code.

I was thinking maybe it can be revised as an independent device-plugin on top of the official plugin?

I.e., after installing the official k8s-device-plugin, we then install a shared-device-plugin that can be just a thin wrapper over NVIDIA's official code. Pods still use nvidia.com/sharedgpu, and the implementation will just use the official plugin.

Do you think it can be done?

rjanovski avatar Jul 26 '21 17:07 rjanovski

@rjanovski A few weeks ago we realized that we could do exactly what you described, essentially the Kubernetes sidecar/adapter pattern (a rough deployment sketch follows after the list below). We do plan on implementing this approach shortly. The benefits are:

  • No modification to the NVIDIA device plugin source code.
  • It can be agnostic to the underlying plugin we are adapting, allowing sharing and renaming of AMD, Intel, and NVIDIA GPUs. We might also be able to share non-GPU resources such as NICs or FPGAs.
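A very rough deployment-level sketch of that adapter idea; everything about the second plugin (its name, image, flags, and resource name) is hypothetical, and only the kubelet device-plugin socket path is standard:

```yaml
# Sketch: the official plugin keeps advertising nvidia.com/gpu, while a separate,
# hypothetical adapter plugin re-advertises the same devices as nvidia.com/sharedgpu
# with a replica count. Names, image, and flags below are placeholders.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: shared-gpu-adapter
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: shared-gpu-adapter
  template:
    metadata:
      labels:
        app: shared-gpu-adapter
    spec:
      containers:
      - name: adapter
        image: example.com/shared-gpu-adapter:dev     # hypothetical image
        args:
        - --resource-name=nvidia.com/sharedgpu        # hypothetical flag
        - --replicas=2                                # hypothetical flag
        volumeMounts:
        - name: device-plugin-sockets
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin-sockets
        hostPath:
          path: /var/lib/kubelet/device-plugins
```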

ktarplee avatar Jul 27 '21 13:07 ktarplee

Cool! So a sidecar was a better fit than the device-plugin framework? Simpler to use? A generic solution seems fine, although I'd settle for just NVIDIA GPUs first :) Let me know when it's ready, maybe I can help test.

rjanovski avatar Jul 27 '21 14:07 rjanovski

@ktarplee @rjanovski - probably a naive/beginner question. What if I want each GPU to count as two? I don't care about scheduling/load balancing etc. I know what I'm doing (CUDA/GPU wise), and just want the plugin to see each GPU as two GPUs. Is there a simple solution for that? I was under the impression that I could just play with the .go files here and manage to do this, but I still couldn't get two pods to run on the same GPU. Any ideas/suggestions?

Thanks!

eyalhir74 avatar Aug 19 '21 05:08 eyalhir74

@eyalhir74 This vGPU plugin seemed most promising in my testing; just make sure to configure it with virtual GPU memory as well to avoid memory issues. Scheduling, however, is not well supported: it may schedule 2 pods on a single GPU even if vacant GPUs are available. To overcome this you may need to also provide a custom scheduler to Kubernetes (see the Aliyun solution for an example).

rjanovski avatar Aug 19 '21 07:08 rjanovski

@eyalhir74 This does exactly what you said. It will let Kubernetes assign a GPU to two pods, and it will actually try to schedule pods on different GPUs if possible: if you ask for two shared GPUs, you will get two physical GPUs when available. We are working on an improvement to this approach that does not require any modifications to the device plugin, since NVIDIA does not want to accept the merge request.

ktarplee avatar Aug 19 '21 17:08 ktarplee

@ktarplee @rjanovski - probably a naive/beginner question. What if I want each GPU to count as two? I don't care about scheduling/load balancing etc. I know what I'm doing (CUDA/GPU wise), and just want the plugin to see each GPU as two GPUs. Is there a simple solution for that? I was under the impression that I could just play with the .go files here and manage to do this, but I still couldn't get two pods to run on the same GPU. Any ideas/suggestions?

Thanks!

We have provided customers with GPU sharing in production, realized through our self-developed open-source GPU framework, Nano GPU, which provides shared scheduling and allocation of GPU cards under Kubernetes. However, it requires installing an additional scheduler extender and device plugins. BTW, it is difficult to implement container-level GPU sharing fully while depending only on native nvidia-docker.

xiaoxubeii avatar Oct 10 '21 10:10 xiaoxubeii