Design for sharing GPU across multiple APIs
Description
Currently, an API can only be given whole GPUs. Add support for fractional values for the GPU resource. Here's an example:
# cortex.yaml
- name: <string> # API name (required)
  # ...
  compute:
    gpu: 300m
Motivation
Better GPU resource utilization across multiple APIs, which can reduce overall costs for the user. If an API runs only rarely and doesn't need much inference capacity, then giving it 100m of the GPU resource might be desirable - the current alternative is to give it an entire GPU, which can be expensive/wasteful.
Additional context
There are 2 ways to address this:
- At the driver level (i.e. a device plugin for the GPU). This is the preferred method; a sketch of what a fractional request against such a plugin could look like follows this list.
- At the pod level, by having a single pod per instance that handles the prediction requests of all API replicas of all APIs residing on that instance. This may incur significant performance penalties, and the single pod is also a single point of failure, so this may be undesirable.
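For the driver-level approach, the fractional request would be expressed through whatever extended resource the shared-GPU device plugin registers. As a minimal sketch, assuming a plugin like AliyunContainerService/gpushare-scheduler-extender (linked below), which exposes GPU memory as the schedulable resource - the resource name and units are taken from that project's README and are not something cortex supports today:

```yaml
# Hypothetical pod spec fragment: requesting a slice of a GPU by memory,
# via the aliyun.com/gpu-mem extended resource registered by the gpushare
# device plugin (units are GiB of GPU memory, per that project's README).
apiVersion: v1
kind: Pod
metadata:
  name: fractional-gpu-api          # placeholder name
spec:
  containers:
    - name: api
      image: example/predictor      # placeholder image
      resources:
        limits:
          aliyun.com/gpu-mem: 3     # ~3 GiB of one GPU instead of a whole device
```

Cortex would then need to translate a `gpu: 300m` value from cortex.yaml into whichever extended resource and units the chosen plugin uses.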
Open questions
- If a pod uses more GPU than requested, is there a way to evict it?
Useful links for the first approach (where the device plugin handles all of this):
- https://github.com/kubernetes/kubernetes/issues/52757
- https://github.com/AliyunContainerService/gpushare-scheduler-extender
- https://github.com/sakjain92/Fractional-GPUs
- https://github.com/Deepomatic/shared-gpu-nvidia-k8s-device-plugin
- https://github.com/tkestack/gpu-manager
What's the status of this, @RobertLucian? We could really use this for our workloads. Would love to help as well.
@creatorrr we've tabled this for now, because we didn't find a good, reliable solution and some of the projects that would support this appear to be stale.
From what I can recall, there were issues with memory isolation between different vGPUs. @miguelvr might be able to give more context as he did some research on one or two of these projects.
That being said, if you want to support this, you probably want to start with:
- https://github.com/cortexlabs/cortex/blob/master/CONTRIBUTING.md
- This is the Nvidia device plugin: https://github.com/cortexlabs/cortex/blob/master/manager/manifests/nvidia.yaml
- This is where we install it on a new cluster: https://github.com/cortexlabs/cortex/blob/master/manager/install.sh#L65-L68
- You might have to edit the labels on the instances: https://github.com/cortexlabs/cortex/blob/6ca17086cf7019d2ba8e233f3beb5ab6e1ab073f/manager/generate_eks.py#L119-L120
- Check all the "nvidia.com/gpu" resource requests/limits in https://github.com/cortexlabs/cortex/blob/be39ba721efd4f0cd5aafaa9efdb7b6922f5db0c/pkg/workloads/k8s.go.
- Make sure the taints/affinities are correctly applied (see the sketch after this list).
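To make those last two points concrete, here is a minimal sketch of the pieces a GPU workload's pod spec ties together. The label/taint keys and image are illustrative placeholders, not the exact values cortex generates in k8s.go:

```yaml
# Illustrative pod spec showing where whole-GPU requests, tolerations and
# node affinity come together; fractional GPU support would change the
# resources section and possibly the scheduling constraints.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-api-example              # placeholder name
spec:
  containers:
    - name: api
      image: example/predictor       # placeholder image
      resources:
        requests:
          nvidia.com/gpu: 1          # whole-GPU request today; this is what 300m-style values would replace
        limits:
          nvidia.com/gpu: 1
  tolerations:
    - key: nvidia.com/gpu            # common taint key on GPU nodes (illustrative)
      operator: Exists
      effect: NoSchedule
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: workload        # placeholder node label set by the GPU nodegroup
                operator: In
                values: ["true"]
```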
Thanks! I'll try to work on this over the weekend. In the meantime, sharing notes from my research:
- You're right, a lot of the projects mentioned in the description are stale.
- AliyunContainerService/gpushare-scheduler-extender requires replacing the nvidia-device-plugin, and I am not sure that's a good idea.
- Deepomatic/shared-gpu-nvidia-k8s-device-plugin is a very limited and hacky workaround.
- NVIDIA seems to be working on this, and there is even an open PR for partial support, but it doesn't look like it's landing anytime soon.
- Until then, I am looking at tkestack/gpu-manager and NTHU-LSALAB/KubeShare as possible candidates (a rough sketch of gpu-manager's resource model follows this list).
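For reference, this is roughly how a fractional request looks with tkestack/gpu-manager as I understand it from that project's README - GPU compute is split into 1/100 slices and memory into 256 MiB blocks. The resource names and unit sizes below come from there and should be double-checked:

```yaml
# Hypothetical container resources fragment for tkestack/gpu-manager:
# vcuda-core is in 1/100ths of a GPU, vcuda-memory in 256 MiB blocks
# (per that project's docs; not something cortex wires up today).
resources:
  limits:
    tencent.com/vcuda-core: 30     # ~30% of a GPU, roughly cortex's "gpu: 300m"
    tencent.com/vcuda-memory: 8    # 8 * 256 MiB = 2 GiB of GPU memory
```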
@creatorrr did you find a solution yet? Can you share anything? I'm very interested in this feature as well 👍