Use MPS on Kubernetes
I'm trying to use the MPS service on Kubernetes with nvidia-docker:
Docker version 19.03.13
NVIDIA driver 495.44
CUDA 11.5
image: NGC tensorflow:21.11
I have started nvidia-cuda-mps-control on the host machine, and both hostIPC and hostPID are set when nvidia-docker starts up.
The process in the container can now find the nvidia-cuda-mps-control process, but the per-client memory limit does not take effect, no matter whether I use
$ export CUDA_MPS_PINNED_DEVICE_MEM_LIMIT="0=1G,1=512MB"
or set_default_device_pinned_mem_limit.
How can I make MPS work correctly across multiple containers?
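For reference, the two ways of limiting per-client pinned device memory mentioned above can be sketched as follows (device indices and sizes are illustrative; note the plain ASCII quotes, since smart quotes in the exported value would be passed through verbatim and the limit silently ignored):

```shell
# Per-client limit via environment variable, set in the client's environment:
export CUDA_MPS_PINNED_DEVICE_MEM_LIMIT="0=1G,1=512MB"
echo "$CUDA_MPS_PINNED_DEVICE_MEM_LIMIT"

# Alternatively, a default limit set through the control daemon
# (requires CUDA >= 11.5; commented out here since it needs a running daemon):
# echo "set_default_device_pinned_mem_limit 0 1G" | nvidia-cuda-mps-control
```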
We do not officially support MPS in nvidia-docker or kubernetes. Some users have been able to get it to work in the past, but there is no supported way to do it at the moment.
That said, we do plan to add official support for MPS in the next few months, as part of an overall improved "GPU sharing initiative" that will unify the experience for GPU sharing through CUDA multiplexing, MPS, and / or MIG.
You can use this project for now: https://github.com/awslabs/aws-virtual-gpu-device-plugin
I added support for per client memory restrictions in my fork's README. Only works for CUDA >= 11.5 https://github.com/kuartis/kuartis-virtual-gpu-device-plugin
@klueska that would be great! Is there any ticket or other resource where we can follow roadmap/progress on this "GPU sharing initiative"?
@klueska is there any good news about this project? Looking forward to it ~ 😄
Is there any further progress with official support for MPS?
Same here, a much-needed feature. Any progress?
Any update on this thread? Is anyone using MPS in Kubernetes?
Strange that this has been ignored for so long...
This is something that is under active development. We don't have a concrete release date yet, but are targeting the first quarter of 2024.
2024Q1 would be great, even in a beta version.
We just released an RC for the next version of the k8s-device-plugin with support for MPS: https://github.com/NVIDIA/k8s-device-plugin/tree/v0.15.0-rc.1?tab=readme-ov-file#with-cuda-mps
We would appreciate people trying this out and giving any feedback you have before the final release in a few weeks.
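For anyone wanting to try it, the linked README describes enabling MPS by passing a sharing config to the Helm chart; a minimal sketch (release name, namespace, and replica count are illustrative):

```shell
# Plugin config sharing each GPU across 10 MPS clients (values illustrative).
cat <<'EOF' > /tmp/dp-mps-config.yaml
version: v1
sharing:
  mps:
    resources:
      - name: nvidia.com/gpu
        replicas: 10
EOF

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm install nvdp nvdp/nvidia-device-plugin \
  --version 0.15.0-rc.1 \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set gfd.enabled=true \
  --set-file config.map.config=/tmp/dp-mps-config.yaml
```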
@klueska I see the following note in the link you sent: "Note: Sharing with MPS is currently not supported on devices with MIG enabled." Is it planned to be supported on GPUs that are not MIG-enabled (like the L40 and L40S)? If yes, will it land soon?
Hey, I'm trying to deploy the v0.15.0-rc.1 version, but I'm getting an error:
I0228 15:18:51.597975 31 main.go:279] Retrieving plugins.
I0228 15:18:51.598008 31 factory.go:104] Detected NVML platform: found NVML library
I0228 15:18:51.598047 31 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0228 15:18:51.619657 31 main.go:301] Failed to start plugin: error waiting for MPS daemon: error checking MPS daemon health: failed to send command to MPS daemon: exit status 1
I0228 15:18:51.619670 31 main.go:208] Failed to start one or more plugins. Retrying in 30s...
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "mpsRoot": "/run/nvidia/mps",
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {},
    "mps": {
      "renameByDefault": true,
      "resources": [
        {
          "name": "nvidia.com/gpu",
          "rename": "nvidia.com/gpu.shared",
          "devices": "all",
          "replicas": 10
        }
      ]
    }
  }
}
All pods are up and running:
gpu-feature-discovery-x59tw 2/2 Running 0 30m
gpu-operator-5bd8fb6df5-r2jrq 1/1 Running 0 103m
gpu-operator-node-feature-discovery-gc-78b479ccc6-kf8nk 1/1 Running 0 103m
gpu-operator-node-feature-discovery-master-569bfcd8bc-z6whl 1/1 Running 0 103m
gpu-operator-node-feature-discovery-worker-4bnwr 1/1 Running 0 103m
gpu-operator-node-feature-discovery-worker-d5cmh 1/1 Running 0 103m
gpu-operator-node-feature-discovery-worker-fc4vs 1/1 Running 0 103m
gpu-operator-node-feature-discovery-worker-ktsl9 1/1 Running 0 103m
gpu-operator-node-feature-discovery-worker-lm2gv 1/1 Running 0 103m
gpu-operator-node-feature-discovery-worker-mhmjv 1/1 Running 0 103m
gpu-operator-node-feature-discovery-worker-w4mgz 1/1 Running 0 103m
nvidia-container-toolkit-daemonset-qj5p6 1/1 Running 0 100m
nvidia-cuda-validator-slg25 0/1 Completed 0 95m
nvidia-dcgm-exporter-kr84r 1/1 Running 0 100m
nvidia-device-plugin-2svlb 2/2 Running 0 30m
nvidia-driver-daemonset-sznx9 1/1 Running 0 103m
nvidia-mig-manager-78hrq 1/1 Running 0 100m
nvidia-operator-validator-nhz56 1/1 Running 0 100m
GPU:
*-display
description: 3D controller
product: GA100 [A100 PCIe 40GB]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:13:00.0
logical name: /dev/fb0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm bus_master cap_list fb
configuration: depth=32 driver=nvidia latency=248 mode=1280x800 visual=truecolor xres=1280 yres=800
resources: iomemory:1fe00-1fdff iomemory:1ff00-1feff irq:16 memory:fb000000-fbffffff memory:1fe000000000-1fefffffffff memory:1ff000000000-1ff001ffffff
I'm using the 535.154.05 driver deployed with the gpu-operator on Rocky Linux 8.9. Any idea what the root cause could be?
It seems you are running with the GPU operator. Support for MPS with the operator will be available in the next operator release.
If you want to test things out before then, you can disable deployment of the device plugin and GFD as part of the operator deployment, and instead install the v0.15.0-rc.1 device-plugin helm chart separately.
Thanks for the answer. I've already deployed these separately:
gpu-operator:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
metadata:
  name: gpu-operator
resources:
  - namespace.yaml
namespace: gpu-operator
helmCharts:
  - name: gpu-operator
    repo: https://nvidia.github.io/gpu-operator
    releaseName: gpu-operator
    namespace: gpu-operator
    valuesFile: values.yaml
    version: 23.9.1
values for operator:
driver:
  repository: my-repo.com/nvidia
  version: 535.154.05
  imagePullPolicy: Always
  imagePullSecrets:
    - image-pull-secret
  useOpenKernelModules: true
gfd:
  enabled: false
mig:
  strategy: "none"
operator:
  imagePullSecrets:
    - image-pull-secret
devicePlugin:
  enabled: false
device-plugin:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
metadata:
  name: nvidia-device-plugin
namespace: gpu-operator
helmCharts:
  - name: nvidia-device-plugin
    repo: https://nvidia.github.io/k8s-device-plugin
    releaseName: nvidia-device-plugin
    namespace: gpu-operator
    valuesFile: values.yaml
    version: 0.15.0-rc.1
values for plugin:
config:
  default: default
  map:
    default: |-
      version: v1
      flags:
        migStrategy: none
      sharing:
        mps:
          renameByDefault: true
          resources:
            - name: nvidia.com/gpu
              replicas: 10
But I'll try this once more on a new node, since it was deployed on a node that already had "old" drivers installed by the operator.
OK, that should work then (it's the same way I have tested things locally). Note that there may be a few transient failures in the plugin while it waits for the MPS daemonset to start up (because it won't come online until GFD has applied a label indicating that it should be there).
Just wanted to inform you that I successfully configured and deployed the device plugin with MPS. I disabled NFD in the gpu-operator Helm chart and enabled it in the device-plugin installation. Additionally, I had to restart/delete the nvidia-device-plugin-gpu-feature-discovery pod. I assume the restart was necessary because I installed both Helm charts simultaneously, or maybe it would eventually have applied the labels like you mentioned. Thanks for the help, I'll keep you informed if any issues arise during the testing phase.
Hello. I can confirm that installing the device-plugin version 0.15.0-rc.1 alongside gpu-operator works with the following procedure.
- Install gpu-operator with nvdp and nfd
- upgrade to disable nvdp and nfd
- Install nvdp with gfd enabled
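The steps above can be sketched as follows (release names are illustrative, and the chart values are assumptions based on the default gpu-operator and device-plugin charts):

```shell
# 0. Add the chart repos referenced in this thread.
helm repo add gpu-operator https://nvidia.github.io/gpu-operator
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

# 1. Install gpu-operator with its bundled device plugin (nvdp) and NFD.
helm install gpu-operator gpu-operator/gpu-operator \
  -n gpu-operator --create-namespace

# 2. Upgrade to disable the bundled device plugin and NFD.
helm upgrade gpu-operator gpu-operator/gpu-operator -n gpu-operator \
  --set devicePlugin.enabled=false \
  --set nfd.enabled=false

# 3. Install the standalone device plugin with GFD enabled.
helm install nvdp nvdp/nvidia-device-plugin -n gpu-operator \
  --version 0.15.0-rc.1 \
  --set gfd.enabled=true
```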
One slight problem I'm facing is that it segfaults if a pod has /dev/shm mounted and tries to allocate GPU memory, as in the following example. Is there a workaround to avoid using /dev/shm for the MPS daemon communication?
apiVersion: v1
kind: Pod
metadata:
  name: testshm
spec:
  volumes:
    - emptyDir:
        medium: Memory
      name: shared-mem
  containers:
    - name: testshm
      image: nvidia/cuda:12.3.1-base-ubuntu20.04
      command: ["tail", "-f", "/dev/null"]
      volumeMounts:
        - mountPath: /dev/shm
          name: shared-mem
      resources:
        limits:
          nvidia.com/gpu: 1
Thanks!
@igorgad you do not need to manually mount /dev/shm in your pod spec. The device-plugin, as part of its AllocateResponse, will make sure all the entities required for MPS get included in the container. Can you verify your example pod works when you remove the shared-mem volumeMount?
To clarify: using MPS does require a /dev/shm to be set up, and this is used by the MPS Control Daemon for communication. The infrastructure added to the device plugin to support MPS automatically creates a tmpfs and mounts it at /dev/shm for any containers that require MPS. This means that the additional /dev/shm you are requesting overrides the /dev/shm that contains the information controlled by the MPS control daemon, causing the segfaults.
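In other words, the example pod in the earlier comment should work once the shared-mem volume and its mount are dropped, letting the plugin's injected /dev/shm through; a sketch:

```shell
# Same test pod as in the earlier example, minus the /dev/shm volumeMount
# that shadowed the tmpfs injected by the device plugin.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: testshm
spec:
  containers:
    - name: testshm
      image: nvidia/cuda:12.3.1-base-ubuntu20.04
      command: ["tail", "-f", "/dev/null"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
```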
Hey @cdesiniotis and @elezar, thanks for clarifying it.
I can confirm that it works properly without the shared-mem volume mounted on the pod. However, it's common to mount a memory-backed volume on /dev/shm to increase the amount of shared memory available to Python multiprocessing and PyTorch dataloaders. The tmpfs mounted at /dev/shm by the device plugin is 64MB, which is too small for many workloads.
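For what it's worth, the size of the injected tmpfs can be checked from inside a running container (sketch; requires a pod that requested an MPS-shared GPU):

```shell
# Inside a container using an MPS-shared GPU, inspect the tmpfs the
# device plugin mounted at /dev/shm; the default shows up as 64M.
df -h /dev/shm
```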
We have an issue to track making the shm size configurable. Would this be able to address your use case? What are typical values for the shared memory size?
Yep, sounds good. I generally set the shared memory size of the pod to the amount of memory requested by the pod, but I guess that's not feasible in the device-plugin context. Therefore, I would say 8GB should be enough for most workloads.
Any update on this one? Is the shm size configurable yet?
Future versions of MPS will not depend on /dev/shm at all, making the need to inject /dev/shm unnecessary. Until then (meaning on any existing driver) this issue will continue to persist.
This means that currently torch workloads are not runnable on a system with NVIDIA MPS deployed via the device plugin, since almost every torch workload needs more shared memory than the amount the chart currently injects.
@ettelr we have an action item to allow the size of the /dev/shm that is created to be specified as part of the deployment. Would this work for your use cases?
Yes, that should work; we will just use it instead of injecting the shm volume ourselves.