Use MPS on Kubernetes
I'm trying to use the MPS service on Kubernetes with nvidia-docker:
Docker version 19.03.13
NVIDIA driver 495.44
CUDA 11.5
image: NGC tensorflow:21.11
I have started nvidia-cuda-mps-control on the host machine, and both hostIPC and hostPID are set when nvidia-docker starts up.
The process in the container can now find the nvidia-cuda-mps-control process, but the per-client memory limit does not take effect, no matter whether I use
$ export CUDA_MPS_PINNED_DEVICE_MEM_LIMIT="0=1G,1=512MB"
or set_default_device_pinned_mem_limit.
How can I make MPS work correctly across multiple containers?
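For reference, the two ways of limiting per-client pinned device memory mentioned above can be sketched as follows (device indices and sizes are illustrative; note the plain ASCII quotes, since smart quotes in the exported value would be passed through verbatim and the limit silently ignored):

```shell
# Per-client limit via environment variable, set in the client's environment:
export CUDA_MPS_PINNED_DEVICE_MEM_LIMIT="0=1G,1=512MB"
echo "$CUDA_MPS_PINNED_DEVICE_MEM_LIMIT"

# Alternatively, a default limit set through the control daemon
# (requires CUDA >= 11.5; commented out here since it needs a running daemon):
# echo "set_default_device_pinned_mem_limit 0 1G" | nvidia-cuda-mps-control
```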
We do not officially support MPS in nvidia-docker or kubernetes. Some users have been able to get it to work in the past, but there is no supported way to do it at the moment.
That said, we do plan to add official support for MPS in the next few months, as part of an overall improved "GPU sharing initiative" that will unify the experience for GPU sharing through CUDA multiplexing, MPS, and / or MIG.
You can use this project for now: https://github.com/awslabs/aws-virtual-gpu-device-plugin
I added support for per client memory restrictions in my fork's README. Only works for CUDA >= 11.5 https://github.com/kuartis/kuartis-virtual-gpu-device-plugin
@klueska that would be great! Is there any ticket or other resource where we can follow roadmap/progress on this "GPU sharing initiative"?
@klueska is there any good news about this project? Looking forward to it ~ 😄
Is there any further progress with official support for MPS?
Same here, a much-needed feature. Any progress?
Any update on this thread? Is anyone using MPS in Kubernetes?
Strange that this has been ignored for so long...
This is something that is under active development. We don't have a concrete release date yet, but are targeting the first quarter of 2024.
2024Q1 would be great, even in a beta version.
We just released an RC for the next version of the k8s-device-plugin with support for MPS: https://github.com/NVIDIA/k8s-device-plugin/tree/v0.15.0-rc.1?tab=readme-ov-file#with-cuda-mps
We would appreciate people trying this out and giving any feedback you have before the final release in a few weeks.
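For anyone wanting to try it, the linked README describes enabling MPS by passing a sharing config to the Helm chart; a minimal sketch (release name, namespace, and replica count are illustrative):

```shell
# Plugin config sharing each GPU across 10 MPS clients (values illustrative).
cat <<'EOF' > /tmp/dp-mps-config.yaml
version: v1
sharing:
  mps:
    resources:
      - name: nvidia.com/gpu
        replicas: 10
EOF

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm install nvdp nvdp/nvidia-device-plugin \
  --version 0.15.0-rc.1 \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --set gfd.enabled=true \
  --set-file config.map.config=/tmp/dp-mps-config.yaml
```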
@klueska I see the following note in the link you sent: "Note: Sharing with MPS is currently not supported on devices with MIG enabled." Is it planned to be supported on GPUs that are not MIG-enabled (like the L40 and L40S)? If yes, will it land soon?
Hey, I'm trying to deploy the v0.15.0-rc.1 version, but I'm getting an error:
I0228 15:18:51.597975 31 main.go:279] Retrieving plugins.
I0228 15:18:51.598008 31 factory.go:104] Detected NVML platform: found NVML library
I0228 15:18:51.598047 31 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0228 15:18:51.619657 31 main.go:301] Failed to start plugin: error waiting for MPS daemon: error checking MPS daemon health: failed to send command to MPS daemon: exit status 1
I0228 15:18:51.619670 31 main.go:208] Failed to start one or more plugins. Retrying in 30s...
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "mpsRoot": "/run/nvidia/mps",
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {},
    "mps": {
      "renameByDefault": true,
      "resources": [
        {
          "name": "nvidia.com/gpu",
          "rename": "nvidia.com/gpu.shared",
          "devices": "all",
          "replicas": 10
        }
      ]
    }
  }
}
All pods are up and running:
gpu-feature-discovery-x59tw 2/2 Running 0 30m
gpu-operator-5bd8fb6df5-r2jrq 1/1 Running 0 103m
gpu-operator-node-feature-discovery-gc-78b479ccc6-kf8nk 1/1 Running 0 103m
gpu-operator-node-feature-discovery-master-569bfcd8bc-z6whl 1/1 Running 0 103m
gpu-operator-node-feature-discovery-worker-4bnwr 1/1 Running 0 103m
gpu-operator-node-feature-discovery-worker-d5cmh 1/1 Running 0 103m
gpu-operator-node-feature-discovery-worker-fc4vs 1/1 Running 0 103m
gpu-operator-node-feature-discovery-worker-ktsl9 1/1 Running 0 103m
gpu-operator-node-feature-discovery-worker-lm2gv 1/1 Running 0 103m
gpu-operator-node-feature-discovery-worker-mhmjv 1/1 Running 0 103m
gpu-operator-node-feature-discovery-worker-w4mgz 1/1 Running 0 103m
nvidia-container-toolkit-daemonset-qj5p6 1/1 Running 0 100m
nvidia-cuda-validator-slg25 0/1 Completed 0 95m
nvidia-dcgm-exporter-kr84r 1/1 Running 0 100m
nvidia-device-plugin-2svlb 2/2 Running 0 30m
nvidia-driver-daemonset-sznx9 1/1 Running 0 103m
nvidia-mig-manager-78hrq 1/1 Running 0 100m
nvidia-operator-validator-nhz56 1/1 Running 0 100m
GPU:
*-display
description: 3D controller
product: GA100 [A100 PCIe 40GB]
vendor: NVIDIA Corporation
physical id: 0
bus info: pci@0000:13:00.0
logical name: /dev/fb0
version: a1
width: 64 bits
clock: 33MHz
capabilities: pm bus_master cap_list fb
configuration: depth=32 driver=nvidia latency=248 mode=1280x800 visual=truecolor xres=1280 yres=800
resources: iomemory:1fe00-1fdff iomemory:1ff00-1feff irq:16 memory:fb000000-fbffffff memory:1fe000000000-1fefffffffff memory:1ff000000000-1ff001ffffff
I'm using the 535.154.05 driver deployed with the gpu-operator on Rocky Linux 8.9. Any idea what the root cause could be?
It seems you are running with the GPU operator. Support for MPS with the operator will be available in the next operator release.
If you want to test things out before then, you can disable deployment of the device plugin and GFD as part of the operator deployment, and instead install the v0.15.0-rc.1 device-plugin helm chart separately.
Thanks for the answer. I've already deployed these separately:
gpu-operator:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
metadata:
  name: gpu-operator
resources:
  - namespace.yaml
namespace: gpu-operator
helmCharts:
  - name: gpu-operator
    repo: https://nvidia.github.io/gpu-operator
    releaseName: gpu-operator
    namespace: gpu-operator
    valuesFile: values.yaml
    version: 23.9.1
values for operator:
driver:
  repository: my-repo.com/nvidia
  version: 535.154.05
  imagePullPolicy: Always
  imagePullSecrets:
    - image-pull-secret
  useOpenKernelModules: true
gfd:
  enabled: false
mig:
  strategy: "none"
operator:
  imagePullSecrets:
    - image-pull-secret
devicePlugin:
  enabled: false
device-plugin:
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
metadata:
  name: nvidia-device-plugin
namespace: gpu-operator
helmCharts:
  - name: nvidia-device-plugin
    repo: https://nvidia.github.io/k8s-device-plugin
    releaseName: nvidia-device-plugin
    namespace: gpu-operator
    valuesFile: values.yaml
    version: 0.15.0-rc.1
values for plugin:
config:
  default: default
  map:
    default: |-
      version: v1
      flags:
        migStrategy: none
      sharing:
        mps:
          renameByDefault: true
          resources:
            - name: nvidia.com/gpu
              replicas: 10
But I'll try this once more on a new node, since it was deployed on a node that already had "old" drivers installed by the operator.
OK, that should work then (it's the same way I have tested things locally). Note that there may be a few transient failures in the plugin while it waits for the MPS daemonset to start up (because it won't come online until GFD has applied a label indicating that it should be there).
Just wanted to inform you that I successfully configured and deployed the device plugin with MPS. I disabled NFD in the gpu-operator Helm chart and enabled it in the device-plugin installation. Additionally, I had to restart/delete the nvidia-device-plugin-gpu-feature-discovery pod. I assume the restart was necessary because I installed both Helm charts simultaneously, or maybe it would eventually have applied the labels like you mentioned. Thanks for the help, I'll keep you informed if any issues arise during the testing phase.
Hello. I can confirm that installing the device-plugin version 0.15.0-rc.1 alongside gpu-operator works with the following procedure.
- Install gpu-operator with nvdp and nfd
- upgrade to disable nvdp and nfd
- Install nvdp with gfd enabled
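The steps above can be sketched as follows (release names are illustrative, and the chart values are assumptions based on the default gpu-operator and device-plugin charts):

```shell
# 0. Add the chart repos referenced in this thread.
helm repo add gpu-operator https://nvidia.github.io/gpu-operator
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

# 1. Install gpu-operator with its bundled device plugin (nvdp) and NFD.
helm install gpu-operator gpu-operator/gpu-operator \
  -n gpu-operator --create-namespace

# 2. Upgrade to disable the bundled device plugin and NFD.
helm upgrade gpu-operator gpu-operator/gpu-operator -n gpu-operator \
  --set devicePlugin.enabled=false \
  --set nfd.enabled=false

# 3. Install the standalone device plugin with GFD enabled.
helm install nvdp nvdp/nvidia-device-plugin -n gpu-operator \
  --version 0.15.0-rc.1 \
  --set gfd.enabled=true
```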
One slight problem I'm facing is that it segfaults if a pod has /dev/shm mounted and tries to allocate GPU memory, as in the following example. Is there a workaround to avoid using /dev/shm for the MPS daemon communication?
apiVersion: v1
kind: Pod
metadata:
  name: testshm
spec:
  volumes:
    - emptyDir:
        medium: Memory
      name: shared-mem
  containers:
    - name: testshm
      image: nvidia/cuda:12.3.1-base-ubuntu20.04
      command: ["tail", "-f", "/dev/null"]
      volumeMounts:
        - mountPath: /dev/shm
          name: shared-mem
      resources:
        limits:
          nvidia.com/gpu: 1
Thanks!
@igorgad you do not need to manually mount /dev/shm in your pod spec. The device-plugin, as part of its AllocateResponse, will make sure all the entities required for MPS get included in the container. Can you verify your example pod works when you remove the shared-mem volumeMount?
To clarify: using MPS does require a /dev/shm to be set up, and this is used by the MPS Control Daemon for communication. The infrastructure added to the device plugin to support MPS automatically creates a tmpfs and mounts it at /dev/shm for any containers that require MPS. This means that the additional /dev/shm you are requesting overrides the /dev/shm that contains the information controlled by the MPS control daemon, causing the segfaults.
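In other words, the example pod in the earlier comment should work once the shared-mem volume and its mount are dropped, letting the plugin's injected /dev/shm through; a sketch:

```shell
# Same test pod as in the earlier example, minus the /dev/shm volumeMount
# that shadowed the tmpfs injected by the device plugin.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: testshm
spec:
  containers:
    - name: testshm
      image: nvidia/cuda:12.3.1-base-ubuntu20.04
      command: ["tail", "-f", "/dev/null"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
```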
Hey @cdesiniotis and @elezar, thanks for clarifying it.
I can confirm that it works properly without the shared-mem volume mounted on the pod. However, it's common to mount a memory-backed volume on /dev/shm to increase the amount of shared memory available to Python multiprocessing and PyTorch dataloaders. The tmpfs mounted at /dev/shm by the device plugin is 64MB, which is too small for many workloads.
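For what it's worth, the size of the injected tmpfs can be checked from inside a running container (sketch; requires a pod that requested an MPS-shared GPU):

```shell
# Inside a container using an MPS-shared GPU, inspect the tmpfs the
# device plugin mounted at /dev/shm; the default shows up as 64M.
df -h /dev/shm
```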
We have an issue to track making the shm size configurable. Would this be able to address your use case? What are typical values for the shared memory size?
Yep, sounds good. I generally set the shared memory size of the pod to the amount of memory requested by the pod, but I guess that's not feasible in the device-plugin context. Therefore, I would say 8GB should be enough for most workloads.
Any update on this one? Is the shm size configurable yet?
Future versions of MPS will not depend on /dev/shm at all, making the need to inject /dev/shm unnecessary. Until then (meaning on any existing driver) this issue will continue to persist.
This means that currently torch workloads are not runnable on a system with NVIDIA MPS deployed via the device plugin, since almost every torch workload needs more shared memory than the amount the chart currently injects.
@ettelr we have an action item to allow the size of the /dev/shm that is created to be specified as part of the deployment. Would this work for your use cases?
Yes, that should work; we will just use it instead of injecting the shm volume ourselves.