k8s-device-plugin
Using CUDA MPS to enable GPU sharing, the pod occupies all GPU memory.
I have already enabled GPU sharing using CUDA MPS, but when I deploy a pod with YAML, it still occupies all of the GPU memory. Is my way of requesting GPU resources wrong?
The way I request GPU resources is as follows:

resources:
  limits:
    nvidia.com/gpu: 1
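For reference, a minimal pod spec sketch along those lines (the pod name and image are illustrative assumptions, not taken from this issue):

apiVersion: v1
kind: Pod
metadata:
  name: mps-test                     # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: cuda-app
    image: nvcr.io/nvidia/cuda:12.3.2-base-ubuntu22.04   # example image, substitute your own
    command: ["nvidia-smi"]          # or any CUDA workload
    resources:
      limits:
        nvidia.com/gpu: 1            # one replica of an MPS-shared GPU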
How did you set up MPS?
I haven't set MPS in the YAML; I just requested GPU resources the same way as in time-slicing mode. How should I set it up? Thank you!
How did you set up MPS?
The settings to enable CUDA MPS are as follows:
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: "envvar"
    deviceIDStrategy: "uuid"
  gfd:
    oneshot: false
    noTimestamp: false
    outputFile: /etc/kubernetes/node-feature-discovery/features.d/gfd
    sleepInterval: 60s
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 10
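For context, the same settings can also be delivered as a ConfigMap that the Helm chart is pointed at; a sketch (the ConfigMap name and namespace are illustrative assumptions, not part of the original report):

apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config   # hypothetical name
  namespace: nvidia-device-plugin     # adjust to wherever the plugin runs
data:
  config.yaml: |
    version: v1
    sharing:
      mps:
        resources:
        - name: nvidia.com/gpu
          replicas: 10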
@ysz-github do you have an example application / podspec that you're using to confirm this?
Could you also please confirm your driver version? We are investigating an issue where setting the device memory limits by UUID is not having the desired effect.
I have the same issue using MPS with a CUDA process in Docker; the driver is 535.129.03 and the nvdp version is 0.15.0-rc.1.
There is a known issue with 0.15.0-rc.1 where memory limits were not correctly applied. This will be addressed in v0.15.0-rc.2 which we will release soon.
OK, understood. Thanks for your reply!
@aphrodite1028 @ysz-github we have just released https://github.com/NVIDIA/k8s-device-plugin/releases/tag/v0.15.0-rc.2 which should address this issue. Please let us know if you're still experiencing problems.
I found https://github.com/NVIDIA/k8s-device-plugin/blob/main/cmd/mps-control-daemon/mps/daemon.go#L77-L85 here.
If I do not set the CUDA_VISIBLE_DEVICES env var and start nvidia-cuda-mps-control -d and nvidia-cuda-mps-control myself, then setting the device memory limit fails and nvidia-cuda-mps-server is not found in the container. If I set it up again myself, ignoring the mps-control-daemon DaemonSet config, it works on the host machine, but the process in the container gets a segmentation fault.
How do I set the device memory limit for a client running in a container?
The driver version is 535.129.03 and the GPU is an RTX A6000.
Also, when I deploy with Helm in Kubernetes, I get an error like "linux mounts: path /run/nvidia/mps is mounted on /run but it is not a shared mount" when mountPropagation is set:
volumeMounts:
- mountPath: /mps
  mountPropagation: Bidirectional
  name: mps-root
@aphrodite1028 . You shouldn't need to do anything special in your user container. The system starts the MPS server for all GPUs on the machine and your client will be forced to make use of it.
These lines set the upper limit on the pinned device memory and thread percentage consumable by the client. https://github.com/NVIDIA/k8s-device-plugin/blob/main/cmd/mps-control-daemon/mps/daemon.go#L111-L122
You can manually adjust the pinned memory limit and thread percentage to something smaller than this using the envvars when you start your container (but you can't set them to something larger).
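As an illustration (a sketch only; the pod name, image, and limit values below are assumptions, not an official example), a pod that requests one MPS replica and tightens its own limits via those envvars might look like:

apiVersion: v1
kind: Pod
metadata:
  name: mps-client-example           # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: cuda-app
    image: nvcr.io/nvidia/cuda:12.3.2-base-ubuntu22.04   # example image
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1
    env:
    # Optional overrides: these can only tighten the limits the plugin injects, never raise them.
    - name: CUDA_MPS_PINNED_DEVICE_MEM_LIMIT
      value: "0=2G"                  # cap pinned device memory on device 0 at 2 GiB
    - name: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
      value: "5"                     # cap this client at 5% of the SMs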
thanks for your reply.
Does setting the MPS pinned device memory limit have a driver version requirement? Looking at man nvidia-cuda-mps-control
on driver 470, I could not find the set_default_device_pinned_mem_limit command.