k8s-device-plugin
Failed to send command to MPS daemon
1. Quick Debug Information
- OS/Version (e.g. RHEL 8.6, Ubuntu 22.04): Ubuntu 22.04
- Kernel Version: 5.15.0-112-generic
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): containerd://1.6.12
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s
2. Issue or feature description
I'm struggling to understand how to enable MPS with the provided README. I'm using version 0.15.0 of the nvidia-device-plugin Helm chart (not the gpu-operator chart).
Am I supposed to do something else after enabling MPS via the ConfigMap? I've also tried enabling MPS directly on the relevant GPU worker node with nvidia-cuda-mps-control -d, but that made no difference. The control daemon starts and logs:
[2024-06-10 15:16:40.777 Control 111377] Starting control daemon using socket /tmp/nvidia-mps/control
[2024-06-10 15:16:40.777 Control 111377] To connect CUDA applications to this daemon, set env CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
Logs from the nvidia-device-plugin-ctr container in the nvidia-device-plugin pod:
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": true,
    "mpsRoot": "/run/nvidia/mps",
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {},
    "mps": {
      "failRequestsGreaterThanOne": true,
      "resources": [
        {
          "name": "nvidia.com/gpu",
          "devices": "all",
          "replicas": 20
        }
      ]
    }
  }
}
I0610 15:26:41.022164 39 main.go:279] Retrieving plugins.
I0610 15:26:41.022191 39 factory.go:104] Detected NVML platform: found NVML library
I0610 15:26:41.022226 39 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0610 15:26:41.076279 39 main.go:301] Failed to start plugin: error waiting for MPS daemon: error checking MPS daemon health: failed to send command to MPS daemon: exit status 1
I0610 15:26:41.076311 39 main.go:208] Failed to start one or more plugins. Retrying in 30s...
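One thing I noticed: the plugin config above has "mpsRoot": "/run/nvidia/mps", while the daemon I started manually listens on /tmp/nvidia-mps. A rough way I've been checking which pipe directory a daemon actually answers on (the /run/nvidia/mps subpath is just my guess from the config; these commands need to run on the GPU node):

```shell
# Query the manually started daemon (it logged /tmp/nvidia-mps as its pipe dir).
CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps \
  sh -c 'echo get_server_list | nvidia-cuda-mps-control'

# Query under the mpsRoot the device plugin is configured with; whether the
# plugin expects a daemon directly at this path or in a per-resource
# subdirectory is an assumption on my part.
CUDA_MPS_PIPE_DIRECTORY=/run/nvidia/mps \
  sh -c 'echo get_server_list | nvidia-cuda-mps-control'
```

The first command responds for me; I'm not sure what the plugin expects to find under /run/nvidia/mps.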
# values.yaml
nodeSelector: {nvidia.com/gpu: "true"}
gfd:
  enabled: true
  nameOverride: gpu-feature-discovery
  namespaceOverride: <NAMESPACE>
  nodeSelector: {nvidia.com/gpu: "true"}
nfd:
  master:
    nodeSelector: {nvidia.com/gpu: "true"}
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"
  worker:
    nodeSelector: {nvidia.com/gpu: "true"}
config:
  name: nvidia-device-plugin-config
# nvidia-device-plugin-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: <NAMESPACE>
data:
  config: |-
    version: v1
    sharing:
      mps:
        renameByDefault: false
        resources:
          - name: nvidia.com/gpu
            replicas: 20
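In case it matters, the kind of pod spec I expect to consume one of the 20 MPS replicas looks like this (the image is a placeholder, not what I'm actually running):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mps-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda-test
      image: <CUDA_WORKLOAD_IMAGE>
      resources:
        limits:
          nvidia.com/gpu: 1  # one of the 20 MPS replicas of the physical GPU
```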
Additional information that might help better understand your environment and reproduce the bug:
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.67 Driver Version: 550.67 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L40S Off | 00000000:BE:00.0 Off | 0 |
| N/A 67C P0 279W / 350W | 3809MiB / 46068MiB | 97% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+