k8s-device-plugin
K3S - Failed to start plugin: error waiting for MPS daemon
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
- OS/Version: Ubuntu 20.04.6 LTS
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K3S Rancher
2. Issue or feature description
- I have installed and configured GPU operator v23.9.2
- I updated k8s-device-plugin to the v0.15.0 by editing the yaml template.
...
devicePlugin:
  enabled: true
  repository: nvcr.io/nvidia
  image: k8s-device-plugin
  version: "v0.15.0"
  imagePullPolicy: IfNotPresent
  env:
    - name: MPS_ROOT
      value: "/run/nvidia/mps"
...
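For reference, a quick way to check that the override actually took effect (a sketch, assuming the daemonset keeps the default name nvidia-device-plugin-daemonset used below):
# Should print nvcr.io/nvidia/k8s-device-plugin:v0.15.0
kubectl -n gpu-operator get ds nvidia-device-plugin-daemonset \
  -o jsonpath='{.spec.template.spec.containers[*].image}'
# Should list MPS_ROOT=/run/nvidia/mps among the variables
kubectl -n gpu-operator set env ds/nvidia-device-plugin-daemonset --list | grep MPS_ROOT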
I created a "sharing" config map, including MPS and timeslicing config, in order to switch from one to the other :
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-sharing-config
  namespace: gpu-operator
data:
  a6000-ts-6: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 6
  a6000-mps-4: |-
    version: v1
    sharing:
      mps:
        resources:
          - name: nvidia.com/gpu
            replicas: 8
There is absolutely no issue with time slicing:
kubectl label node mitcv01 nvidia.com/device-plugin.config=a6000-ts-4 --overwrite
kubectl rollout restart -n gpu-operator daemonset/nvidia-device-plugin-daemonset
# Logs :
# I0516 13:17:50.681437 30 main.go:279] Retrieving plugins.
# I0516 13:17:50.682823 30 factory.go:104] Detected NVML platform: found NVML library
# I0516 13:17:50.682909 30 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
# I0516 13:17:50.724323 30 server.go:216] Starting GRPC server for 'nvidia.com/gpu'
# I0516 13:17:50.725710 30 server.go:147] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
# I0516 13:17:50.729389 30 server.go:154] Registered device plugin for 'nvidia.com/gpu' with Kubelet
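As a sanity check, with the time-slicing config applied the node should advertise the replicated capacity (replicas multiplied by the number of physical GPUs); a minimal check, assuming a single physical GPU in mitcv01:
# With replicas: 6 and one physical GPU this should print 6
kubectl get node mitcv01 -o jsonpath='{.status.capacity.nvidia\.com/gpu}'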
But if I want to use MPS, I get this error:
kubectl label node mitcv01 nvidia.com/device-plugin.config=a6000-mps-4 --overwrite
kubectl rollout restart -n gpu-operator daemonset/nvidia-device-plugin-daemonset
# Logs :
# I0516 13:19:07.992446 31 main.go:279] Retrieving plugins.
# I0516 13:19:07.993340 31 factory.go:104] Detected NVML platform: found NVML library
# I0516 13:19:07.993402 31 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
# E0516 13:19:08.046087 31 main.go:301] Failed to start plugin: error waiting for MPS daemon: error checking MPS daemon health: failed to send command to MPS daemon: exit status 1
# I0516 13:19:08.046116 31 main.go:208] Failed to start one or more plugins. Retrying in 30s...
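The health check above fails when the plugin cannot talk to an MPS control daemon through the pipe directory under MPS_ROOT. Two quick checks that would have pointed at the root cause here (a sketch; the path assumes the /run/nvidia/mps root configured above):
# Is an MPS control daemon deployed anywhere in the cluster?
kubectl get ds -A | grep -i mps-control-daemon
# On the node: this directory is expected to be populated by the control daemon;
# if it is empty or missing, the plugin's health check fails with exit status 1.
ls -lR /run/nvidia/mps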
Can you help me figure out what I did wrong? Thanks,
3. Information to attach (optional if deemed irrelevant)
Common error checking:
- [ ] The output of nvidia-smi -a on your host
- [x] The k8s-device-plugin container logs
- [ ] The kubelet logs on the node (e.g. sudo journalctl -r -u kubelet)
Additional information that might help better understand your environment and reproduce the bug:
- [x] Docker version from docker version
  $ nvidia-docker --version
  Docker version 25.0.3, build 4debf41
- [x] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
- [x] NVIDIA container library version from nvidia-container-cli -V
nvidia-container-cli -V
cli-version: 1.15.0
lib-version: 1.15.0
build date: 2024-04-15T13:36+00:00
build revision: 6c8f1df7fd32cea3280cf2a2c6e931c9b3132465
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
Which driver version are you using?
Does the log of the mps-control-daemon-ctr show any additional output?
Also to clarify. Is the device plugin deployed using the GPU operator or using the standalone helm chart?
I'm using version 550 of the driver.
I don't have an mps-control-daemon-ctr, so maybe that is the problem!
Do you have a template to install it without Helm?
At the beginning I was using the plugin deployed by the GPU operator (v23.9.2), but I manually overrode the YAML to target k8s-device-plugin v0.15.0 instead of v0.14.
I installed the control daemon as an "extra"; it is now up and running. I used this template:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvdp-nvidia-device-plugin-mps-control-daemon
  namespace: gpu-operator
  labels:
    helm.sh/chart: nvidia-device-plugin-0.15.0
    app.kubernetes.io/name: nvidia-device-plugin
    app.kubernetes.io/instance: nvdp
    app.kubernetes.io/version: "0.15.0"
    app.kubernetes.io/managed-by: Helm
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: nvidia-device-plugin
      app.kubernetes.io/instance: nvdp
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        app.kubernetes.io/name: nvidia-device-plugin
        app.kubernetes.io/instance: nvdp
      annotations: {}
    spec:
      priorityClassName: system-node-critical
      securityContext: {}
      initContainers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
          name: mps-control-daemon-mounts
          command: [mps-control-daemon, mount-shm]
          securityContext:
            privileged: true
          volumeMounts:
            - name: mps-root
              mountPath: /mps
              mountPropagation: Bidirectional
      containers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.15.0
          imagePullPolicy: IfNotPresent
          name: mps-control-daemon-ctr
          command: [mps-control-daemon]
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
            - name: NVIDIA_MIG_MONITOR_DEVICES
              value: all
            - name: NVIDIA_VISIBLE_DEVICES
              value: all
            - name: NVIDIA_DRIVER_CAPABILITIES
              value: compute,utility
          securityContext:
            privileged: true
          volumeMounts:
            - name: mps-shm
              mountPath: /dev/shm
            - name: mps-root
              mountPath: /mps
      volumes:
        - name: mps-root
          hostPath:
            path: /run/nvidia/mps
            type: DirectoryOrCreate
        - name: mps-shm
          hostPath:
            path: /run/nvidia/mps/shm
      nodeSelector:
        # We only deploy this pod if the following sharing label is applied.
        nvidia.com/mps.capable: "true"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: feature.node.kubernetes.io/pci-10de.present
                    operator: In
                    values:
                      - "true"
              - matchExpressions:
                  - key: feature.node.kubernetes.io/cpu-model.vendor_id
                    operator: In
                    values:
                      - NVIDIA
              - matchExpressions:
                  - key: nvidia.com/gpu.present
                    operator: In
                    values:
                      - "true"
      tolerations:
        - key: CriticalAddonsOnly
          operator: Exists
        - effect: NoSchedule
          key: nvidia.com/gpu
          operator: Exists
I also labeled the node to indicate that MPS is enabled:
kubectl label node mitcv01 nvidia.com/mps.capable="true" --overwrite
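To confirm the daemonset was actually scheduled on the node after adding the label, and to see what the control daemon logs (a sketch, using the names from the template above):
kubectl -n gpu-operator get pods -l app.kubernetes.io/name=nvidia-device-plugin -o wide
kubectl -n gpu-operator logs ds/nvdp-nvidia-device-plugin-mps-control-daemon -c mps-control-daemon-ctr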
The daemon starts, but it says that a "strategy" is missing:
How can I set this strategy and make mps-control-daemon use the same ConfigMap as the device plugin?
kubectl patch clusterpolicy/cluster-policy \
-n gpu-operator --type merge \
-p '{"spec": {"devicePlugin": {"config": {"name": "nvidia-sharing-config"}}}}'
You need to supply the same ConfigMap / config name as for the device plugin. There is also a sidecar that ensures the config is kept up to date, in the same way the device plugin / GFD does.
Is there a reason you don't skip the installation of the device plugin in the operator and deploy it using Helm instead? See for example: https://docs.google.com/document/d/1H-ddA11laPQf_1olwXRjEDbzNihxprjPr74pZ4Vdf2M/edit#heading=h.9odbb6smrel8
Great document, thanks a lot!
The problem was a lack of knowledge about how to pass the configuration to the plugin. It works now, thanks to your very helpful document!
In the end I did:
helm install --dry-run gpu-operator --wait -n gpu-operator --create-namespace \
nvidia/gpu-operator --version v23.9.2 \
--set nfd.enabled=false \
--set devicePlugin.enabled=false \
--set gfd.enabled=false \
--set toolkit.enabled=false > nvidia-gpu-operator.yaml
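With those flags the operator no longer deploys NFD, the device plugin, GFD, or the container toolkit, so they will not conflict with the standalone chart; a quick check once the operator is running (a sketch):
# None of the nvidia-device-plugin / gpu-feature-discovery daemonsets should come
# from the operator any more; they are provided by the standalone nvdp chart below.
kubectl -n gpu-operator get ds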
Then, to install MPS:
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
--version=0.15.0 \
--namespace nvidia-device-plugin \
--create-namespace \
--set gfd.enabled=true \
--set config.default=nvidia-sharing \
--set-file config.map.nvidia-sharing=config/nvidia/config/dp-mps-6.yaml
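For completeness, config/nvidia/config/dp-mps-6.yaml is not shown in this thread; judging by its name and the earlier ConfigMap it presumably contains an MPS sharing config along these lines (hypothetical reconstruction):
# Hypothetical content of config/nvidia/config/dp-mps-6.yaml (replica count inferred from the filename)
cat > config/nvidia/config/dp-mps-6.yaml <<'EOF'
version: v1
sharing:
  mps:
    resources:
      - name: nvidia.com/gpu
        replicas: 6
EOF
# After the chart is up, the device plugin, GFD and MPS control daemon pods should all be Running
kubectl -n nvidia-device-plugin get pods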
Thanks again for your help.
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
This issue was automatically closed due to inactivity.