Failed to start plugin: "using MPS requires --mps-root to be specified"?
Hello @elezar, can you help me?
k8s: 1.26.9, docker: 26.1.4, CUDA: 12.4, GPU driver: 550.90.07, nvidia-device-plugin: 0.15.1
I deployed the k8s-device-plugin with the MPS configuration, but the logs of the nvidia-device-plugin pod indicated: “using MPS requires --mps-root to be specified”. I want to know what --mps-root means. When I read the nvidia-device-plugin installation documentation, I didn’t see --mps-root in the MPS configuration section. I deployed the nvidia-device-plugin using a DaemonSet.
Here are my config and DaemonSet.
# kubectl logs -n kube-system nvidia-device-plugin-daemonset-1-d4kvt
I0120 07:09:15.601755 1 main.go:178] Starting FS watcher.
I0120 07:09:15.601847 1 main.go:185] Starting OS watcher.
I0120 07:09:15.602405 1 main.go:200] Starting Plugins.
I0120 07:09:15.602427 1 main.go:257] Loading configuration.
E0120 07:09:15.605081 1 main.go:132] error starting plugins: unable to load config: unable to validate flags: using MPS requires --mps-root to be specified
# config1.yaml
version: v1
sharing:
  mps:
    renameByDefault: true
    failRequestsGreaterThanOne: false
    resources:
      - name: nvidia.com/gpu
        replicas: 2
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset-1
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        kubernetes.io/hostname: "gpu-work8"
      priorityClassName: "system-node-critical"
      containers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.15.1
          name: nvidia-device-plugin-ctr
          env:
            - name: FAIL_ON_INIT_ERROR
              value: "false"
            - name: CONFIG_FILE
              value: "/etc/nvidia/config.yaml"
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: config-file
              mountPath: /etc/nvidia/config.yaml
              subPath: config1.yaml
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: config-file
          configMap:
            name: nvidia-device-plugin-config
What are your full Helm values?
@chipzoller > What are your full Helm values?
Oh, I deployed the nvidia-device-plugin with a DaemonSet YAML; I didn't deploy it with Helm.
So the ConfigMap and DaemonSet files in the post above are everything I deployed.
How are you specifying the sharing config if that's the case? Can you provide full reproduction steps?
kubectl create configmap nvidia-device-plugin-config -n kube-system --from-file=config1.yaml
kubectl apply -f nvidia-device-plugin-daemonset-1.yaml
Yeah, just like that above.
Looks like you're hand-rolling your own DaemonSet manifest without accounting for everything the Helm template renders. You aren't including the pieces needed to properly set up MPS, so the plugin has no MPS root to work with. You should be using the Helm chart rather than attempting to assemble this yourself.
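For reference, when `sharing.mps` is enabled the Helm chart gives the plugin an MPS root directory on the host and mounts it into the container. A hand-rolled manifest would need something along these lines; the paths and volume names below follow the chart's defaults for v0.15.x and should be treated as a sketch, not the exact rendered output (verify with `helm template`):

```yaml
# Sketch of the MPS-related pieces the Helm chart adds to the
# device-plugin container (default paths assumed; check the
# rendered chart output for v0.15.1 before copying).
env:
  - name: MPS_ROOT              # satisfies the --mps-root flag
    value: /run/nvidia/mps
volumeMounts:
  - name: mps-root
    mountPath: /mps
  - name: mps-shm
    mountPath: /dev/shm
volumes:
  - name: mps-root
    hostPath:
      path: /run/nvidia/mps
      type: DirectoryOrCreate
  - name: mps-shm
    hostPath:
      path: /run/nvidia/mps/shm
```

Note also that the v0.15.x chart deploys a separate MPS control daemon DaemonSet alongside the plugin, which a standalone manifest like yours would be missing as well.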