
Failed to start plugin: --mps-root to be specified ?

Open Charlie-L-K opened this issue 11 months ago • 5 comments

Hello @elezar, can you help me?

k8s: 1.26.9, docker: 26.1.4, CUDA: 12.4, GPU driver: 550.90.07, nvidia-device-plugin: 0.15.1

I deployed the k8s-device-plugin with the MPS configuration, but the logs of the nvidia-device-plugin pod indicated: “using MPS requires --mps-root to be specified”. I want to know what --mps-root means. When I read the nvidia-device-plugin installation documentation, I didn’t see --mps-root in the MPS configuration section. I deployed the nvidia-device-plugin using a DaemonSet.

Here are my config file and DaemonSet.

# kubectl logs -n kube-system nvidia-device-plugin-daemonset-1-d4kvt
I0120 07:09:15.601755       1 main.go:178] Starting FS watcher.
I0120 07:09:15.601847       1 main.go:185] Starting OS watcher.
I0120 07:09:15.602405       1 main.go:200] Starting Plugins.
I0120 07:09:15.602427       1 main.go:257] Loading configuration.
E0120 07:09:15.605081       1 main.go:132] error starting plugins: unable to load config: unable to validate flags: using MPS requires --mps-root to be specified
# config1.yaml
version: v1
sharing:
  mps:
    renameByDefault: true
    failRequestsGreaterThanOne: false
    resources:
      - name: nvidia.com/gpu
        replicas: 2
# nvidia-device-plugin-daemonset-1.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset-1
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      nodeSelector:
        kubernetes.io/hostname: "gpu-work8"
      priorityClassName: "system-node-critical"
      containers:
        - image: nvcr.io/nvidia/k8s-device-plugin:v0.15.1
          name: nvidia-device-plugin-ctr
          env:
            - name: FAIL_ON_INIT_ERROR
              value: "false"
            - name: CONFIG_FILE
              value: "/etc/nvidia/config.yaml"
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: device-plugin
              mountPath: /var/lib/kubelet/device-plugins
            - name: config-file
              mountPath: /etc/nvidia/config.yaml
              subPath: config1.yaml
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
        - name: config-file
          configMap:
            name: nvidia-device-plugin-config
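For reference, the Helm chart renders extra pieces for MPS that this hand-written manifest lacks, which is what trips the `--mps-root to be specified` validation. A hedged sketch of those pieces follows; the `/run/nvidia/mps` host path, the `/mps` mount path, and the `mps-root` volume name are assumptions based on the chart's defaults, not taken from this thread:

```yaml
# Sketch only (not the exact chart output): the MPS-related additions the
# Helm template makes to the plugin container. Paths and names are assumed.
spec:
  containers:
    - name: nvidia-device-plugin-ctr
      image: nvcr.io/nvidia/k8s-device-plugin:v0.15.1
      args:
        - --mps-root=/run/nvidia/mps   # satisfies the "--mps-root to be specified" check
      volumeMounts:
        - name: mps-root
          mountPath: /mps              # assumed in-container mount point
  volumes:
    - name: mps-root
      hostPath:
        path: /run/nvidia/mps          # assumed host directory for the MPS pipes/log
        type: DirectoryOrCreate
```

The plugin needs a host-backed directory for the MPS control daemon's pipes and shared state, which is why a bare DaemonSet without this volume fails validation.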

Charlie-L-K avatar Jan 20 '25 08:01 Charlie-L-K

What are your full Helm values?

chipzoller avatar Jan 22 '25 00:01 chipzoller

@chipzoller > What are your full Helm values?

Oh, I deployed the nvidia-device-plugin with a DaemonSet YAML; I didn't deploy it with Helm.

So the ConfigMap and DaemonSet files in the post above are all that I deployed.

Charlie-L-K avatar Feb 12 '25 07:02 Charlie-L-K

How are you specifying the sharing config if that's the case? Can you provide full reproduction steps?

chipzoller avatar Feb 12 '25 12:02 chipzoller

kubectl create configmap -n kube-system --from-file=config1.yaml
kubectl apply -f nvidia-device-plugin-daemonset-1.yaml

Yeah, just like that above.

Charlie-L-K avatar Feb 24 '25 01:02 Charlie-L-K

Looks like you're just hand-rolling your own DaemonSet manifest and aren't accounting for what the Helm template renders. You aren't including the initContainer needed to properly set up MPS. You should probably be using the Helm chart and not attempting to slice this up yourself.
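A hedged sketch of the Helm-based approach, assuming the chart's documented pattern of pointing `config.name` at a pre-created ConfigMap; the release name, repo URL, and value names are illustrative and should be checked against the plugin's README:

```shell
# Sketch: create a ConfigMap holding the MPS sharing config, then let the
# Helm chart render the DaemonSet (including the MPS init pieces) itself.
kubectl create configmap nvidia-device-plugin-config \
  -n kube-system --from-file=config1.yaml

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --version 0.15.1 \
  --set config.name=nvidia-device-plugin-config   # assumed value name
```

This way the chart, not a hand-copied manifest, decides which init containers, volumes, and flags the MPS mode needs.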

chipzoller avatar Feb 24 '25 17:02 chipzoller