cannot unmarshal string into Go struct field PluginCommandLineFlags.flags.plugin.deviceListStrategy of type []string
I think the following configuration has an issue: the deviceListStrategy field is expected to be an array, but a string is provided, so the nvidia-device-plugin-ctr init container fails when it starts.
```
cat << EOF > /tmp/dp-example-config0.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
EOF
```
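If I change it to a list instead, e.g. (only the relevant plugin section is shown here):

```
flags:
  plugin:
    passDeviceSpecs: false
    deviceListStrategy:
      - envvar
    deviceIDStrategy: uuid
```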
then the other related issue appears in the gpu-feature-discovery-init init container: it requires the deviceListStrategy field to be a string, not an array, and fails with:
```
unable to load config: unable to finalize config: unable to parse config file: error parsing config file: unmarshal error: error unmarshaling JSON: while decoding JSON: json: cannot unmarshal array into Go struct field PluginCommandLineFlags.flags.plugin.deviceListStrategy of type string
```
Thanks @xuzimianxzm. The deviceListStrategy config option was updated to accept a list of strings late in the Device Plugin's v0.14.0 release cycle, and it seems that change was never propagated to gpu-feature-discovery. This explains the error you're seeing in your second comment.
It does also seem as if we didn't implement a custom unmarshaller for the deviceListStrategy when extending this in the device plugin.
cc @cdesiniotis
Update: I have reproduced the failure in a unit test here: https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/294 and we will work on getting a fix released.
As a workaround, could you specify the deviceListStrategy using the DEVICE_LIST_STRATEGY envvar instead?
@elezar what do you mean? I am facing the same issue. I am deploying it as a DaemonSet with Flux, not using Helm. Should I create a DEVICE_LIST_STRATEGY environment variable for the container, set its value to envvar, and remove deviceListStrategy: "envvar" from the ConfigMap?
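Something roughly like this, where the container name and image tag are just placeholders?

```
containers:
  - name: nvidia-device-plugin-ctr                     # placeholder container name
    image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0    # placeholder image tag
    env:
      - name: DEVICE_LIST_STRATEGY
        value: "envvar"
```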
@ndacic this is how I solved it for me:
```
nodeSelector:
  nvidia.com/gpu.present: "true"
config:
  map:
    default: |-
      version: v1
      flags:
        migStrategy: "none"
        failOnInitError: true
        nvidiaDriverRoot: "/"
        plugin:
          passDeviceSpecs: false
          deviceListStrategy:
            - envvar
          deviceIDStrategy: uuid
      sharing:
        timeSlicing:
          renameByDefault: false
          resources:
            - name: nvidia.com/gpu
              replicas: 10
```
This issue should be addressed in the v0.14.1 release.
@ndacic please let me know if bumping the version does not address your issue so that I can better document the workaround.
@elezar This is still a problem with version 0.14.3.
It fails with the official example:
```
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 10
```
but it works with the example given above. Thanks @alekc
@elezar
I am using version 0.15.0
I need to set replicas to 1 so that my workloads have full access to the GPU node's resources.
My config looks like this
```
version: v1
flags:
  migStrategy: none
sharing:
  mps:
    default_active_thread_percentage: 10
    resources:
      - name: nvidia.com/gpu
        replicas: 2
```
So my g4dn.2xlarge instance provides 40 SMs, but with replicas set to 2 each pod only gets 20 SMs.
When I install version 0.15.0 I get the following error.
Could you please suggest how and where I can configure the replica count as 1 so that I do not get this error?
It's not clear what you hope to accomplish by enabling MPS but setting its replicas to 1. If we allowed you to set replicas to 1, then you would get an MPS server started for the GPU, but only be able to connect 1 workload/pod to it (i.e. no sharing would be possible).
Can you please elaborate on exactly what your expectations are for using MPS? It sounds like maybe time-slicing is more what you are looking for. Either that, or (as I suggested before), maybe you want a way to limit the memory of each workload, but allow them all to share the same compute.
Please clarify what your expectations are. Just saying you want a way to "set replicas to 1" doesn't tell us anything, because that is a disallowed configuration for the reason mentioned above.
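For reference, a time-slicing config that shares the GPU across 2 pods looks something like this (same format as the examples earlier in this thread; the replica count here is just an example):

```
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 2
```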
@klueska
I provisioned an EKS-optimized GPU node (g4dn.2xlarge) with 1 GPU, configured as follows:
In order to have my workloads/pods scheduled on it, I installed the DaemonSet via Helm:
```
helm upgrade -i nvdp nvdp/nvidia-device-plugin --version=0.15.0 --namespace kube-system -f values.yaml
```
Output:
Logs:
My config file looks like this; I have added it to the values.yaml file to enable MPS sharing so that multiple workloads can be scheduled on the GPU node:
```
map:
  default: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      mps:
        resources:
          - name: nvidia.com/gpu
            replicas: 2
```
Issue:
When I set replicas: 2 I get the following output (from one of the pods scheduled on the GPU node).
In the above output the multiprocessor count is 20; however, I need a multiprocessor count of 40 so that the workloads can perform efficiently. With 20 they run slowly.
My expectation:
If I could set replicas: 1, the multiprocessor count would become 40 and the workloads could do their processing efficiently.
I followed this doc and came to this expectation:
Ref: https://github.com/NVIDIA/k8s-device-plugin/tree/release-0.15
If you set replicas = 1 this is the same as no sharing since you will only expose a single slice that is the same as the entire GPU.
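In other words, if full access to the GPU is what you want, you can simply leave the sharing section out of the config. A minimal sketch of such a config:

```
version: v1
flags:
  migStrategy: none
```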
@elezar @klueska
Although things didn't work from the Helm configuration, I was able to figure out a solution.
I set the value of CUDA_MPS_ACTIVE_THREAD_PERCENTAGE to 100 so that the full GPU is accessible to all the pods, and it is working as expected.
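Roughly like this on the workload container (the container name and image below are placeholders, and whether an explicit env entry takes precedence over the value injected by the device plugin may depend on your setup):

```
containers:
  - name: gpu-workload          # placeholder name
    image: my-cuda-app:latest   # placeholder image
    env:
      - name: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
        value: "100"            # give this pod access to all of the GPU's SMs
    resources:
      limits:
        nvidia.com/gpu: 1
```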
Thanks
@elezar I am stuck with another issue where I am not able to get the GPU metrics.
time="2024-05-22T05:10:27Z" level=info msg="Starting dcgm-exporter"
time="2024-05-22T05:10:27Z" level=info msg="DCGM successfully initialized!"
time="2024-05-22T05:10:27Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: Profiling is not supported for this group of GPUs or GPU"
time="2024-05-22T05:10:27Z" level=info msg="Pipeline starting"
time="2024-05-22T05:10:27Z" level=info msg="Starting webserver"```
Am I missing something?
```
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        nvidia.com/accelerator: gpu
        # nvidia.com/lama: gpu
      # nodeSelector:
      #   nvidia.com/accelerator: gpu
      #   nvidia.com/lama: gpu
      # affinity:
      #   nodeAffinity:
      #     requiredDuringSchedulingIgnoredDuringExecution:
      #       nodeSelectorTerms:
      #         - matchExpressions:
      #             # On discrete-GPU based systems NFD adds the following label where 10de is the NVIDIA PCI vendor ID
      #             - key: nvidia.com/accelerator
      #               operator: In
      #               values:
      #                 - "gpu"
      #         - matchExpressions:
      #             # On some Tegra-based systems NFD detects the CPU vendor ID as NVIDIA
      #             - key: app
      #               operator: In
      #               values:
      #                 - "AI-GPU"
      #                 - "AI-GPU-LAMA"
      #         - matchExpressions:
      #             # We allow a GPU deployment to be forced by setting the following label to "true"
      #             - key: nvidia.com/lama
      #               operator: In
      #               values:
      #                 - "gpu"
      tolerations:
        - key: app
          value: AI-GPU
          effect: NoSchedule
          operator: Equal
        - key: nvidia/gpu
          operator: Exists
          effect: NoSchedule
        # - key: app
        #   value: AI-GPU-LAMA
        #   effect: NoSchedule
        #   operator: Equal
      ## the nodeSelector matters here: make sure it matches the labels on your GPU node
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu18.04
          ports:
            - containerPort: 9400
          securityContext:
            capabilities:
              add:
                - SYS_ADMIN
```
@PrakChandra looking at your issues here, they are not related to the original post. Could you please open new issues instead of extending this thread?
Sure. Thanks @elezar
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.