cannot unmarshal string into Go struct field PluginCommandLineFlags.flags.plugin.deviceListStrategy of type []string
I think the following configuration has an issue: the deviceListStrategy field is expected to be an array, but a string is provided, so the nvidia-device-plugin-ctr init container fails when it starts.
```
cat << EOF > /tmp/dp-example-config0.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
EOF
```
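If I change it to a list instead, e.g. (only the relevant plugin section is shown here):

```
flags:
  plugin:
    passDeviceSpecs: false
    deviceListStrategy:
      - envvar
    deviceIDStrategy: uuid
```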
then the other related issue appears in the gpu-feature-discovery-init init container: it requires the deviceListStrategy field to be a string, not an array, and fails with:
```
unable to load config: unable to finalize config: unable to parse config file: error parsing config file: unmarshal error: error unmarshaling JSON: while decoding JSON: json: cannot unmarshal array into Go struct field PluginCommandLineFlags.flags.plugin.deviceListStrategy of type string
```
Thanks @xuzimianxzm. The deviceListStrategy config option was updated to accept a list of strings late in the Device Plugin's v0.14.0 release cycle, and it seems that change was never propagated to gpu-feature-discovery. This explains the error you're seeing in your second comment.
It does also seem as if we didn't implement a custom unmarshaller for the deviceListStrategy when extending this in the device plugin.
cc @cdesiniotis
Update: I have reproduced the failure in a unit test here: https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/294 and we will work on getting a fix released.
As a workaround, could you specify the deviceListStrategy using the DEVICE_LIST_STRATEGY envvar instead?
@elezar what do you mean? I am facing the same issue. I am deploying it as a DaemonSet with Flux, not using Helm. Should I create a DEVICE_LIST_STRATEGY environment variable for the container, set its value to envvar, and remove deviceListStrategy: "envvar" from the ConfigMap?
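Something roughly like this, where the container name and image tag are just placeholders?

```
containers:
  - name: nvidia-device-plugin-ctr                     # placeholder container name
    image: nvcr.io/nvidia/k8s-device-plugin:v0.14.0    # placeholder image tag
    env:
      - name: DEVICE_LIST_STRATEGY
        value: "envvar"
```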
@ndacic this is how I solved it for me:
```
nodeSelector:
  nvidia.com/gpu.present: "true"
config:
  map:
    default: |-
      version: v1
      flags:
        migStrategy: "none"
        failOnInitError: true
        nvidiaDriverRoot: "/"
        plugin:
          passDeviceSpecs: false
          deviceListStrategy:
            - envvar
          deviceIDStrategy: uuid
      sharing:
        timeSlicing:
          renameByDefault: false
          resources:
            - name: nvidia.com/gpu
              replicas: 10
```
This issue should be addressed in the v0.14.1 release.
@ndacic please let me know if bumping the version does not address your issue so that I can better document the workaround.
@elezar This is still a problem with version 0.14.3.
It fails with the official example:
```
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 10
```
but it works with the example given above. Thanks @alekc
@elezar
I am using version 0.15.0
I need to set replicas to 1 so that my workloads have full access to the GPU node's resources.
My config looks like this
```
version: v1
flags:
  migStrategy: none
sharing:
  mps:
    default_active_thread_percentage: 10
    resources:
      - name: nvidia.com/gpu
        replicas: 2
```
So my g4dn.2xlarge instance provides 40 SMs, but with replicas set to 2 each pod only gets 20 SMs.
When I install version 0.15.0 I get the following error.
Could you please suggest how and where I can configure the replica count as 1 so that I do not get this error?
It's not clear what you hope to accomplish by enabling MPS but setting its replicas to 1. If we allowed you to set replicas to 1, then you would get an MPS server started for the GPU, but only be able to connect 1 workload/pod to it (i.e. no sharing would be possible).
Can you please elaborate on exactly what your expectations are for using MPS? It sounds like maybe time-slicing is more what you are looking for. Either that, or (as I suggested before), maybe you want a way to limit the memory of each workload, but allow them all to share the same compute.
Please clarify what your expectations are. Just saying you want a way to "set replicas to 1" doesn't tell us anything, because that is a disallowed configuration for the reason mentioned above.
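For reference, a time-slicing config that shares the GPU across 2 pods looks something like this (same format as the examples earlier in this thread; the replica count here is just an example):

```
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 2
```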
@klueska
I provisioned an EKS-optimized GPU node (g4dn.2xlarge) with 1 GPU, configured as follows:
In order to have my workloads/pods scheduled on it, I installed the DaemonSet via Helm:
```
helm upgrade -i nvdp nvdp/nvidia-device-plugin --version=0.15.0 --namespace kube-system -f values.yaml
```
Output:
Logs:
My config file looks like this; I have added it to the values.yaml file to enable MPS sharing so that multiple workloads can be scheduled on the GPU node:
```
map:
  default: |-
    version: v1
    flags:
      migStrategy: none
    sharing:
      mps:
        resources:
          - name: nvidia.com/gpu
            replicas: 2
```
Issue:
When I set replicas: 2 I get the following output (from one of the pods scheduled on the GPU node).
In the above output the multiprocessor count is 20; however, I need a multiprocessor count of 40 so that the workloads can perform efficiently. With 20 they run slowly.
My expectation:
If I could set replicas: 1, the multiprocessor count would become 40 and the workloads could do their processing efficiently.
I followed this doc and came to this expectation:
Ref: https://github.com/NVIDIA/k8s-device-plugin/tree/release-0.15
If you set replicas = 1 this is the same as no sharing since you will only expose a single slice that is the same as the entire GPU.
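In other words, if full access to the GPU is what you want, you can simply leave the sharing section out of the config. A minimal sketch of such a config:

```
version: v1
flags:
  migStrategy: none
```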
@elezar @klueska
Although things didn't work from the Helm configuration, I was able to figure out a solution.
I set the value of CUDA_MPS_ACTIVE_THREAD_PERCENTAGE to 100 so that the full GPU is accessible to all the pods, and it is working as expected.
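Roughly like this on the workload container (the container name and image below are placeholders, and whether an explicit env entry takes precedence over the value injected by the device plugin may depend on your setup):

```
containers:
  - name: gpu-workload          # placeholder name
    image: my-cuda-app:latest   # placeholder image
    env:
      - name: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
        value: "100"            # give this pod access to all of the GPU's SMs
    resources:
      limits:
        nvidia.com/gpu: 1
```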
Thanks
@elezar I am stuck with another issue where I am not able to get the GPU metrics.
time="2024-05-22T05:10:27Z" level=info msg="Starting dcgm-exporter"
time="2024-05-22T05:10:27Z" level=info msg="DCGM successfully initialized!"
time="2024-05-22T05:10:27Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: Profiling is not supported for this group of GPUs or GPU"
time="2024-05-22T05:10:27Z" level=info msg="Pipeline starting"
time="2024-05-22T05:10:27Z" level=info msg="Starting webserver"```
Am I missing something?
```
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        nvidia.com/accelerator: gpu
        # nvidia.com/lama: gpu
      # nodeSelector:
      #   nvidia.com/accelerator: gpu
      #   nvidia.com/lama: gpu
      # affinity:
      #   nodeAffinity:
      #     requiredDuringSchedulingIgnoredDuringExecution:
      #       nodeSelectorTerms:
      #         - matchExpressions:
      #             # On discrete-GPU based systems NFD adds the following label where 10de is the NVIDIA PCI vendor ID
      #             - key: nvidia.com/accelerator
      #               operator: In
      #               values:
      #                 - "gpu"
      #         - matchExpressions:
      #             # On some Tegra-based systems NFD detects the CPU vendor ID as NVIDIA
      #             - key: app
      #               operator: In
      #               values:
      #                 - "AI-GPU"
      #                 - "AI-GPU-LAMA"
      #         - matchExpressions:
      #             # We allow a GPU deployment to be forced by setting the following label to "true"
      #             - key: nvidia.com/lama
      #               operator: In
      #               values:
      #                 - "gpu"
      tolerations:
        - key: app
          value: AI-GPU
          effect: NoSchedule
          operator: Equal
        - key: nvidia/gpu
          operator: Exists
          effect: NoSchedule
        # - key: app
        #   value: AI-GPU-LAMA
        #   effect: NoSchedule
        #   operator: Equal
      ## the nodeSelector matters here: make sure it matches the labels on your GPU node
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu18.04
          ports:
            - containerPort: 9400
          securityContext:
            capabilities:
              add:
                - SYS_ADMIN
```
@PrakChandra looking at your issues here, they are not related to the original post. Could you please open new issues instead of extending this thread?
Sure. Thanks @elezar
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.