k8s-device-plugin
Setting "failOnInitError" unexpectedly "works" with a small 2 node cluster.
/kind bug
I am using DeepOps version 20.10 to deploy a cluster. This uses Kubespray under the hood and is pretty standard as far as K8S installs go. After deploying K8S it runs a helm install of the device plugin with default values. I am seeing this fail (the helm install waits forever) when I have 3 CPU-only worker nodes and 2 GPU-only worker nodes. However, the install succeeds on a cluster with 1 CPU worker node and 1 GPU node.
I would expect both cases to fail and am unclear why the helm install completes on the smaller cluster. This is a minor issue, but it caused me some confusion as I was scaling my smaller POC cluster up to a larger one.
Is this expected behavior or some odd edge case? In both cases I see the proper error/warning message in the logs of the plugin Pods, but I do not get any helpful output from the helm command.
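For context, the behavior can be reproduced outside of DeepOps by installing the chart by hand. This is a minimal sketch, not the exact command DeepOps runs; the repo alias nvdp, the kube-system namespace, and the release name are assumptions:

# Add the device plugin chart repo (alias "nvdp" is just a convention)
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update

# With the chart default failOnInitError=true, the plugin exits on nodes without
# GPUs, the DaemonSet never becomes fully ready, and "helm ... --wait" can hang.
# Setting it to false lets the plugin log a warning and idle on CPU-only nodes.
helm upgrade --install nvdp nvdp/nvidia-device-plugin \
  --namespace kube-system \
  --set failOnInitError=false \
  --wait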
Below are the inventory files I used, and the issue is highly reproducible following the DeepOps guide.
Smaller cluster (helm install succeeds):
[all]
mgmt01 ansible_host=10.0.0.1
gpu01 ansible_host=10.0.2.1
[kube-master]
mgmt01
[etcd]
mgmt01
[kube-node]
mgmt01
gpu01
[k8s-cluster:children]
kube-master
kube-node
Larger cluster (helm install hangs):
[all]
mgmt01 ansible_host=10.0.0.1
mgmt02 ansible_host=10.0.0.2
mgmt03 ansible_host=10.0.0.3
gpu01 ansible_host=10.0.2.1
gpu02 ansible_host=10.0.2.2
[kube-master]
mgmt01
mgmt02
mgmt03
[etcd]
mgmt01
mgmt02
mgmt03
[kube-node]
mgmt01
mgmt02
mgmt03
gpu01
gpu02
[k8s-cluster:children]
kube-master
kube-node
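Since the actual error only shows up in the plugin Pods and not in the helm output, the quickest way to see what --wait is blocking on is to inspect the DaemonSet directly. The namespace and object names below (kube-system, nvdp-nvidia-device-plugin) are assumptions and may differ in a DeepOps deployment:

# Which nodes got a plugin Pod, and which Pods are crash-looping or not ready
kubectl -n kube-system get pods -o wide | grep nvidia-device-plugin

# Desired/ready counts for the DaemonSet; --wait blocks until these match
kubectl -n kube-system describe daemonset nvdp-nvidia-device-plugin

# The init error/warning the plugin prints on CPU-only nodes
kubectl -n kube-system logs daemonset/nvdp-nvidia-device-plugin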
I encountered a similar error, but in NVIDIA/gpu-feature-discovery. Does the failOnInitError fix need to be applied there too? My (lightly sanitized) inventory is as follows:
[all]
mgmt ansible_host=10.1.2.2
login ansible_host=10.1.2.3
dgx1 ansible_host=10.1.2.4
dgx2 ansible_host=10.1.2.5
[kube-master]
mgmt
[etcd]
mgmt
[kube-node]
dgx1
dgx2
[k8s-cluster:children]
kube-master
kube-node
This fails with the following:
fatal: [mgmt]: FAILED! => changed=true
cmd:
- /usr/local/bin/helm
- upgrade
- --install
- gpu-feature-discovery
- nvgfd/gpu-feature-discovery
- --version
- 0.2.0
- --set
- migStrategy=mixed
- --wait
delta: '0:05:00.764662'
end: '2020-10-16 13:47:31.844141'
invocation:
module_args:
_raw_params: /usr/local/bin/helm upgrade --install "gpu-feature-discovery" "nvgfd/gpu-feature-discovery" --version "0.2.0" --set "migStrategy=mixed" --wait
_uses_shell: false
argv: null
chdir: null
creates: null
executable: null
removes: null
stdin: null
stdin_add_newline: true
strip_empty_ends: true
warn: true
msg: non-zero return code
rc: 1
start: '2020-10-16 13:42:31.079479'
stderr: 'Error: UPGRADE FAILED: timed out waiting for the condition'
stderr_lines: <omitted>
stdout: ''
stdout_lines: <omitted>
And just for completeness, the mgmt node is just a plain old Ubuntu 18.04 VM with no GPU, not a real DGX.
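Regarding whether the same fix applies to gpu-feature-discovery: a quick way to check is to dump the chart's default values and look for the same knob. Whether chart version 0.2.0 actually exposes failOnInitError is an assumption here; the first command verifies that before the override is attempted:

# Inspect the chart defaults for a failOnInitError value
helm show values nvgfd/gpu-feature-discovery --version 0.2.0 | grep -i failoniniterror

# If the value exists, the same override used for the device plugin applies
helm upgrade --install gpu-feature-discovery nvgfd/gpu-feature-discovery \
  --version 0.2.0 \
  --set migStrategy=mixed \
  --set failOnInitError=false \
  --wait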
Closing as a stale issue. Please reopen if this is still causing problems.