
Setting "failOnInitError" unexpectedly "works" with a small 2 node cluster.


/kind bug

I am using DeepOps version 20.10 to deploy a cluster. This uses Kubespray under the hood and is a fairly standard Kubernetes install. After deploying Kubernetes, it runs a helm install of the device plugin with the default values. I am seeing this fail (the helm install waits forever) when I have 3 CPU-only worker nodes and 2 GPU worker nodes. However, it succeeds when I install on a cluster with 1 CPU worker node and 1 GPU node.

I would expect both cases to fail, and I am unclear why the helm install completes on the smaller cluster. This is a minor issue, but it caused me some confusion as I scaled my smaller POC cluster up to a larger one.

Is this expected behavior or some odd edge case? In both cases I see the expected error/warning message in the logs of the plugin Pods, but I get no helpful output from the helm command.
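
For reference, checking the plugin pods directly is the only place I see anything useful; something like the following (the grep pattern and namespace depend on how the chart was actually installed, and the pod name is a placeholder):

# Find the plugin pods and see whether the ones on CPU-only nodes ever go Ready
kubectl get pods -A -o wide | grep nvidia-device-plugin
# Inspect a pod stuck on a CPU-only node; this is where the init error/warning shows up
kubectl -n <namespace> logs <pod-name>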

Below are the inventory files I used (the 2-node cluster that succeeds first, then the 5-node cluster that hangs); the issue is highly reproducible following the DeepOps guide.

# Inventory for the 2-node cluster
[all]
mgmt01     ansible_host=10.0.0.1
gpu01      ansible_host=10.0.2.1

[kube-master]
mgmt01

[etcd]
mgmt01

[kube-node]
mgmt01
gpu01

[k8s-cluster:children]
kube-master
kube-node

# Inventory for the 5-node cluster
[all]
mgmt01     ansible_host=10.0.0.1
mgmt02     ansible_host=10.0.0.2
mgmt03     ansible_host=10.0.0.3
gpu01      ansible_host=10.0.2.1
gpu02      ansible_host=10.0.2.2

[kube-master]
mgmt01
mgmt02
mgmt03

[etcd]
mgmt01
mgmt02
mgmt03

[kube-node]
mgmt01
mgmt02
mgmt03
gpu01
gpu02

[k8s-cluster:children]
kube-master
kube-node
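
For completeness, the override I would expect to need on the larger mixed CPU/GPU cluster looks roughly like this, assuming the chart exposes failOnInitError as a top-level value (the nvdp repo alias and release name below are just examples, not necessarily what DeepOps itself uses):

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade --install nvidia-device-plugin nvdp/nvidia-device-plugin \
  --set failOnInitError=false \
  --wait

With failOnInitError=false the plugin pods on the CPU-only nodes should stay Running instead of crash-looping, so the daemonset rollout (and therefore helm --wait) should be able to complete.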

supertetelman commented Oct 13 '20 03:10

I encountered a similar error, but with NVIDIA/gpu-feature-discovery. Does the failOnInitError fix need to be applied there too? My (lightly sanitized) inventory is as follows:

[all]
mgmt  ansible_host=10.1.2.2
login ansible_host=10.1.2.3
dgx1  ansible_host=10.1.2.4
dgx2  ansible_host=10.1.2.5

[kube-master]
mgmt

[etcd]
mgmt

[kube-node]
dgx1
dgx2

[k8s-cluster:children]
kube-master
kube-node

This fails with the following:

fatal: [mgmt]: FAILED! => changed=true
  cmd:
  - /usr/local/bin/helm
  - upgrade
  - --install
  - gpu-feature-discovery
  - nvgfd/gpu-feature-discovery
  - --version
  - 0.2.0
  - --set
  - migStrategy=mixed
  - --wait
  delta: '0:05:00.764662'
  end: '2020-10-16 13:47:31.844141'
  invocation:
    module_args:
      _raw_params: /usr/local/bin/helm upgrade --install "gpu-feature-discovery" "nvgfd/gpu-feature-discovery" --version "0.2.0" --set "migStrategy=mixed" --wait
      _uses_shell: false
      argv: null
      chdir: null
      creates: null
      executable: null
      removes: null
      stdin: null
      stdin_add_newline: true
      strip_empty_ends: true
      warn: true
  msg: non-zero return code
  rc: 1
  start: '2020-10-16 13:42:31.079479'
  stderr: 'Error: UPGRADE FAILED: timed out waiting for the condition'
  stderr_lines: <omitted>
  stdout: ''
  stdout_lines: <omitted>

And for completeness: the mgmt node is a plain Ubuntu 18.04 VM with no GPU, not a real DGX.
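
In case it helps, this is how I have been poking at it while helm just times out; the namespace and pod name are placeholders, and I have not confirmed whether the 0.2.0 chart actually exposes a failOnInitError value like the device plugin does:

# See whether the gpu-feature-discovery pods schedule anywhere and ever go Ready
kubectl get pods -A -o wide | grep gpu-feature-discovery
kubectl -n <namespace> describe pod <pod-name>
# If the chart supports it, the same kind of override as the device plugin:
helm upgrade --install gpu-feature-discovery nvgfd/gpu-feature-discovery \
  --version 0.2.0 --set migStrategy=mixed --set failOnInitError=false --wait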

crd477 commented Oct 16 '20 13:10

Closing as a stale issue. Please reopen if this is still causing problems.

klueska commented Feb 20 '23 15:02