A100: The GPU operator will not install the mig-manager

Open lsyLearn opened this issue 1 year ago • 1 comments

1. Quick Debug Information

  • OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu22.04
  • Kernel Version: 5.15.0-60-generic
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): v1.24.15
  • GPU Operator Version: v23.6.0
  • GPU: A100 PCIe 40GB

2. Issue or feature description

With the driver and NFD installed in advance, I tried to install gpu-operator in an A100 environment, but the installation failed and the mig-manager is missing. GPU:

root@master1:~# lspci | grep NVIDIA
2f:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)

The gpu-operator pod info:

root@master1:~# kubectl get pod -n gpu-operator
NAME                                       READY   STATUS                  RESTARTS         AGE
gpu-feature-discovery-kht9c                1/1     Running                 1 (36m ago)      57m
gpu-operator-8597b78788-4ncg7              1/1     Running                 1 (36m ago)      57m
nvidia-container-toolkit-daemonset-pldv5   1/1     Running                 1 (36m ago)      57m
nvidia-cuda-validator-tgqqk                0/1     Init:CrashLoopBackOff   1 (17s ago)      19s
nvidia-dcgm-exporter-m7hg7                 1/1     Running                 1 (36m ago)      57m
nvidia-device-plugin-daemonset-gjlp7       0/1     CrashLoopBackOff        17 (4m47s ago)   57m
nvidia-operator-validator-7969z            0/1     Init:2/4                6 (2m59s ago)    57m

There is no nvidia-mig-manager pod. The logs of the failing device-plugin pod are as follows:

root@master1:~# kubectl logs -n gpu-operator nvidia-device-plugin-daemonset-gjlp7
...
I0109 12:33:24.644789       1 main.go:256] Retreiving plugins.
I0109 12:33:24.645553       1 factory.go:107] Detected NVML platform: found NVML library
I0109 12:33:24.645594       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0109 12:33:24.699911       1 main.go:123] error starting plugins: error getting plugins: failed to construct NVML resource managers: error building device map: error building device map from config.resources: invalid MIG configuration: At least one device with migEnabled=true was not configured correctly: error visiting device: device 0 has an invalid MIG configuration

3. Steps to reproduce the issue

  • Install k8s cluster;
  • Install nfd:
root@master1:~# kubectl get pod -n node-feature-discovery
NAME                                                         READY   STATUS    RESTARTS       AGE
nfd-release-node-feature-discovery-master-5564946bcf-x6qzs   1/1     Running   13 (49m ago)   43d
nfd-release-node-feature-discovery-worker-x7nff              1/1     Running   11 (49m ago)   43d
  • Install the GPU driver (nvidia-smi reports Driver Version: 535.129.03);
  • Install the operator: helm install gpu-operator -n gpu-operator --create-namespace ./gpu-operator --set driver.enabled=false --set nfd.enabled=false
  • Check the gpu-operator pod.

lsyLearn avatar Jan 09 '24 13:01 lsyLearn

Hi @lsyLearn, can you manually disable MIG on the node and then re-install the operator? Some of our components, e.g. k8s-device-plugin, will fail if MIG is enabled but no MIG devices exist. Once the operator is re-installed and in a healthy state, the mig-manager should come up, and you can then label the node to apply a MIG configuration.

cdesiniotis avatar Jan 25 '24 21:01 cdesiniotis
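
The recovery suggested in the reply above can be sketched as a short script. This is a minimal sketch under assumptions, not a verified procedure. The node name `master1` is taken from the shell prompt in the report and GPU index `0` from the "device 0" in the error log. The profile `all-1g.5gb` is only an example value, and `recover_mig` and `run` are hypothetical helper names. The `helm install` flags are copied from the reproduction steps. By default the script only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Sketch only: prints the commands by default; set DRY_RUN=0 to execute them.
# NODE, GPU_INDEX and MIG_PROFILE are assumptions; adjust them for your cluster.
NODE="${NODE:-master1}"
GPU_INDEX="${GPU_INDEX:-0}"
MIG_PROFILE="${MIG_PROFILE:-all-1g.5gb}"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

recover_mig() {
    # 0. Inspect the current and pending MIG mode (run on the GPU node).
    run nvidia-smi --query-gpu=index,mig.mode.current,mig.mode.pending --format=csv

    # 1. Disable MIG mode on the affected GPU (run on the GPU node).
    run nvidia-smi -i "$GPU_INDEX" -mig 0

    # 2. Re-install the operator with the same flags as in the report.
    run helm uninstall gpu-operator -n gpu-operator
    run helm install gpu-operator -n gpu-operator --create-namespace ./gpu-operator \
        --set driver.enabled=false --set nfd.enabled=false

    # 3. Only after all pods are healthy and mig-manager is running:
    #    label the node to request a MIG configuration.
    run kubectl label node "$NODE" "nvidia.com/mig.config=$MIG_PROFILE" --overwrite
}

recover_mig
```

If the GPU is still in use when MIG is disabled, the mode change may stay pending until the GPU is reset or the node is rebooted, so it is worth re-running the query from step 0 before re-installing. Step 3 should wait until `kubectl get pod -n gpu-operator` shows the operator pods, including nvidia-mig-manager, in a healthy state.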