gpu-operator
A100: The GPU operator will not install the mig-manager
1. Quick Debug Information
- OS/Version(e.g. RHEL8.6, Ubuntu22.04):Ubuntu22.04
- Kernel Version:5.15.0-60-generic
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker):Containerd
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS):v1.24.15
- GPU Operator Version:v23.6.0
- GPU: A100 PCIe 40GB
2. Issue or feature description
With the driver and NFD installed in advance, I tried to install gpu-operator in an A100 environment. The installation failed and the mig-manager was missing.
GPU:
root@master1:~# lspci | grep NVIDIA
2f:00.0 3D controller: NVIDIA Corporation GA100 [A100 PCIe 40GB] (rev a1)
The gpu-operator pod info:
root@master1:~# kubectl get pod -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-kht9c 1/1 Running 1 (36m ago) 57m
gpu-operator-8597b78788-4ncg7 1/1 Running 1 (36m ago) 57m
nvidia-container-toolkit-daemonset-pldv5 1/1 Running 1 (36m ago) 57m
nvidia-cuda-validator-tgqqk 0/1 Init:CrashLoopBackOff 1 (17s ago) 19s
nvidia-dcgm-exporter-m7hg7 1/1 Running 1 (36m ago) 57m
nvidia-device-plugin-daemonset-gjlp7 0/1 CrashLoopBackOff 17 (4m47s ago) 57m
nvidia-operator-validator-7969z 0/1 Init:2/4 6 (2m59s ago) 57m
There is no nvidia-mig-manager pod. The logs of the failing device-plugin pod are as follows:
root@master1:~# kubectl logs -n gpu-operator nvidia-device-plugin-daemonset-gjlp7
...
I0109 12:33:24.644789 1 main.go:256] Retreiving plugins.
I0109 12:33:24.645553 1 factory.go:107] Detected NVML platform: found NVML library
I0109 12:33:24.645594 1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0109 12:33:24.699911 1 main.go:123] error starting plugins: error getting plugins: failed to construct NVML resource managers: error building device map: error building device map from config.resources: invalid MIG configuration: At least one device with migEnabled=true was not configured correctly: error visiting device: device 0 has an invalid MIG configuration
3. Steps to reproduce the issue
- Install k8s cluster;
- Install nfd:
root@master1:~# kubectl get pod -n node-feature-discovery
NAME READY STATUS RESTARTS AGE
nfd-release-node-feature-discovery-master-5564946bcf-x6qzs 1/1 Running 13 (49m ago) 43d
nfd-release-node-feature-discovery-worker-x7nff 1/1 Running 11 (49m ago) 43d
- Install gpu driver: Driver Version: 535.129.03
- Install the operator:
helm install gpu-operator -n gpu-operator --create-namespace ./gpu-operator --set driver.enabled=false --set nfd.enabled=false
- Check the gpu-operator pod.
Hi @lsyLearn, can you manually disable MIG on the node and then re-install the operator? Some of our components, e.g. k8s-device-plugin, will fail if MIG is enabled without any MIG devices existing. Once the operator is re-installed and in a healthy state, the mig-manager should come up and then you can label the node to apply a MIG configuration.
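The suggestion above could be scripted roughly as follows. This is a minimal sketch, not an official procedure. It assumes the GPU is index 0 (as in the device-plugin error), that the node is named master1 (taken from the shell prompt in the report), and that `all-1g.5gb` is a profile defined in the mig-manager configuration in use; the `run` helper, `recover_mig`, and the `DRY_RUN`, `NODE` and `MIG_PROFILE` variables are illustrative names only. By default the script just prints the commands. Set DRY_RUN=0 to execute them, and run the `nvidia-smi` steps on the GPU node itself.

```shell
#!/bin/sh
# Sketch: disable MIG, reinstall the operator, then request a MIG layout via a node label.
set -eu

DRY_RUN="${DRY_RUN:-1}"                   # 1 = only print the commands (default)
NODE="${NODE:-master1}"                   # assumed node name, from the prompt in the report
MIG_PROFILE="${MIG_PROFILE:-all-1g.5gb}"  # assumed profile; must exist in the mig-manager config

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

recover_mig() {
  # 1. Disable MIG mode on GPU 0. The change can stay pending until the GPU
  #    is reset or the node is rebooted, so check the reported mode afterwards.
  run nvidia-smi -i 0 -mig 0
  run nvidia-smi --query-gpu=index,name,mig.mode.current --format=csv

  # 2. Reinstall the operator with the same flags as the original install.
  run helm uninstall gpu-operator -n gpu-operator
  run helm install gpu-operator -n gpu-operator --create-namespace ./gpu-operator \
    --set driver.enabled=false --set nfd.enabled=false

  # 3. Only after all operator pods are healthy: label the node to apply a MIG configuration.
  run kubectl label node "$NODE" "nvidia.com/mig.config=$MIG_PROFILE" --overwrite
}

recover_mig
```

Step 3 is best run separately in practice, after `kubectl get pod -n gpu-operator` shows the validator and device-plugin pods healthy and the nvidia-mig-manager pod present.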