Switching MIG mode back to disabled fails on A100 GKE
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu22.04
- Kernel Version: 5.15.0-1048-gke
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): GKE
- GPU Operator Version: v23.9.1
2. Issue or feature description
We deployed the GPU Operator on a GKE cluster using an A100 instance. By default, MIG mode is disabled on the GKE A100 node. After enabling MIG mode and selecting a small partition size (i.e. mig.config is all-1g.5gb), the GPU Operator could not partition the GPU as expected: the A100 instance (on vSphere) has to be rebooted for the GPU to enter MIG mode, because in some cases a GPU reset is not allowed via the hypervisor for security reasons. After rebooting the GPU node, the GPU Operator partitions the GPU as expected and all GPU Operator pods are up and running.
I then changed the mig.config label back to all-disabled. The expected result was that, after rebooting the GPU node, MIG mode would be disabled and the GPU Operator would run properly. Instead, the nvidia-device-plugin-daemonset pod and the CUDA validator go into CrashLoopBackOff with the error "At least one device with migEnabled=true was not configured correctly".
Our workaround was to first manually disable MIG mode on the node and then reboot the A100 instance. After that, the GPU Operator comes up properly and all GPU Operator pods are up and running.
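For reference, the manual workaround on the node looked roughly like the following (a minimal sketch; it assumes shell access to the A100 node):

sudo nvidia-smi -mig 0    # request MIG mode disabled on all GPUs (change stays pending until reset/reboot)
sudo reboot               # reboot the instance so the pending mode change is applied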
3. Steps to reproduce the issue
- Created a GKE cluster.
- Deployed GPU Operator v23.9.1 through Helm.
- Changed the mig.config label to all-1g.5gb (see the kubectl sketch after this list).
- Rebooted the A100 GPU instance.
- Changed the mig.config label back to all-disabled.
- Rebooted the A100 GPU instance.
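The label changes above were applied with kubectl along these lines (a sketch; the node name is a placeholder for the actual GKE node):

# enable MIG with the all-1g.5gb profile
kubectl label node <node-name> nvidia.com/mig.config=all-1g.5gb --overwrite
# switch back to disabled
kubectl label node <node-name> nvidia.com/mig.config=all-disabled --overwrite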
4. Information to attach (optional if deemed irrelevant)
Logs of MIG Manager pod:
Applying the selected MIG config to the node
time="2024-02-13T10:21:13Z" level=debug msg="Parsing config file..."
time="2024-02-13T10:21:13Z" level=debug msg="Selecting specific MIG config..."
time="2024-02-13T10:21:13Z" level=debug msg="Running apply-start hook"
time="2024-02-13T10:21:13Z" level=debug msg="Checking current MIG mode..."
time="2024-02-13T10:21:13Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2024-02-13T10:21:13Z" level=debug msg=" GPU 0: 0x20B010DE"
time="2024-02-13T10:21:13Z" level=debug msg=" Asserting MIG mode: Disabled"
time="2024-02-13T10:21:13Z" level=debug msg=" MIG capable: true\n"
time="2024-02-13T10:21:13Z" level=debug msg=" Current MIG mode: Enabled"
time="2024-02-13T10:21:13Z" level=debug msg="Running pre-apply-mode hook"
time="2024-02-13T10:21:13Z" level=debug msg="Applying MIG mode change..."
time="2024-02-13T10:21:13Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2024-02-13T10:21:13Z" level=debug msg=" GPU 0: 0x20B010DE"
time="2024-02-13T10:21:13Z" level=debug msg=" MIG capable: true\n"
time="2024-02-13T10:21:13Z" level=debug msg=" Current MIG mode: Enabled"
time="2024-02-13T10:21:13Z" level=debug msg=" Clearing existing MIG configuration"
time="2024-02-13T10:21:13Z" level=debug msg=" Updating MIG mode: Disabled"
time="2024-02-13T10:21:13Z" level=debug msg=" Mode change pending: true"
time="2024-02-13T10:21:13Z" level=debug msg="At least one mode change pending"
time="2024-02-13T10:21:13Z" level=debug msg="Resetting all GPUs..."
time="2024-02-13T10:21:16Z" level=error msg="\nResetting GPU 00000000:00:04.0 is not supported.\n"
time="2024-02-13T10:21:16Z" level=debug msg="Running apply-exit hook"
time="2024-02-13T10:21:16Z" level=fatal msg="Error applying MIG configuration with hooks: error resetting all GPUs: exit status 3"
Restarting any GPU clients previously shutdown in Kubernetes by reenabling their component-specific nodeSelector labels
node/gke-cluster-ta-a100-pool-2-f53a3ff6-9btn labeled
Changing the 'nvidia.com/mig.config.state' node label to 'failed'
node/gke-cluster-ta-a100-pool-2-f53a3ff6-9btn labeled
time="2024-02-13T10:21:16Z" level=error msg="Error: exit status 1"
time="2024-02-13T10:21:16Z" level=info msg="Waiting for change to 'nvidia.com/mig.config' label"
nvidia-smi output (mig.config is all-disabled):
Hi @moditanisha22, thank you for your detailed report.
This is definitely unexpected behaviour, especially given this from the mig-manager logs:
time="2024-02-13T10:21:13Z" level=debug msg="Walking MigConfig for (devices=all)"
time="2024-02-13T10:21:13Z" level=debug msg=" GPU 0: 0x20B010DE"
time="2024-02-13T10:21:13Z" level=debug msg=" MIG capable: true\n"
time="2024-02-13T10:21:13Z" level=debug msg=" Current MIG mode: Enabled"
time="2024-02-13T10:21:13Z" level=debug msg=" Clearing existing MIG configuration"
time="2024-02-13T10:21:13Z" level=debug msg=" Updating MIG mode: Disabled"
time="2024-02-13T10:21:13Z" level=debug msg=" Mode change pending: true"
time="2024-02-13T10:21:13Z" level=debug msg="At least one mode change pending"
This state is pulled directly out of the GPU, so if it sees Mode change pending: true, then a reboot should definitely apply that change.
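If you want to confirm this on the node itself, the pending mode change is visible via nvidia-smi, for example:

nvidia-smi --query-gpu=index,mig.mode.current,mig.mode.pending --format=csv

A row showing mig.mode.current as Enabled with mig.mode.pending as Disabled means the GPU is still waiting for a reset or reboot to commit the change.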
One thing worth trying is to force the mig-manager to trigger a reboot itself at the appropriate time. This can be done by setting the following when deploying the operator:
--set-string migManager.env[0].name="WITH_REBOOT"
--set-string migManager.env[0].value="true"
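For example, when deploying via Helm (the release name and namespace here are just illustrative):

helm upgrade --install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set-string migManager.env[0].name="WITH_REBOOT" \
  --set-string migManager.env[0].value="true"

With WITH_REBOOT=true, the mig-manager should reboot the node itself when a MIG mode change requires it, rather than relying on a manually timed reboot.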
Let me know if this helps resolve your issue. It could be that your reboot was somehow being triggered before the GPU had a chance to commit this pending change.