
New strategy for MIG-enabled GPUs that advertises no MIG slices if none are created

Open asm582 opened this issue 1 year ago • 19 comments


Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu22.04
  • Kernel Version: Linux 5.15.0-97-lowlatency
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Docker
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): KinD

2. Issue or feature description


The mixed strategy requires at least one MIG slice to be present on the device for the k8s-device-plugin and gpu-feature-discovery pods to reach the Running state. In our use case, we want to advertise MIG slices that are created dynamically, and there are scenarios in our setup where GPU devices are MIG-enabled but no MIG slices exist on them yet.

We want a mechanism or new strategy to handle dynamic MIG creation use cases.
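
For reference, a minimal sketch of deploying the plugin with the mixed strategy via its Helm chart (the repo alias, release name and namespace below are placeholders and may differ from your setup):

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm upgrade -i nvdp nvdp/nvidia-device-plugin --namespace nvidia-device-plugin --create-namespace --set migStrategy=mixed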

asm582 avatar Jun 13 '24 16:06 asm582

@klueska ptal

mrunalp avatar Jun 14 '24 18:06 mrunalp

@mrunalp @asm582 could you clarify what you would expect the behaviour to be?

I have reviewed the code and we should be generating SOME labels if mig-strategy=mixed even if all MIG-enabled devices are empty. Which labels are you looking for specifically?

Could you provide the logs of a GFD pod that is not in the running state for this configuration?

elezar avatar Jun 17 '24 12:06 elezar

They want a new MIG strategy that allows a GPU to exist in MIG mode without having any MIG devices configured on it. Right now we error out if this is the case. The purpose is to let them dynamically create MIG devices on such GPUs, kick the plugin to restart, and then start advertising those MIG devices as "mixed-strategy-style" resources.
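
For the "kick the plugin to restart" step, something along these lines would do when the plugin is managed by the GPU operator (the namespace and daemonset name below are assumptions and may differ per deployment):

kubectl rollout restart -n gpu-operator daemonset/nvidia-device-plugin-daemonset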

klueska avatar Jun 17 '24 12:06 klueska

To support this properly, the MIG manager will also need to be updated to:

  1. Allow one to trigger a MIG mode change via a label (as it does today); BUT
  2. NOT reapply this configuration if/when the MIG manager is restarted

In other words -- don't persist any MIG configs that get applied. Only apply a config at the moment it is requested.

klueska avatar Jun 17 '24 12:06 klueska

Hi @elezar Do you need more details on this?

asm582 avatar Jun 24 '24 14:06 asm582

I chatted with @elezar and I am going to work on this later this week or next.

I was also thinking about my comment above a bit more, and I actually think we don't need any mig-manager changes. To avoid having the mig-manager reapply its "known" configuration after a reboot / restart, you simply have to remove the label.

Meaning that your controller should apply a label to set up a specific config (presumably with some GPUs set to MIG enabled and some set to MIG disabled), wait for the mig-manager to complete, and then remove the label. If no label is set, the mig-manager simply doesn't try to apply any config.
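
As a rough sketch of that flow, using the node labels already discussed here ($NODE is a placeholder and the polling loop is only illustrative):

kubectl label node $NODE nvidia.com/mig.config=all-enabled --overwrite
# wait until the mig-manager reports success for this node
until [ "$(kubectl get node "$NODE" -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}')" = "success" ]; do sleep 10; done
# remove the label so the config is not reapplied if the mig-manager restarts
kubectl label node $NODE nvidia.com/mig.config-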

klueska avatar Jun 24 '24 14:06 klueska

In the end, I decided to just change the mixed strategy to warn when no MIG devices are configured instead of erroring: https://github.com/NVIDIA/k8s-device-plugin/pull/806

klueska avatar Jul 08 '24 11:07 klueska

Thank you for creating the fix for the mixed strategy. We are facing issues when using the MIG manager:

  • Start the MIG manager with mig.config set to all-enabled
  • Let the operator configure the GPUs with MIG enabled but zero MIG slices
  • When nvidia.com/mig.config.state=success, remove the mig.config label from the node
  • (Issue) The MIG manager applies the nvidia.com/mig.config=all-disabled label and disables MIG on the GPU.

We need a setup where the MIG partitions are untouched when the MIG manager pod restarts. @klueska @elezar

asm582 avatar Jul 10 '24 15:07 asm582

The mig-manager itself doesn't apply any "default" config. It only applies a change if a label is set. If no label is set, it will just sit in a wait loop, waiting for one to be set with some config.

@cdesiniotis does the operator force the mig.config label to be set to all-disabled if it gets unset by an external user?

klueska avatar Jul 10 '24 15:07 klueska

One way to work around this (and possibly even the "right" solution going forward) would be to deploy the operator with nvidia.com/mig.config=all-enabled on the nodes you want configured that way, wait for nvidia.com/mig.config.state=success on those nodes, and then disable the mig-manager altogether on those nodes by setting the label nvidia.com/gpu.deploy.mig-manager=false.
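
Concretely, per node, that would look something like this ($NODE is a placeholder):

kubectl label node $NODE nvidia.com/mig.config=all-enabled --overwrite
# wait for nvidia.com/mig.config.state=success on the node, then:
kubectl label node $NODE nvidia.com/gpu.deploy.mig-manager=false --overwrite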

klueska avatar Jul 10 '24 15:07 klueska

@cdesiniotis does the operator force the mig.config label to be set to all-disabled if it gets unset by an external user?

Yes, see https://github.com/NVIDIA/gpu-operator/blob/main/controllers/state_manager.go#L538-L546

cdesiniotis avatar Jul 10 '24 22:07 cdesiniotis

@asm582 if you set migManager.config.default="", then the operator will not apply a default label. So after you remove the mig.config label, the label should remain unset.
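
With Helm that would be roughly the following (release name, chart reference and namespace are placeholders; --reuse-values keeps the rest of your existing configuration):

helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --reuse-values --set migManager.config.default=""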

cdesiniotis avatar Jul 10 '24 22:07 cdesiniotis

Let me list some of my findings. Note, though, that this is OpenShift, not vanilla K8s, and the NVIDIA GPU operator version is 24.3.

  1. When the cluster policy is deployed with migManager.config.default="", no MIG manager pod is created. So if I want to enable MIG, for instance, I have to explicitly tell the operator to deploy the MIG manager: kubectl label node $node nvidia.com/gpu.deploy.mig-manager=true --overwrite. The MIG manager then starts and I can enable MIG without creating any slices: kubectl label node $node nvidia.com/mig.config=all-enabled --overwrite (the MIG strategy is mixed).
  2. When there are no MIG slices while MIG is enabled on the GPU, the CUDA validator pod will CrashLoopBackOff and the operator validator pod will keep waiting for initialization. This makes sense, but is a bit annoying. More on this later.
  3. Once MIG is enabled through the MIG manager, we can disable the manager as suggested, so that it doesn't interfere with other ways of managing MIG slices. It works:
     kubectl label node $node nvidia.com/gpu.deploy.mig-manager=false --overwrite
     kubectl label node $node nvidia.com/mig.config-
  4. I tried to disable the validator to get rid of the error status: kubectl label node $node nvidia.com/gpu.deploy.operator-validator=false --overwrite. This removes at least the operator validator pod. However, if the operator validator is disabled before the device plugin has a chance to start, the plugin will never run. This should be kept in mind.
  5. I deployed a workload pod that requested nvidia.com/mig-1g.5gb: 1 and manually created a MIG slice with nvidia-smi to satisfy it, just for testing. I had to delete the device plugin pod and let it be re-created so that it picks up the MIG changes. The workload pod remained Pending until the device plugin advertised the nvidia.com/mig-1g.5gb capacity; after that it (vectoradd) ran successfully.
  6. After deleting the workload pod, and in line with the manual testing, I tried to delete the MIG partition (nvidia-smi mig -dgi -gi 9 or nvidia-smi mig -dgi). However, I got this error:
     Unable to destroy GPU instance ID  9 from GPU  0: In use by another client
     Failed to destroy GPU instances: In use by another client
     command terminated with exit code 19

The error persisted no matter which "client" I tried to stop. I disabled DCGM and the DCGM exporter, the validators, even the device plugin, but nothing helped. I believe at this point I can only reset the GPU before I can delete the MIG slice.

empovit avatar Jul 15 '24 14:07 empovit

Thanks @empovit. @klueska @cdesiniotis any pointers that can help us with step 6 of the above comment?

asm582 avatar Jul 15 '24 14:07 asm582

You cannot delete a GPU instance (GI) until you have deleted its compute instance (CI).
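
i.e. something like the following for the GPU instance from the error above (IDs taken from that example):

# list the compute instances, destroy the ones inside GPU instance 9, then destroy the GPU instance itself
nvidia-smi mig -lci
nvidia-smi mig -dci -gi 9
nvidia-smi mig -dgi -gi 9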

klueska avatar Jul 15 '24 18:07 klueska

@klueska thanks! You're absolutely right. So one problem less :) And a major one.

empovit avatar Jul 15 '24 18:07 empovit

Everything except 4 seems to be expected (if not ideal).

@cdesiniotis @tariq1890 do you know why (4) might be happening?

klueska avatar Jul 24 '24 11:07 klueska

Actually, if I recall correctly, the validator pod validates "everything", not just CUDA workloads. So I'm guessing the plugin is waiting for the toolkit to be validated, which never happens if the validator pod isn't there to run that check.

klueska avatar Jul 24 '24 11:07 klueska

Correct, the device-plugin does not run until the toolkit installation is validated by the validator pod.

cdesiniotis avatar Jul 31 '24 23:07 cdesiniotis

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Oct 30 '24 04:10 github-actions[bot]

This issue was automatically closed due to inactivity.

github-actions[bot] avatar Nov 29 '24 04:11 github-actions[bot]