New strategy for MIG-enabled GPUs that advertises no MIG slices if none have been created
1. Quick Debug Information
- OS/Version (e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04
- Kernel Version: Linux 5.15.0-97-lowlatency
- Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Docker
- K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): KinD
2. Issue or feature description
The mixed strategy requires at least one MIG slice to be available on the device for the k8s-device-plugin and GPU-feature-discovery pods to reach the Running state. In our use case, we want to advertise MIG slices created dynamically, and there are scenarios in our setup where GPU devices are MIG-enabled but no actual MIG slices are present on the device.
We want a mechanism or new strategy to handle dynamic MIG creation use cases.
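For context, a minimal sketch of the kind of deployment this is about, assuming the standalone k8s-device-plugin Helm chart with GFD enabled (release name, namespace, and exact chart values are illustrative and vary between chart versions):

```sh
# Deploy the device plugin with the mixed MIG strategy. On a node whose GPUs
# are MIG-enabled but carry no MIG slices yet, we would like the plugin and
# GFD pods to still reach Running and simply advertise zero MIG resources.
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --create-namespace \
  --set migStrategy=mixed \
  --set gfd.enabled=true
```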
@klueska ptal
@mrunalp @asm582 could you clarify what you would expect the behaviour to be?
I have reviewed the code and we should be generating SOME labels if mig-strategy=mixed even if all MIG-enabled devices are empty. Which labels are you looking for specifically?
Could you provide the logs of a GFD pod that is not in the running state for this configuration?
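If it helps, something along these lines should collect that, assuming a GPU Operator style deployment where the GFD pods carry the app=gpu-feature-discovery label in the gpu-operator namespace (adjust both for your setup):

```sh
# Show the GFD pods that are stuck, then dump their logs and events.
kubectl get pods -n gpu-operator -l app=gpu-feature-discovery -o wide
kubectl logs -n gpu-operator -l app=gpu-feature-discovery --all-containers --tail=200
kubectl describe pods -n gpu-operator -l app=gpu-feature-discovery
```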
They want a new MIG strategy that allows a GPU to exist in MIG mode without having any MIG devices configured on it. Right now we error out if this is the case. The purpose is to let them dynamically create MIG devices on such GPUs, kick the plugin to restart, and then start advertising those MIG devices as "mixed-strategy-style" resources.
To support this properly, the MIG manager will also need to be updated to:
- Allow one to trigger a MIG mode change via a label (as it does today); BUT
- NOT reapply this configuration if/when the MIG manager is restarted
In other words -- don't persist any MIG configs that get applied. Only apply a config at the moment it is requested.
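As a rough illustration of the intended flow (not the current behaviour), assuming an A100 where profile ID 19 is the 1g.5gb slice, a node name in $NODE, and the GPU Operator's usual daemonset name and namespace:

```sh
# Dynamically create a 1g.5gb GPU instance plus its default compute instance
# on a GPU that is already in MIG mode but has no slices yet.
sudo nvidia-smi mig -cgi 19 -C

# Kick the device plugin so it re-enumerates and starts advertising the new
# MIG device as a mixed-strategy resource.
kubectl rollout restart -n gpu-operator daemonset/nvidia-device-plugin-daemonset

# Confirm the node now exposes the resource.
kubectl describe node "$NODE" | grep mig-1g.5gb
```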
Hi @elezar Do you need more details on this?
I chatted with @elezar and I am going to work on this later this week or next.
I was also thinking about my comment above a bit more, and I actually think we don't need any mig-manager changes. To avoid having the mig-manager reapply its "known" configuration after a reboot / restart, you simply have to remove the label.
Meaning that your controller should apply a label to set up a specific config (presumably with some GPUs set to MIG enabled and some set to MIG disabled), wait for the mig-manager to complete, and then remove the label. If no label is set, the mig-manager simply doesn't try and apply any config.
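A sketch of that sequence in shell, assuming the node name is in $NODE and relying on the mig-manager publishing its status in the nvidia.com/mig.config.state label:

```sh
# Request a one-shot MIG reconfiguration.
kubectl label node "$NODE" nvidia.com/mig.config=all-enabled --overwrite

# Wait for the mig-manager to report success for this node.
until [ "$(kubectl get node "$NODE" \
    -o jsonpath='{.metadata.labels.nvidia\.com/mig\.config\.state}')" = "success" ]; do
  sleep 5
done

# Remove the label so nothing is reapplied if the mig-manager restarts later.
kubectl label node "$NODE" nvidia.com/mig.config-
```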
In the end, I decided to just change the mixed strategy to warn when no MIG devices are configured instead of erroring: https://github.com/NVIDIA/k8s-device-plugin/pull/806
Thank you for creating the fix for the mixed strategy. We are facing issues using the MIG Manager:
- Start the MIG manager with mig.config set to all-enabled
- Let the operator configure GPUs with MIG enabled but have zero MIG slices
- When nvidia.com/mig.config.state=success, we remove the label mig.config from the node
- (issue) The MIG manager applies nvidia.com/mig.config=all-disabled labels and disables MIG on the GPU.
We need a setup where the MIG partitions are untouched when the MIG manager pod restarts. @klueska @elezar
The mig-manager itself doesn't apply any "default" config. It only applies a change if a label is set. If no label is set, it will just sit in a wait loop, waiting for one to be set with some config.
@cdesiniotis does the operator force the mig.config label to be set to all-disabled if it gets unset by an external user?
One way to work around this (and possibly even the "right" solution going forward) would be to deploy the operator with nvidia.com/mig.config=all-enabled on the nodes you want configured that way, wait for nvidia.com/mig.config.state=success on those nodes, and then disable the mig-manager altogether on those nodes by setting the label nvidia.com/gpu.deploy.mig-manager=false.
> @cdesiniotis does the operator force the mig.config label to be set to all-disabled if it gets unset by an external user?
Yes, see https://github.com/NVIDIA/gpu-operator/blob/main/controllers/state_manager.go#L538-L546
@asm582 if you set migManager.config.default="", then the operator will not apply a default label. So after you remove the mig.config label, the label should remain unset.
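For reference, two hedged ways to set that, either at install/upgrade time via the GPU Operator Helm chart or by patching the ClusterPolicy on a running cluster (cluster-policy is the usual resource name, but check your install):

```sh
# Helm: no default MIG config, so the operator never sets mig.config on its own.
helm upgrade -i gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set migManager.config.default=""

# Or patch the existing ClusterPolicy in place.
kubectl patch clusterpolicy cluster-policy --type merge \
  -p '{"spec":{"migManager":{"config":{"default":""}}}}'
```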
Let me list some of my findings. Note though that this is OpenShift - not vanilla K8s. The NVIDIA GPU operator is 24.3.
1. When the cluster policy is deployed with migManager.config.default="", no MIG manager pod is created. So if I want to enable MIG, for instance, I have to explicitly tell the operator to deploy the MIG manager:
   kubectl label node $node nvidia.com/gpu.deploy.mig-manager=true --overwrite
   Then the MIG manager starts and I can enable MIG without creating any slices (MIG strategy is mixed):
   kubectl label node $node nvidia.com/mig.config=all-enabled --overwrite
2. When there are no MIG slices while MIG is enabled on the GPU, the CUDA validator pod will CrashLoopBackOff and the operator validator pod will keep waiting for initialization. This makes sense, but is a bit annoying. More on this later.
3. Now, when MIG is enabled through the MIG manager, we can disable the manager as suggested, so that it doesn't interfere with other ways to manage MIG slices. It works.
   kubectl label node $node nvidia.com/gpu.deploy.mig-manager=false --overwrite
   kubectl label node $node nvidia.com/mig.config-
4. I tried to disable the validator to get rid of the error status:
   kubectl label node $node nvidia.com/gpu.deploy.operator-validator=false --overwrite
   This removes at least the operator validator pod. However, if the operator validator is disabled before the device plugin has a chance to start, the plugin will never run. This should be kept in mind.
5. I deployed a workload pod that requested nvidia.com/mig-1g.5gb: 1 and created a MIG slice manually to satisfy that, using nvidia-smi. Just for testing. I had to delete the device plugin pod and let it be re-created, so that it picks up the MIG changes. The pod remained pending until the device plugin advertised the nvidia.com/mig-1g.5gb capacity. After that the workload pod (vectoradd) ran successfully.
6. After deleting the workload pod, and in line with the manual testing, I tried to delete the MIG partition (nvidia-smi mig -dgi -gi 9 or nvidia-smi mig -dgi). However, I got this error:
   Unable to destroy GPU instance ID 9 from GPU 0: In use by another client
   Failed to destroy GPU instances: In use by another client
   command terminated with exit code 19
The error persisted no matter which "client" I tried to stop. I disabled DCGM/the DCGM exporter, the validators, even the device plugin, but nothing helped. I believe at this point I can only reset the GPU before I can delete the MIG slice.
Thanks @empovit. @klueska @cdesiniotis any pointers that can help us with step 6 of the above comment?
You cannot delete a gi until you have deleted its ci.
@klueska thanks! You're absolutely right. So one problem less :) And a major one.
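For anyone hitting the same wall, a quick sketch of that order, using the GPU instance ID from the earlier error output (IDs will differ on other setups):

```sh
# Compute instances must go first, then the GPU instance itself.
sudo nvidia-smi mig -lci              # list compute instances
sudo nvidia-smi mig -dci -gi 9        # destroy the compute instance(s) inside GPU instance 9
sudo nvidia-smi mig -dgi -gi 9        # now the GPU instance can be destroyed
```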
Everything except 4 seems to be expected (if not ideal).
@cdesiniotis @tariq1890 do you know why (4) might be happening?
Actually, if I recall correctly the validator pod validates "everything", not just CUDA workloads. So I'm guessing the plugin is waiting for the toolkit to be validated, which never happens if the validator pod doesn't run it.
Correct, the device-plugin does not run until the toolkit installation is validated by the validator pod.
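A hedged way to see that gating on a live cluster, assuming the GPU Operator's usual pod label and init-container naming (both can differ between operator versions):

```sh
# The device-plugin pod sits in Init until its toolkit-validation init
# container completes, which never happens if the validator is disabled.
kubectl get pods -n gpu-operator -l app=nvidia-device-plugin-daemonset
kubectl describe pod -n gpu-operator -l app=nvidia-device-plugin-daemonset \
  | grep -A3 toolkit-validation
```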
This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.
This issue was automatically closed due to inactivity.