gpu-operator with MIG won't work if GPU Node is deleted from cluster, reprovisioned, and then re-joined with the same name

Open rpardini opened this issue 1 year ago • 1 comments

It seems the operator stores state, either in memory or in the ClusterPolicy CRD, in such a way that deleting a Node and re-joining it under the same name won't work. One has to manually label the Node with nvidia.com/gpu.deploy.mig-manager=true; otherwise the operator appears to skip the MIG configuration completely and things fail.
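For readers who want to reproduce the manual workaround, here is a minimal sketch that builds the corresponding `kubectl label node` command. The label key is spelled as in the maintainer's reply (`nvidia.com/gpu.deploy.mig-manager`); the node name `gpu-node-1` and the helper `build_label_command` are hypothetical and only for illustration.

```python
import shlex
from typing import List

# Label key as spelled in the maintainer's reply in this thread.
MIG_MANAGER_LABEL = "nvidia.com/gpu.deploy.mig-manager"


def build_label_command(node_name: str, value: str = "true") -> List[str]:
    """Build the argv for manually labelling a node (hypothetical helper)."""
    if not node_name:
        raise ValueError("node_name must be non-empty")
    return [
        "kubectl", "label", "node", node_name,
        f"{MIG_MANAGER_LABEL}={value}",
        "--overwrite",  # replace the label if it is already set
    ]


if __name__ == "__main__":
    # "gpu-node-1" is a placeholder; substitute the re-joined node's name.
    print(shlex.join(build_label_command("gpu-node-1")))
```

The function only builds the command; passing the list to `subprocess.run` would execute it against whatever cluster the current kubeconfig points to.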

The main symptom is that the validation pod fails with "Failed to allocate device vector A (error code initialization error)", which is a red herring.

What am I missing? Or is this just an unsupported scenario?

Thanks

rpardini avatar Jul 29 '24 15:07 rpardini

@rpardini You should not need to manually label the node to get the mig-manager deployed. Can you provide more details on the state of your cluster and the status of pods in the gpu-operator namespace? GFD should label your node with nvidia.com/mig.capable to indicate whether the GPUs on the node are MIG capable. Only then will the gpu-operator label the node with nvidia.com/gpu.deploy.mig-manager=true to deploy the mig-manager.
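The labelling sequence described above can be checked against a node's labels. The following is a rough sketch, not the operator's actual logic: it assumes both labels carry the string value "true" when set, and `diagnose_mig_labels` and its return codes are hypothetical names. The labels dictionary could be taken from `.metadata.labels` in the output of `kubectl get node <name> -o json`.

```python
from typing import Dict

# Label keys as named in this thread.
MIG_CAPABLE_LABEL = "nvidia.com/mig.capable"
MIG_MANAGER_LABEL = "nvidia.com/gpu.deploy.mig-manager"


def diagnose_mig_labels(labels: Dict[str, str]) -> str:
    """Return a short code describing where the labelling chain stopped.

    Assumption: a label value of "true" means the label is in effect.
    """
    capable = labels.get(MIG_CAPABLE_LABEL)
    deploy = labels.get(MIG_MANAGER_LABEL)

    if capable is None:
        # GFD has not (yet) labelled the node at all.
        return "gfd-label-missing"
    if capable != "true":
        # GFD reports the GPUs as not MIG capable.
        return "not-mig-capable"
    if deploy != "true":
        # MIG capable, but the operator has not added the deploy label.
        return "mig-manager-label-missing"
    return "ok"


if __name__ == "__main__":
    # Example: a re-joined node that GFD labelled but the operator did not.
    print(diagnose_mig_labels({MIG_CAPABLE_LABEL: "true"}))
```

A result of `gfd-label-missing` would point at GFD rather than the operator, while `mig-manager-label-missing` would match the behaviour reported in this issue.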

cdesiniotis avatar Aug 02 '24 21:08 cdesiniotis

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot] avatar Nov 04 '25 22:11 github-actions[bot]

This issue has been open for over 90 days without recent updates, and the context may now be outdated.

To keep the issue tracker clean and focused on current and actionable topics, I am going to close this issue. If this issue is still relevant with the latest version of the NVIDIA GPU Operator, please feel free to reopen it or open a new one with updated details.

cdesiniotis avatar Nov 14 '25 22:11 cdesiniotis