
unable to enable MPS strategy

Open · thien-lm opened this issue 1 year ago · 1 comment

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04):
  • Kernel Version:
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker):
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS):
  • GPU Operator Version:

2. Issue or feature description

I want to set the sharing mode of the NVIDIA device plugin to MPS, but it changes to MIG automatically.
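
For context, in device plugin releases that support MPS the sharing strategy is driven by the plugin's config file, which the GPU Operator consumes from a ConfigMap referenced in the ClusterPolicy. A minimal sketch of such a ConfigMap is shown below; the ConfigMap name plugin-config and the key mps-any are illustrative, the replica count is arbitrary, and OPERATOR_NAMESPACE is a placeholder as elsewhere in this template:

apiVersion: v1
kind: ConfigMap
metadata:
  name: plugin-config            # illustrative name
  namespace: OPERATOR_NAMESPACE
data:
  mps-any: |-
    version: v1
    sharing:
      mps:
        resources:
        - name: nvidia.com/gpu
          replicas: 4            # expose each physical GPU as 4 MPS replicas

The ClusterPolicy is then pointed at this ConfigMap, for example with devicePlugin.config.name=plugin-config and devicePlugin.config.default=mps-any (or the equivalent Helm values); these are the same fields used for time-slicing configs.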

3. Steps to reproduce the issue

Detailed steps to reproduce the issue.

4. Information to attach (optional if deemed irrelevant)

  • [ ] kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • [ ] kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • [ ] If a pod/ds is in an error or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • [ ] If a pod/ds is in an error or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • [ ] Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • [ ] containerd logs: journalctl -u containerd > containerd.log

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]

thien-lm commented on May 10, 2024

@thien-lm can you please share more information on your deployment and the behavior you are observing?

cdesiniotis commented on Jul 11, 2024
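
For readers hitting the same behavior, the information the maintainer is asking for can usually be pulled with standard kubectl commands; cluster-policy is the default ClusterPolicy name created by the Helm chart, and NODE_NAME is a placeholder:

# Device plugin configuration currently applied through the ClusterPolicy
kubectl get clusterpolicies.nvidia.com/cluster-policy -o yaml

# GPU-related node labels (labels published by GPU Feature Discovery, e.g. nvidia.com/mig.strategy, show up here)
kubectl get node NODE_NAME --show-labels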

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed. To skip these checks, apply the "lifecycle/frozen" label.

github-actions[bot] commented on Nov 5, 2025

Hey! This issue has been open for over 90 days without any recent updates, and the original context may now be outdated.

A response was provided in this comment: https://github.com/NVIDIA/gpu-operator/issues/717#issuecomment-2223980748 and there has been no activity since then. Given that it has been more than a year now, and in order to keep the issue tracker clean and focused on current, actionable topics, I’m going to close this issue.

If you have any questions or need further assistance, please open a new issue with the relevant details so someone from the team can take a look.

karthikvetrivel commented on Nov 14, 2025