
GPU pod stuck in Pending

imenselmi opened this issue 1 year ago

I’m trying to prepare GPU worker nodes and enable GPU support on Kubernetes so that pods can use the GPUs. I followed the steps in the README file (link), but the pod always remains Pending and never runs. I tried the CUDA 10 sample as in the tutorial and also switched to CUDA 12, but it still does not work.
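For reference, the test pod I’m applying looks roughly like this (a sketch reconstructed from the kubectl describe output further down, following the README example; only the GPU limit and the toleration matter here):

```shell
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1    # asks the scheduler for one GPU via the device plugin resource
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
```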

1. Quick Debug Information

  • OS/Version: Ubuntu 22.04.4 LTS (Jammy Jellyfish)
  • CUDA version: 12.2
  • NVIDIA driver: NVIDIA-SMI 535.183.01, Driver Version 535.183.01, CUDA Version 12.2
  • Server type: NVIDIA L40S (link)
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Docker version 27.1.1, build 6312585
  • Docker Compose version: v2.29.1
  • CRI-O version: 1.24.6
  • nvidia-container-toolkit version: 1.16.0-1 (see the host-level check after this list)
  • kubectl version: Client Version v1.30.3, Kustomize Version v5.0.4-0.20230601165947-6ce0bf390ce3, Server Version v1.30.0
  • minikube version: v1.33.1
  • helm version: v3.15.3
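A host-level sanity check of the driver and container toolkit, independent of Kubernetes (the CUDA base image tag is only an example; any CUDA image that ships nvidia-smi works):

```shell
# Runs nvidia-smi inside a container; if this fails, the problem is below Kubernetes
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```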

2. Issue or feature description

Events:

  Type     Reason            Age                  From               Message
  Warning  FailedScheduling  26m (x150 over 12h)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
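The message means the single node is not advertising any allocatable nvidia.com/gpu resource, so the scheduler has nowhere to place the pod. What the node actually advertises can be checked with something like (the node name is a placeholder):

```shell
# Shows the resources the kubelet reports; nvidia.com/gpu should appear with a non-zero count
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'
kubectl describe node <node-name> | grep -i -A 10 'Allocatable'
```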

3. Additional information

kubectl get pods

NAME                 READY   STATUS    RESTARTS   AGE
gpu-demo-vectoradd   0/1     Pending   0          12h
gpu-operator-test    0/1     Pending   0          13h
gpu-operator-test1   0/1     Pending   0          13h
gpu-pod              0/1     Pending   0          13h

```
kubectl describe pod gpu-pod

Name:             gpu-pod
Namespace:        default
Priority:         0
Service Account:  default
Node:
Labels:
Annotations:
Status:           Pending
IP:
IPs:
Containers:
  cuda-container:
    Image:      nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    Port:
    Host Port:
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ww9jw (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-ww9jw:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason            Age                  From               Message
  Warning  FailedScheduling  26m (x150 over 12h)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
```

imenselmi · Jul 30 '24 11:07

Did you deploy nvidia-device-plugin via helm? If so, which helm chart are you using? I am currently facing the same problem after upgrading from 0.14.0 to 0.16.1.
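To compare setups, the deployed chart and release values can be checked with something like (the release name and namespace below are just examples):

```shell
# Lists all Helm releases and their chart versions across namespaces
helm list -A
# Shows the values the release was installed with (release name/namespace are placeholders)
helm get values nvdp -n nvidia-device-plugin
```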

FelixMertin · Aug 09 '24 12:08

@imenselmi / @FelixMertin could you please provide the logs for the k8s-device-plugin device-plugin container?
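Something along these lines should do, depending on how the plugin was deployed (the names below are the usual defaults and may differ in your cluster):

```shell
# Helm install: select the device-plugin pods by the chart's standard label
kubectl logs -n <namespace> -l app.kubernetes.io/name=nvidia-device-plugin --all-containers
# Static manifest install: the daemonset typically lives in kube-system
kubectl logs -n kube-system ds/nvidia-device-plugin-daemonset
```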

elezar · Aug 14 '24 11:08

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] · Nov 13 '24 04:11

This issue was automatically closed due to inactivity.

github-actions[bot] · Dec 13 '24 04:12