
2 questions about setting GPU and scheduling GPU on k8s

Open chiehpower opened this issue 3 years ago • 1 comments

Dear all,

Recently I dove into Kubernetes, this popular technology. I requested a GPU for a pod to run some deep learning tasks, but I am confused about setting up GPUs and scheduling them.

I was using microk8s to create the cluster and pods. MicroK8s makes it very handy to enable relevant add-ons, including kubeflow, gpu, etc.

  1. If I use microk8s to enable the gpu add-on, do I still have to install NVIDIA's k8s-device-plugin manually?
  2. There is only one GPU device in this node. I tried to create a pod that requests a GPU, following the instructions on the official Kubernetes website, but I ran into the issue below.
  • microk8s version : 1.21/beta (microk8s --channel=1.21/beta --classic)
  • host OS : Ubuntu 20.04
  • GPU : RTX 3060 (12 GB)
  • host GPU driver : 460.73.01
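Before debugging the pod itself, one way to check whether the node advertises the `nvidia.com/gpu` resource at all (a sketch; `<node-name>` is a placeholder, and the `grep` pattern assumes default `kubectl describe` output):

```shell
# List every node's advertised GPU count; a node where the device
# plugin never registered will show "<none>" in the GPU column.
microk8s.kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu

# Or inspect the Capacity/Allocatable sections of a specific node
# (replace <node-name> with the actual node name).
microk8s.kubectl describe node <node-name> | grep -A 8 -E 'Capacity|Allocatable'
```

If `nvidia.com/gpu` does not appear under Allocatable, the scheduler will reject any pod that requests it, regardless of how the pod YAML is written.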

The content of the YAML file:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: OnFailure
  containers:
  - image: nvcr.io/nvidia/cuda:11.2.2-devel-ubuntu18.04
    name: cuda
    resources:
      limits:
        nvidia.com/gpu: 1
        memory: "1G"

$ microk8s.kubectl create -f gpu_test.yaml
pod/gpu-pod created

$ microk8s.kubectl get pods 
NAME                                                         READY   STATUS    RESTARTS   AGE
gpu-operator-node-feature-discovery-master-dcf999dc8-p7s64   1/1     Running   0          58m
gpu-operator-node-feature-discovery-worker-mlcpt             1/1     Running   0          58m
gpu-operator-64df558567-xx6sx                                1/1     Running   0          58m
gpu-pod                                                      0/1     Pending   0          2m21s

$ microk8s.kubectl describe pods gpu-pod
Name:         gpu-pod
Namespace:    default
Priority:     0
Node:         <none>
Labels:       <none>
Annotations:  <none>
Status:       Pending
IP:           
IPs:          <none>
Containers:
  cuda:
    Image:      nvcr.io/nvidia/cuda:11.2.2-devel-ubuntu18.04
    Port:       <none>
    Host Port:  <none>
    Limits:
      memory:          1G
      nvidia.com/gpu:  1
    Requests:
      memory:          1G
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-lrhb7 (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  kube-api-access-lrhb7:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age    From               Message
  ----     ------            ----   ----               -------
  Warning  FailedScheduling  3m46s  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
  Warning  FailedScheduling  3m45s  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu.
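For context, "Insufficient nvidia.com/gpu" usually means the device plugin never registered the GPU with the kubelet, so a few places worth looking (a sketch; the namespace and daemonset names are assumptions based on the default microk8s gpu add-on / gpu-operator layout and may differ on your cluster):

```shell
# Pods deployed by the microk8s gpu add-on typically land in this
# namespace (name is an assumption; adjust to your cluster).
microk8s.kubectl get pods -n gpu-operator-resources

# The device-plugin daemonset logs show whether it detected the GPU
# and registered nvidia.com/gpu with the kubelet.
microk8s.kubectl logs -n gpu-operator-resources ds/nvidia-device-plugin-daemonset

# On the host, confirm the driver itself sees the card.
nvidia-smi
```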

I tested many kinds of YAML, but all got the same issue. Hence, I am wondering: does the GPU device plugin for Kubernetes support only a single GPU device, and is that GPU not fully available for scheduling? GPU scheduling is very important to me because I want to run TensorRT and other deep learning tasks inside pods.
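On the question of sharing a single GPU: by default the device plugin advertises each physical GPU as one whole `nvidia.com/gpu` unit, so only one pod can hold it at a time. Later releases of k8s-device-plugin (v0.12 and up) added optional time-slicing, configured roughly like the fragment below (a sketch based on the plugin's documented config format; `replicas: 4` is an illustrative value, and this feature was not yet available in 2021-era plugin versions):

```yaml
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4   # advertise one physical GPU as 4 schedulable units
```

With such a config, the node advertises `nvidia.com/gpu: 4`, letting up to four pods time-share the one physical card, with no memory isolation between them.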

I would be glad to provide more detailed information if anything is unclear.

Thank you so much!

chiehpower avatar May 07 '21 02:05 chiehpower

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

github-actions[bot] avatar Feb 28 '24 04:02 github-actions[bot]