
Adding GPUs

jhamman opened this issue 5 years ago • 7 comments

I've started down the GPU rabbit hole. My initial efforts are focused on getting GPUs added to the GCP hubs (hydro/ocean). Things have not worked on the first try, so I'm hoping to solicit some help from a few folks (@jacobtomlinson and @consideRatio) who have more experience in this area. I'll walk through the steps I've taken and then explain where things aren't working.

Add a node pool to our GKE cluster (dev-pangeo-io)

jupyter_machine_type="n1-standard-4"
jupyter_taints="hub.jupyter.org_dedicated=user:NoSchedule"
jupyter_labels="hub.jupyter.org/node-purpose=user"
gcloud container node-pools create jupyter-gpu-pool \
    --cluster=${cluster_name} \
    --machine-type=${jupyter_machine_type} \
    --disk-type=pd-ssd \
    --zone=${zone} \
    --num-nodes=1 \
    --node-taints ${jupyter_taints} \
    --node-labels ${jupyter_labels} \
    --accelerator type=nvidia-tesla-t4,count=1

This results in a node that has this configuration:

Name:               gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754
Roles:              <none>
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/fluentd-ds-ready=true
                    beta.kubernetes.io/instance-type=n1-standard-4
                    beta.kubernetes.io/os=linux
                    cloud.google.com/gke-accelerator=nvidia-tesla-t4
                    cloud.google.com/gke-nodepool=jupyter-gpu-pool
                    cloud.google.com/gke-os-distribution=cos
                    failure-domain.beta.kubernetes.io/region=us-central1
                    failure-domain.beta.kubernetes.io/zone=us-central1-b
                    hub.jupyter.org/node-purpose=user
                    kubernetes.io/hostname=gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754
Annotations:        container.googleapis.com/instance_id: 5059798487109562556
                    node.alpha.kubernetes.io/ttl: 0
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 02 Oct 2019 09:21:47 -0700
Taints:             hub.jupyter.org_dedicated=user:NoSchedule
                    nvidia.com/gpu=present:NoSchedule
Unschedulable:      false
Conditions:
  Type                          Status  LastHeartbeatTime                 LastTransitionTime                Reason                          Message
  ----                          ------  -----------------                 ------------------                ------                          -------
  FrequentUnregisterNetDevice   False   Wed, 02 Oct 2019 09:23:15 -0700   Wed, 02 Oct 2019 09:22:13 -0700   NoFrequentUnregisterNetDevice   node is functioning properly
  FrequentKubeletRestart        False   Wed, 02 Oct 2019 09:23:15 -0700   Wed, 02 Oct 2019 09:22:13 -0700   NoFrequentKubeletRestart        kubelet is functioning properly
  FrequentDockerRestart         False   Wed, 02 Oct 2019 09:23:15 -0700   Wed, 02 Oct 2019 09:22:13 -0700   NoFrequentDockerRestart         docker is functioning properly
  FrequentContainerdRestart     False   Wed, 02 Oct 2019 09:23:15 -0700   Wed, 02 Oct 2019 09:22:13 -0700   NoFrequentContainerdRestart     containerd is functioning properly
  KernelDeadlock                False   Wed, 02 Oct 2019 09:23:15 -0700   Wed, 02 Oct 2019 09:22:13 -0700   KernelHasNoDeadlock             kernel has no deadlock
  ReadonlyFilesystem            False   Wed, 02 Oct 2019 09:23:15 -0700   Wed, 02 Oct 2019 09:22:13 -0700   FilesystemIsNotReadOnly         Filesystem is not read-only
  CorruptDockerOverlay2         False   Wed, 02 Oct 2019 09:23:15 -0700   Wed, 02 Oct 2019 09:22:13 -0700   NoCorruptDockerOverlay2         docker overlay2 is functioning properly
  NetworkUnavailable            False   Wed, 02 Oct 2019 09:21:48 -0700   Wed, 02 Oct 2019 09:21:48 -0700   RouteCreated                    NodeController create implicit route
  OutOfDisk                     False   Wed, 02 Oct 2019 09:24:03 -0700   Wed, 02 Oct 2019 09:21:47 -0700   KubeletHasSufficientDisk        kubelet has sufficient disk space available
  MemoryPressure                False   Wed, 02 Oct 2019 09:24:03 -0700   Wed, 02 Oct 2019 09:21:47 -0700   KubeletHasSufficientMemory      kubelet has sufficient memory available
  DiskPressure                  False   Wed, 02 Oct 2019 09:24:03 -0700   Wed, 02 Oct 2019 09:21:47 -0700   KubeletHasNoDiskPressure        kubelet has no disk pressure
  PIDPressure                   False   Wed, 02 Oct 2019 09:24:03 -0700   Wed, 02 Oct 2019 09:21:47 -0700   KubeletHasSufficientPID         kubelet has sufficient PID available
  Ready                         True    Wed, 02 Oct 2019 09:24:03 -0700   Wed, 02 Oct 2019 09:22:13 -0700   KubeletReady                    kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:   10.128.0.58
  ExternalIP:   35.226.29.63
  InternalDNS:  gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754.c.pangeo-181919.internal
  Hostname:     gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754.c.pangeo-181919.internal
Capacity:
 attachable-volumes-gce-pd:  128
 cpu:                        4
 ephemeral-storage:          98868448Ki
 hugepages-2Mi:              0
 memory:                     15399364Ki
 pods:                       110
Allocatable:
 attachable-volumes-gce-pd:  128
 cpu:                        3920m
 ephemeral-storage:          47093746742
 hugepages-2Mi:              0
 memory:                     12700100Ki
 pods:                       110
System Info:
 Machine ID:                 3d11ab74f52e4c93aefe3d70748c365e
 System UUID:                3D11AB74-F52E-4C93-AEFE-3D70748C365E
 Boot ID:                    9b143f16-184b-480c-b310-170ca2cae575
 Kernel Version:             4.14.127+
 OS Image:                   Container-Optimized OS from Google
 Operating System:           linux
 Architecture:               amd64
 Container Runtime Version:  docker://17.3.2
 Kubelet Version:            v1.12.8-gke.10
 Kube-Proxy Version:         v1.12.8-gke.10
PodCIDR:                     10.32.2.0/24
ProviderID:                  gce://pangeo-181919/us-central1-b/gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754
Non-terminated Pods:         (4 in total)
  Namespace                  Name                                                              CPU Requests  CPU Limits  Memory Requests  Memory Limits
  ---------                  ----                                                              ------------  ----------  ---------------  -------------
  kube-system                fluentd-gcp-v3.1.1-9d5nq                                          100m (2%)     1 (25%)     200Mi (1%)       500Mi (4%)
  kube-system                kube-proxy-gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754    100m (2%)     0 (0%)      0 (0%)           0 (0%)
  kube-system                nvidia-driver-installer-c6r9m                                     150m (3%)     0 (0%)      0 (0%)           0 (0%)
  kube-system                nvidia-gpu-device-plugin-gpbf7                                    50m (1%)      50m (1%)    10Mi (0%)        10Mi (0%)
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                   Requests    Limits
  --------                   --------    ------
  cpu                        400m (10%)  1050m (26%)
  memory                     210Mi (1%)  510Mi (4%)
  attachable-volumes-gce-pd  0           0
Events:
  Type     Reason                   Age    From                                                             Message
  ----     ------                   ----   ----                                                             -------
  Normal   Starting                 2m24s  kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754     Starting kubelet.
  Normal   NodeHasSufficientDisk    2m24s  kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754     Node gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 status is now: NodeHasSufficientDisk
  Normal   NodeHasSufficientMemory  2m24s  kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754     Node gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    2m24s  kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754     Node gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     2m24s  kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754     Node gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 status is now: NodeHasSufficientPID
  Normal   NodeAllocatableEnforced  2m24s  kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754     Updated Node Allocatable limit across pods
  Normal   NodeReady                2m23s  kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754     Node gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 status is now: NodeReady
  Normal   Starting                 2m22s  kube-proxy, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754  Starting kube-proxy.
  Normal   NodeHasSufficientDisk    118s   kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754     Node gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 status is now: NodeHasSufficientDisk
  Normal   Starting                 118s   kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754     Starting kubelet.
  Normal   NodeHasSufficientMemory  118s   kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754     Node gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    118s   kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754     Node gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     118s   kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754     Node gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 status is now: NodeHasSufficientPID
  Warning  Rebooted                 118s   kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754     Node gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 has been rebooted, boot id: 9b143f16-184b-480c-b310-170ca2cae575
  Normal   NodeNotReady             118s   kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754     Node gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 status is now: NodeNotReady
  Normal   NodeAllocatableEnforced  118s   kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754     Updated Node Allocatable limit across pods
  Normal   NodeReady                118s   kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754     Node gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 status is now: NodeReady
  Normal   Starting                 116s   kube-proxy, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754  Starting kube-proxy.

Configure a new single user profile to use this GPU pool:

https://github.com/pangeo-data/pangeo-cloud-federation/blob/f9610e72da39bfcbebdb1645b201005bc0467e7d/deployments/hydro/config/common.yaml#L88-L96
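
For readers who don't follow the link, here is a rough sketch of what that profile entry amounts to, using the KubeSpawner profileList / kubespawner_override pattern. The display name and exact chart nesting are illustrative, not copied from the linked common.yaml; the resource numbers are read off the pod spec shown further down.

singleuser:
  profileList:
    - display_name: "ML notebook with 1x NVIDIA T4"   # illustrative name
      kubespawner_override:
        image: pangeo/ml-notebook:latest
        cpu_limit: 4
        cpu_guarantee: 3.5
        mem_limit: 15G
        mem_guarantee: 14G
        extra_resource_limits:
          nvidia.com/gpu: "1"          # ask the scheduler for one GPU
        extra_resource_guarantees:
          nvidia.com/gpu: "1"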

Try to log into hydro.pangeo.io using this new profile

[screenshot]

This results in a pending pod that is NOT scheduled and looks like this:

Name:               jupyter-jhamman
Namespace:          hydro-prod
Priority:           0
PriorityClassName:  hydro-prod-default-priority
Node:               <none>
Labels:             app=jupyterhub
                    chart=jupyterhub-0.9-4300ff5
                    component=singleuser-server
                    heritage=jupyterhub
                    hub.jupyter.org/network-access-hub=true
                    release=hydro-prod
Annotations:        hub.jupyter.org/username: jhamman
Status:             Pending
IP:
Init Containers:
  volume-mount-hack:
    Image:      busybox
    Port:       <none>
    Host Port:  <none>
    Command:
      sh
      -c
      id && chown 1000:1000 /home/jovyan && ls -lhd /home/jovyan
    Environment:  <none>
    Mounts:
      /home/jovyan from home (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from daskkubernetes-token-npp9q (ro)
Containers:
  notebook:
    Image:      pangeo/ml-notebook:latest
    Port:       8888/TCP
    Host Port:  0/TCP
    Args:
      jupyterhub-singleuser
      --ip=0.0.0.0
      --port=8888
      --NotebookApp.default_url=/lab
    Limits:
      cpu:             4
      memory:          16106127360
      nvidia.com/gpu:  1
    Requests:
      cpu:             3500m
      memory:          15032385536
      nvidia.com/gpu:  1
    Environment:
      JUPYTERHUB_API_TOKEN:           XXXXXXX
      JPY_API_TOKEN:                  XXXXXXX
      JUPYTERHUB_ADMIN_ACCESS:        1
      JUPYTERHUB_CLIENT_ID:           jupyterhub-user-jhamman
      JUPYTERHUB_HOST:
      JUPYTERHUB_OAUTH_CALLBACK_URL:  /user/jhamman/oauth_callback
      JUPYTERHUB_USER:                jhamman
      JUPYTERHUB_SERVER_NAME:
      JUPYTERHUB_API_URL:             http://10.4.10.184:8081/hub/api
      JUPYTERHUB_ACTIVITY_URL:        http://10.4.10.184:8081/hub/api/users/jhamman/activity
      JUPYTERHUB_BASE_URL:            /
      JUPYTERHUB_SERVICE_PREFIX:      /user/jhamman/
      MEM_LIMIT:                      16106127360
      MEM_GUARANTEE:                  15032385536
      CPU_LIMIT:                      4.0
      CPU_GUARANTEE:                  3.5
      JUPYTER_IMAGE_SPEC:             pangeo/ml-notebook:latest
      JUPYTER_IMAGE:                  pangeo/ml-notebook:latest
    Mounts:
      /home/jovyan from home (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from daskkubernetes-token-npp9q (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  home:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  home-nfs
    ReadOnly:   false
  daskkubernetes-token-npp9q:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  daskkubernetes-token-npp9q
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     hub.jupyter.org/dedicated=user:NoSchedule
                 hub.jupyter.org_dedicated=user:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
                 nvidia.com/gpu:NoSchedule
Events:
  Type     Reason             Age                     From                       Message
  ----     ------             ----                    ----                       -------
  Warning  FailedScheduling   2m16s (x25 over 2m46s)  hydro-prod-user-scheduler  0/7 nodes are available: 1 Insufficient memory, 2 Insufficient nvidia.com/gpu, 5 node(s) didn't match node selector.
  Normal   NotTriggerScaleUp  119s                    cluster-autoscaler         pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 Insufficient memory, 3 Insufficient nvidia.com/gpu, 1 Insufficient cpu
  Normal   NotTriggerScaleUp  43s (x4 over 98s)       cluster-autoscaler         pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 Insufficient cpu, 1 Insufficient memory, 3 Insufficient nvidia.com/gpu
  Normal   NotTriggerScaleUp  10s (x7 over 2m10s)     cluster-autoscaler         pod didn't trigger scale-up (it wouldn't fit if a new node is added): 3 Insufficient nvidia.com/gpu, 1 Insufficient cpu, 1 Insufficient memory

I should note that I also configured the nvidia-driver daemonset following @consideRatio's step 5 here: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/994.
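
For context, that step amounts to deploying an NVIDIA driver-installer DaemonSet for Container-Optimized OS nodes; the manifest in the linked guide may differ, but the GKE-documented equivalent is roughly:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml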

It seems to me the pod is not scheduling due to some Kubernetes constraint. I'm not sure what this could be, but perhaps I'm missing something simple.

cc @jsadler2, @rsignell-usgs

jhamman avatar Oct 02 '19 16:10 jhamman

@jhamman things I'd investigate:

  1. Logs of the daemonset to install the drivers
  2. Is the nvidia driver installer daemonset built for the node image type (Container-Optimized OS from Google), or is it for Ubuntu? I recall there were two options.
  3. Can we verify that ...

Oh...

It's a memory issue, I think:

# Node allocatable memory
memory:          12700100Ki
# Pod request of memory
memory:          15032385536
# Pod schedule failure message
0/7 nodes are available: 1 Insufficient memory, 2 Insufficient nvidia.com/gpu, 5 node(s) didn't match node selector.
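
In concrete terms: the node's allocatable memory is 12700100Ki ≈ 12.1 GiB, while the pod asks for a 14 GiB guarantee (15032385536 bytes), so it can never fit on an n1-standard-4. A sketch of the kind of adjustment that would let it fit, assuming the guarantee comes from the profile's kubespawner_override:

mem_guarantee: 11G   # must stay below the node's ~12.1 GiB allocatable
mem_limit: 12G       # or move the profile to a larger machine type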

consideRatio avatar Oct 02 '19 19:10 consideRatio

Looking at the node output it doesn't seem to be aware that it has nvidia.com/gpu resources. This is likely a driver install issue.

The drivers section of this page may be helpful.
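
A quick way to check this, using the custom-columns query from the Kubernetes GPU scheduling docs (once the driver installer and device plugin have done their job, the resource should appear under the node's Capacity/Allocatable):

$ kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
NAME                                                  GPU
gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754   <none>

A working node would report 1 here instead of <none>.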

jacobtomlinson avatar Oct 03 '19 08:10 jacobtomlinson

It could also be two issues. To complement what @jacobtomlinson describes: I would also expect an nvidia.com/gpu resource to be listed here, but there wasn't one:

Capacity:
 attachable-volumes-gce-pd:  128
 cpu:                        4
 ephemeral-storage:          98868448Ki
 hugepages-2Mi:              0
 memory:                     15399364Ki
 pods:                       110
Allocatable:
 attachable-volumes-gce-pd:  128
 cpu:                        3920m
 ephemeral-storage:          47093746742
 hugepages-2Mi:              0
 memory:                     12700100Ki
 pods:                       110

consideRatio avatar Oct 03 '19 14:10 consideRatio

Okay, I'm pretty sure this is a driver issue. I've installed the daemonset as described in jupyterhub/zero-to-jupyterhub-k8s#994 but something has gone seriously wrong:

$ kubectl logs -n kube-system ds/nvidia-driver-installer -c nvidia-driver-installer -f
...
[INFO    2019-10-03 16:52:00 UTC] Modifying kernel version magic string in source files
/
[INFO    2019-10-03 16:52:00 UTC] Running Nvidia installer
/usr/local/nvidia /
[INFO    2019-10-03 16:52:00 UTC] Downloading Nvidia installer from https://us.download.nvidia.com/...
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 384.145 [...]

ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most
       frequently when this kernel module was built against the wrong or
       improperly configured kernel sources, with a version of gcc that
       differs from the one used to build the target kernel, or if a driver
       such as rivafb, nvidiafb, or nouveau is present and prevents the
       NVIDIA kernel module from obtaining ownership of the NVIDIA graphics
       device(s), or no NVIDIA GPU installed in this system is supported by
       this NVIDIA Linux graphics driver release.

       Please see the log entries 'Kernel module load error' and 'Kernel
       messages' at the end of the file
       '/usr/local/nvidia/nvidia-installer.log' for more information.


ERROR: Installation has failed.  Please see the file
       '/usr/local/nvidia/nvidia-installer.log' for details.  You may find
       suggestions on fixing installation problems in the README available
       on the Linux driver download page at www.nvidia.com.

jhamman avatar Oct 03 '19 18:10 jhamman

I note you are using version 384.145. I think what's wrong is a driver version incompatibility, since you are using nvidia-tesla-t4 while I wrote the initial guide to work with NVIDIA K80 GPUs.

       [...] or no NVIDIA GPU installed in this system is supported by
       this NVIDIA Linux graphics driver release.

Inspecting this: https://docs.nvidia.com/deploy/cuda-compatibility/index.html, and knowing that the Tesla T4 is apparently in the Turing "Hardware Generation", I conclude you need a different driver. I'd try 418.39 or later, and then also pin cudatoolkit 10.1 instead of 9.0.

Try:

kubectl patch daemonset -n kube-system nvidia-driver-installer --patch '{"spec":{"template":{"spec":{"initContainers":[{"name":"nvidia-driver-installer","env":[{"name":"NVIDIA_DRIVER_VERSION","value":"418.39"}]}]}}}}'
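
As a sanity check after patching (the DaemonSet should roll its pods with the new env; if it doesn't, delete the installer pods so they restart), re-check the same logs as above to confirm the new driver version is being downloaded:

kubectl rollout status -n kube-system daemonset/nvidia-driver-installer
kubectl logs -n kube-system ds/nvidia-driver-installer -c nvidia-driver-installer | grep "Downloading Nvidia installer"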

consideRatio avatar Oct 03 '19 18:10 consideRatio

@jhamman and @yuvipanda - I think this got resolved last week, correct?

scottyhq avatar Dec 23 '19 06:12 scottyhq

As I could not find the potential solution mentioned in the previous comment, I am posting my own experience here, as it may help others.

I am using AWS and their AMI dedicated to GPU support, so the drivers are already installed in the image. Still, the issue with the missing nvidia.com/gpu resource was the same as quoted below.

(...) I would also expect an nvidia.com/gpu resource to be listed here, but there wasn't one:

Capacity:
 attachable-volumes-gce-pd:  128
 cpu:                        4
 ephemeral-storage:          98868448Ki
 hugepages-2Mi:              0
 memory:                     15399364Ki
 pods:                       110
Allocatable:
 attachable-volumes-gce-pd:  128
 cpu:                        3920m
 ephemeral-storage:          47093746742
 hugepages-2Mi:              0
 memory:                     12700100Ki
 pods:                       110

And it seems to be related to the user-notebook labels, taints and tags I was adding to my GPU node group:

labels:
  nvidia.com/gpu: present
  # hub.jupyter.org/node-purpose: user
taints:
  nvidia.com/gpu: "present:NoSchedule"
  # hub.jupyter.org/dedicated: "user:NoSchedule"
tags:
  k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu: present
  k8s.io/cluster-autoscaler/node-template/taint/nvidia.com/gpu: "present:NoSchedule"
  # k8s.io/cluster-autoscaler/node-template/label/hub.jupyter.org/node-purpose: user
  # k8s.io/cluster-autoscaler/node-template/taint/hub.jupyter.org/dedicated: "user:NoSchedule"

After commenting out the user-notebook-related lines (as shown above), the nvidia.com/gpu resource is finally listed (though I am not sure why this happens).

Unfortunately, this led me to another issue. So far I had been forcing the userPods and corePods to match the node purpose (nodeAffinity: matchNodePurpose: require), and after disabling the related labels for the user notebooks, I was no longer able to select the GPU instance from my JupyterHub profile list.

After rolling back the matchNodePurpose settings to "prefer", every second helm upgrade I run leaves the continuous-image-puller pods stuck, and my hub pod as well. When that happens, I have to set them back to "require", run a helm upgrade, and then I can switch them back to "prefer".

I am not sure if I am doing something wrong - but this is my current workaround to get the GPU instance running for now. If you have any ideas or suggestions on this scenario, please let me know.
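
For reference, the settings I am toggling are the zero-to-jupyterhub scheduling knobs, roughly as below (the exact nesting depends on how the chart is wrapped in your deployment):

scheduling:
  corePods:
    nodeAffinity:
      matchNodePurpose: prefer   # the "require" <-> "prefer" switch described above
  userPods:
    nodeAffinity:
      matchNodePurpose: prefer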

PS: I would also like to thank the whole pangeo and jupyterhub community for all the efforts towards the k8s solutions. All the documentation and the GitHub issues have been extremely helpful for me.

GuilhermeZimeo avatar Jan 27 '20 13:01 GuilhermeZimeo