pangeo-cloud-federation
Adding GPUs
I've started down the GPU rabbit hole. My initial efforts are focused on getting GPUs added to the GCP hubs (hydro/ocean). Things have not worked on the first try, so I'm hoping to solicit some help from a few folks (@jacobtomlinson and @consideRatio) who have more experience in this area. I'll walk through the steps I've taken and then explain where things aren't working.
Add a node pool to our GKE cluster (dev-pangeo-io)
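# ${cluster_name} and ${zone} are assumed to be set already for the target cluster
# (dev-pangeo-io, zone us-central1-b per the node labels further down)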
jupyter_machine_type="n1-standard-4"
jupyter_taints="hub.jupyter.org_dedicated=user:NoSchedule"
jupyter_labels="hub.jupyter.org/node-purpose=user"
gcloud container node-pools create jupyter-gpu-pool \
--cluster=${cluster_name} \
--machine-type=${jupyter_machine_type} \
--disk-type=pd-ssd \
--zone=${zone} \
--num-nodes=1 \
--node-taints ${jupyter_taints} \
--node-labels ${jupyter_labels} \
--accelerator type=nvidia-tesla-t4,count=1
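To double-check that the pool came up with the expected labels and taints, the new node can be inspected via its node-pool label (assuming kubectl is pointed at dev-pangeo-io):
kubectl get nodes -l cloud.google.com/gke-nodepool=jupyter-gpu-pool
kubectl describe node -l cloud.google.com/gke-nodepool=jupyter-gpu-pool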
This results in a node that has this configuration:
Name: gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/fluentd-ds-ready=true
beta.kubernetes.io/instance-type=n1-standard-4
beta.kubernetes.io/os=linux
cloud.google.com/gke-accelerator=nvidia-tesla-t4
cloud.google.com/gke-nodepool=jupyter-gpu-pool
cloud.google.com/gke-os-distribution=cos
failure-domain.beta.kubernetes.io/region=us-central1
failure-domain.beta.kubernetes.io/zone=us-central1-b
hub.jupyter.org/node-purpose=user
kubernetes.io/hostname=gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754
Annotations: container.googleapis.com/instance_id: 5059798487109562556
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Wed, 02 Oct 2019 09:21:47 -0700
Taints: hub.jupyter.org_dedicated=user:NoSchedule
nvidia.com/gpu=present:NoSchedule
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
FrequentUnregisterNetDevice False Wed, 02 Oct 2019 09:23:15 -0700 Wed, 02 Oct 2019 09:22:13 -0700 NoFrequentUnregisterNetDevice node is functioning properly
FrequentKubeletRestart False Wed, 02 Oct 2019 09:23:15 -0700 Wed, 02 Oct 2019 09:22:13 -0700 NoFrequentKubeletRestart kubelet is functioning properly
FrequentDockerRestart False Wed, 02 Oct 2019 09:23:15 -0700 Wed, 02 Oct 2019 09:22:13 -0700 NoFrequentDockerRestart docker is functioning properly
FrequentContainerdRestart False Wed, 02 Oct 2019 09:23:15 -0700 Wed, 02 Oct 2019 09:22:13 -0700 NoFrequentContainerdRestart containerd is functioning properly
KernelDeadlock False Wed, 02 Oct 2019 09:23:15 -0700 Wed, 02 Oct 2019 09:22:13 -0700 KernelHasNoDeadlock kernel has no deadlock
ReadonlyFilesystem False Wed, 02 Oct 2019 09:23:15 -0700 Wed, 02 Oct 2019 09:22:13 -0700 FilesystemIsNotReadOnly Filesystem is not read-only
CorruptDockerOverlay2 False Wed, 02 Oct 2019 09:23:15 -0700 Wed, 02 Oct 2019 09:22:13 -0700 NoCorruptDockerOverlay2 docker overlay2 is functioning properly
NetworkUnavailable False Wed, 02 Oct 2019 09:21:48 -0700 Wed, 02 Oct 2019 09:21:48 -0700 RouteCreated NodeController create implicit route
OutOfDisk False Wed, 02 Oct 2019 09:24:03 -0700 Wed, 02 Oct 2019 09:21:47 -0700 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Wed, 02 Oct 2019 09:24:03 -0700 Wed, 02 Oct 2019 09:21:47 -0700 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 02 Oct 2019 09:24:03 -0700 Wed, 02 Oct 2019 09:21:47 -0700 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Wed, 02 Oct 2019 09:24:03 -0700 Wed, 02 Oct 2019 09:21:47 -0700 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Wed, 02 Oct 2019 09:24:03 -0700 Wed, 02 Oct 2019 09:22:13 -0700 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.128.0.58
ExternalIP: 35.226.29.63
InternalDNS: gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754.c.pangeo-181919.internal
Hostname: gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754.c.pangeo-181919.internal
Capacity:
attachable-volumes-gce-pd: 128
cpu: 4
ephemeral-storage: 98868448Ki
hugepages-2Mi: 0
memory: 15399364Ki
pods: 110
Allocatable:
attachable-volumes-gce-pd: 128
cpu: 3920m
ephemeral-storage: 47093746742
hugepages-2Mi: 0
memory: 12700100Ki
pods: 110
System Info:
Machine ID: 3d11ab74f52e4c93aefe3d70748c365e
System UUID: 3D11AB74-F52E-4C93-AEFE-3D70748C365E
Boot ID: 9b143f16-184b-480c-b310-170ca2cae575
Kernel Version: 4.14.127+
OS Image: Container-Optimized OS from Google
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://17.3.2
Kubelet Version: v1.12.8-gke.10
Kube-Proxy Version: v1.12.8-gke.10
PodCIDR: 10.32.2.0/24
ProviderID: gce://pangeo-181919/us-central1-b/gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754
Non-terminated Pods: (4 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits
--------- ---- ------------ ---------- --------------- -------------
kube-system fluentd-gcp-v3.1.1-9d5nq 100m (2%) 1 (25%) 200Mi (1%) 500Mi (4%)
kube-system kube-proxy-gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 100m (2%) 0 (0%) 0 (0%) 0 (0%)
kube-system nvidia-driver-installer-c6r9m 150m (3%) 0 (0%) 0 (0%) 0 (0%)
kube-system nvidia-gpu-device-plugin-gpbf7 50m (1%) 50m (1%) 10Mi (0%) 10Mi (0%)
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 400m (10%) 1050m (26%)
memory 210Mi (1%) 510Mi (4%)
attachable-volumes-gce-pd 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 2m24s kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 Starting kubelet.
Normal NodeHasSufficientDisk 2m24s kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 Node gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 status is now: NodeHasSufficientDisk
Normal NodeHasSufficientMemory 2m24s kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 Node gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 2m24s kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 Node gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 2m24s kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 Node gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 2m24s kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 Updated Node Allocatable limit across pods
Normal NodeReady 2m23s kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 Node gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 status is now: NodeReady
Normal Starting 2m22s kube-proxy, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 Starting kube-proxy.
Normal NodeHasSufficientDisk 118s kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 Node gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 status is now: NodeHasSufficientDisk
Normal Starting 118s kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 Starting kubelet.
Normal NodeHasSufficientMemory 118s kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 Node gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 118s kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 Node gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 118s kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 Node gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 status is now: NodeHasSufficientPID
Warning Rebooted 118s kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 Node gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 has been rebooted, boot id: 9b143f16-184b-480c-b310-170ca2cae575
Normal NodeNotReady 118s kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 Node gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 status is now: NodeNotReady
Normal NodeAllocatableEnforced 118s kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 Updated Node Allocatable limit across pods
Normal NodeReady 118s kubelet, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 Node gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 status is now: NodeReady
Normal Starting 116s kube-proxy, gke-dev-pangeo-io-cl-jupyter-gpu-pool-026fa47a-q754 Starting kube-proxy.
Configure a new single user profile to use this GPU pool:
https://github.com/pangeo-data/pangeo-cloud-federation/blob/f9610e72da39bfcbebdb1645b201005bc0467e7d/deployments/hydro/config/common.yaml#L88-L96
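Roughly, the profile entry looks like the sketch below (a rough sketch using the zero-to-jupyterhub/KubeSpawner profileList fields; the exact values are in the linked common.yaml):
singleuser:
  profileList:
    - display_name: "GPU machine learning notebook"
      description: "n1-standard-4 with one NVIDIA Tesla T4"
      kubespawner_override:
        image: pangeo/ml-notebook:latest
        extra_resource_limits:
          nvidia.com/gpu: "1"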
Try to log into hydro.pangeo.io using this new profile
This results in a pending pod that is NOT scheduled and looks like this:
Name: jupyter-jhamman
Namespace: hydro-prod
Priority: 0
PriorityClassName: hydro-prod-default-priority
Node: <none>
Labels: app=jupyterhub
chart=jupyterhub-0.9-4300ff5
component=singleuser-server
heritage=jupyterhub
hub.jupyter.org/network-access-hub=true
release=hydro-prod
Annotations: hub.jupyter.org/username: jhamman
Status: Pending
IP:
Init Containers:
volume-mount-hack:
Image: busybox
Port: <none>
Host Port: <none>
Command:
sh
-c
id && chown 1000:1000 /home/jovyan && ls -lhd /home/jovyan
Environment: <none>
Mounts:
/home/jovyan from home (rw)
/var/run/secrets/kubernetes.io/serviceaccount from daskkubernetes-token-npp9q (ro)
Containers:
notebook:
Image: pangeo/ml-notebook:latest
Port: 8888/TCP
Host Port: 0/TCP
Args:
jupyterhub-singleuser
--ip=0.0.0.0
--port=8888
--NotebookApp.default_url=/lab
Limits:
cpu: 4
memory: 16106127360
nvidia.com/gpu: 1
Requests:
cpu: 3500m
memory: 15032385536
nvidia.com/gpu: 1
Environment:
JUPYTERHUB_API_TOKEN: XXXXXXX
JPY_API_TOKEN: XXXXXXX
JUPYTERHUB_ADMIN_ACCESS: 1
JUPYTERHUB_CLIENT_ID: jupyterhub-user-jhamman
JUPYTERHUB_HOST:
JUPYTERHUB_OAUTH_CALLBACK_URL: /user/jhamman/oauth_callback
JUPYTERHUB_USER: jhamman
JUPYTERHUB_SERVER_NAME:
JUPYTERHUB_API_URL: http://10.4.10.184:8081/hub/api
JUPYTERHUB_ACTIVITY_URL: http://10.4.10.184:8081/hub/api/users/jhamman/activity
JUPYTERHUB_BASE_URL: /
JUPYTERHUB_SERVICE_PREFIX: /user/jhamman/
MEM_LIMIT: 16106127360
MEM_GUARANTEE: 15032385536
CPU_LIMIT: 4.0
CPU_GUARANTEE: 3.5
JUPYTER_IMAGE_SPEC: pangeo/ml-notebook:latest
JUPYTER_IMAGE: pangeo/ml-notebook:latest
Mounts:
/home/jovyan from home (rw)
/var/run/secrets/kubernetes.io/serviceaccount from daskkubernetes-token-npp9q (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
home:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: home-nfs
ReadOnly: false
daskkubernetes-token-npp9q:
Type: Secret (a volume populated by a Secret)
SecretName: daskkubernetes-token-npp9q
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: hub.jupyter.org/dedicated=user:NoSchedule
hub.jupyter.org_dedicated=user:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
nvidia.com/gpu:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 2m16s (x25 over 2m46s) hydro-prod-user-scheduler 0/7 nodes are available: 1 Insufficient memory, 2 Insufficient nvidia.com/gpu, 5 node(s) didn't match node selector.
Normal NotTriggerScaleUp 119s cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 Insufficient memory, 3 Insufficient nvidia.com/gpu, 1 Insufficient cpu
Normal NotTriggerScaleUp 43s (x4 over 98s) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 Insufficient cpu, 1 Insufficient memory, 3 Insufficient nvidia.com/gpu
Normal NotTriggerScaleUp 10s (x7 over 2m10s) cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 3 Insufficient nvidia.com/gpu, 1 Insufficient cpu, 1 Insufficient memory
I should note that I also configured the nvidia-driver daemonset following @consideRatio's step 5 here: https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/994.
It seems to me the pod is not scheduling due to some kubernetes constraint. I'm not sure what this could be, but perhaps I'm missing something simple.
cc @jsadler2, @rsignell-usgs
@jhamman things I'd investigate:
- Logs of the daemonset that installs the drivers (see the command sketch after this list)
- Is the nvidia driver installer daemonset made for the node image type (Container-Optimized OS from Google), or is it for Ubuntu? I recall there were two options.
- Can we verify that ...
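For the first point, something like this should surface the installer logs (daemonset name as used in the z2jh guide linked above):
kubectl get daemonset -n kube-system nvidia-driver-installer
kubectl logs -n kube-system ds/nvidia-driver-installer -c nvidia-driver-installer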
Oh...
It's a memory issue, I think:
# Node allocatable memory (12700100Ki, ~12.1 GiB)
memory: 12700100Ki
# Pod memory request (15032385536 bytes = 14 GiB)
memory: 15032385536
# Pod schedule failure message
0/7 nodes are available: 1 Insufficient memory, 2 Insufficient nvidia.com/gpu, 5 node(s) didn't match node selector.
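A 14 GiB guarantee cannot fit on an n1-standard-4's ~12.1 GiB allocatable, so either the profile's memory guarantee has to drop below the allocatable, or the GPU pool needs a larger machine type. For the former, something like this in the profile's kubespawner_override could work (hypothetical values, chosen only to fit under the node's allocatable):
# hypothetical override values for an n1-standard-4 node
kubespawner_override:
  mem_guarantee: 11G
  mem_limit: 14G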
Looking at the node output it doesn't seem to be aware that it has nvidia.com/gpu resources. This is likely a driver install issue.
The drivers section of this page may be helpful.
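For GKE nodes running Container-Optimized OS, the drivers section amounts to applying Google's driver-installer DaemonSet, roughly (check the current docs for the exact manifest):
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml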
It could also be two separate issues. To complement what @jacobtomlinson describes: I would also expect an nvidia.com/gpu resource to be listed on the node, but there wasn't one.
Capacity:
attachable-volumes-gce-pd: 128
cpu: 4
ephemeral-storage: 98868448Ki
hugepages-2Mi: 0
memory: 15399364Ki
pods: 110
Allocatable:
attachable-volumes-gce-pd: 128
cpu: 3920m
ephemeral-storage: 47093746742
hugepages-2Mi: 0
memory: 12700100Ki
pods: 110
Okay, I'm pretty sure this is a driver issue. I've installed the daemonset as described in jupyterhub/zero-to-jupyterhub-k8s#994, but something has gone seriously wrong:
$ kubectl logs -n kube-system ds/nvidia-driver-installer -c nvidia-driver-installer -f
...
[INFO 2019-10-03 16:52:00 UTC] Modifying kernel version magic string in source files
/
[INFO 2019-10-03 16:52:00 UTC] Running Nvidia installer
/usr/local/nvidia /
[INFO 2019-10-03 16:52:00 UTC] Downloading Nvidia installer from https://us.download.nvidia.com/...
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 384.145..........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
ERROR: Unable to load the kernel module 'nvidia.ko'. This happens most
frequently when this kernel module was built against the wrong or
improperly configured kernel sources, with a version of gcc that
differs from the one used to build the target kernel, or if a driver
such as rivafb, nvidiafb, or nouveau is present and prevents the
NVIDIA kernel module from obtaining ownership of the NVIDIA graphics
device(s), or no NVIDIA GPU installed in this system is supported by
this NVIDIA Linux graphics driver release.
Please see the log entries 'Kernel module load error' and 'Kernel
messages' at the end of the file
'/usr/local/nvidia/nvidia-installer.log' for more information.
ERROR: Installation has failed. Please see the file
'/usr/local/nvidia/nvidia-installer.log' for details. You may find
suggestions on fixing installation problems in the README available
on the Linux driver download page at www.nvidia.com.
I note you are using driver version 384.145. I think what's wrong is a driver version incompatibility: you are using nvidia-tesla-t4, while I wrote the initial guide to work with NVIDIA K80 GPUs.
[...] or no NVIDIA GPU installed in this system is supported by
this NVIDIA Linux graphics driver release.
Inspecting https://docs.nvidia.com/deploy/cuda-compatibility/index.html, and knowing that the Tesla T4 is apparently in the Turing "Hardware Generation", I conclude you need a different driver. I'd try 418.39+ and, later, also pin cudatoolkit 10.1 instead of 9.0.
Try:
kubectl patch daemonset -n kube-system nvidia-driver-installer --patch '{"spec":{"template":{"spec":{"initContainers":[{"name":"nvidia-driver-installer","env":[{"name":"NVIDIA_DRIVER_VERSION","value":"418.39"}]}]}}}}'
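Once the patched daemonset rolls out, re-check the installer logs; if the newer driver works, the node's Capacity/Allocatable should gain an nvidia.com/gpu entry (roughly, using the node-pool label from earlier):
kubectl logs -n kube-system ds/nvidia-driver-installer -c nvidia-driver-installer -f
kubectl describe node -l cloud.google.com/gke-nodepool=jupyter-gpu-pool | grep -A 8 "Capacity:"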
@jhamman and @yuvipanda - I think this got resolved last week, correct?
As I could not find the potential solution mentioned in the previous comment, I am posting my own experience here, as it may help others.
I am using AWS and their AMI dedicated to GPU support, so the drivers are already installed in the image. Still, the issue with the missing nvidia.com/gpu resource was the same as quoted below.
(...) I would also think there would be a nvidia.com/gpu resource listed, but there wasn't.
Capacity:
attachable-volumes-gce-pd: 128
cpu: 4
ephemeral-storage: 98868448Ki
hugepages-2Mi: 0
memory: 15399364Ki
pods: 110
Allocatable:
attachable-volumes-gce-pd: 128
cpu: 3920m
ephemeral-storage: 47093746742
hugepages-2Mi: 0
memory: 12700100Ki
pods: 110
And it seems to be related to the user-notebook labels, taints and tags I was adding to my GPU node group:
labels:
nvidia.com/gpu: present
# hub.jupyter.org/node-purpose: user
taints:
nvidia.com/gpu: "present:NoSchedule"
# hub.jupyter.org/dedicated: "user:NoSchedule"
tags:
k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu: present
k8s.io/cluster-autoscaler/node-template/taint/nvidia.com/gpu: "present:NoSchedule"
# k8s.io/cluster-autoscaler/node-template/label/hub.jupyter.org/node-purpose: user
# k8s.io/cluster-autoscaler/node-template/taint/hub.jupyter.org/dedicated: "user:NoSchedule"
By commenting out the user-notebook-related lines (as shown above), the nvidia.com/gpu resource was finally listed, though I am not sure why this happens. (One plausible explanation, not verified here: the nvidia.com/gpu resource is advertised by the NVIDIA device-plugin DaemonSet, and if the node carries a taint that the DaemonSet does not tolerate, the plugin never runs there and the resource never appears.)
Unfortunately, this led me to another issue. So far I have been forcing the userPods and corePods to match the node purpose (nodeAffinity: matchNodePurpose: require), and with the user-notebook labels disabled, I was no longer able to select the GPU instance from my JupyterHub profile list.
After rolling the matchNodePurpose settings back to "prefer", every other helm upgrade I run leaves the continuous-image-puller pods stuck, and my hub pod as well. When that happens, I have to set them back to "require", run the upgrade, and only then can I switch them back to "prefer".
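For reference, these are the zero-to-jupyterhub values I am toggling (a sketch; field paths follow the chart's scheduling section):
# "require" pins pods to node-purpose-labelled nodes; "prefer" only weights the scheduler toward them
scheduling:
  userPods:
    nodeAffinity:
      matchNodePurpose: require   # or "prefer"
  corePods:
    nodeAffinity:
      matchNodePurpose: require   # or "prefer"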
I am not sure if I am doing something wrong, but this is my current workaround to get the GPU instance running for now. If you have any ideas or suggestions for this scenario, please let me know.
PS: I would also like to thank the whole Pangeo and JupyterHub community for all the effort towards these k8s solutions. The documentation and GitHub issues have been extremely helpful to me.