                        WIP: A deployment story - Using GPUs on GKE
GPU powered machine learning on GKE
This is what I've done to enable GPUs on GKE. Note that this post is a work in progress and will be edited from time to time. To see when the last edit was made, see the header of this post.
Prerequisite knowledge
Kubernetes nodes, pods and daemonsets
A node represents actual hardware on the cloud, a pod represents something running on a node, and a daemonset will ensure one pod running something is created for each node. If you lack knowledge about Kubernetes, I'd recommend learning more at their concepts page.
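For example, with kubectl configured against your cluster, you can list these objects like this:
# List the nodes (machines) in the cluster
kubectl get nodes
# List pods in all namespaces, including which node each pod runs on
kubectl get pods --all-namespaces -o wide
# List daemonsets in the kube-system namespace
kubectl get daemonsets -n kube-system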
Bonus knowledge:
This video provides background that allows you to understand why additional steps are required for this to work: https://www.youtube.com/watch?v=KplFFvj3XRk
NOTE regarding taints: GPU nodes on GKE will get them, and pods requesting GPUs will get matching tolerations, without any additional setup.
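If you want to see this for yourself once a GPU node exists, something like the following should show it. I believe the taint on GKE GPU nodes is nvidia.com/gpu=present:NoSchedule, but verify on your own cluster; the node and pod names below are placeholders, and jhub is the namespace JupyterHub is deployed in.
# Show the taints on a GPU node
kubectl describe node the-name-of-the-gpu-node | grep -A 2 Taints
# Show the tolerations a GPU-requesting user pod received automatically
kubectl get pod -n jhub jupyter-someuser -o jsonpath='{.spec.tolerations}'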
1. GKE Kubernetes cluster on a GPU enabled zone
Google has various zones (datacenters), and some do not have GPUs. First you must have a GKE cluster coupled with a zone that has GPU access. To find out which zones have GPUs and what kinds of GPUs they offer, see this page. In overall performance and cost, K80 < P100 < V100. Note that there are also TPUs and that their availability is also zone dependent. This documentation will not address utilizing TPUs though.
Note that GKE Kubernetes clusters come pre-installed with some parts needed for GPUs to be utilized (see the verification commands below):
- A daemonset in your Kubernetes cluster called nvidia-gpu-device-plugin. I don't fully know what this does yet.
- A custom resource controller plugin, enabled by default, that will properly handle extra resource requests such as nvidia.com/gpu: 1.
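You can sanity check that these parts are in place; the exact daemonset name could differ between GKE versions, and the node name below is a placeholder.
# Verify the pre-installed device plugin daemonset exists
kubectl get daemonset -n kube-system nvidia-gpu-device-plugin
# Once a GPU node is up and drivers are installed, the node should advertise
# an allocatable nvidia.com/gpu resource
kubectl describe node the-name-of-the-gpu-node | grep nvidia.com/gpu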
2. JupyterHub installation
This documentation assumes you have already deployed JupyterHub on your Kubernetes cluster by following the https://z2jh.jupyter.org guide.
3. Docker image for the JupyterHub users
I built an image for a basic Hello World with GPU-enabled TensorFlow. If you are fine with utilizing this image, you don't need to do anything further. My image is available as consideratio/singleuser-gpu:v0.3.0.
About the Dockerfile
I build on top of a jupyter/docker-stacks image so that JupyterHub integrates well with it. I also pinned cudatoolkit=9.0; it is a dependency of tensorflow-gpu, but without the pin an even newer version would be installed that is unsupported by the GPUs I'm aiming to use, namely the Tesla K80 and Tesla P100. To learn more about these compatibility issues see: https://docs.anaconda.com/anaconda/user-guide/tasks/gpu-packages/
Dockerfile reference
NOTE: To make this image run without a GPU available, you must still install an NVIDIA driver. This can be done using apt-get install nvidia-384, but if you do, it must not conflict with the nvidia-driver-installer daemonset described later, which as far as I know still needs to run, sadly. This is a rabbit hole and hard to maintain, I think.
# For the latest tag, see: https://hub.docker.com/r/jupyter/datascience-notebook/tags/
FROM jupyter/datascience-notebook:f2889d7ae7d6
# GPU powered ML
# ----------------------------------------
RUN conda install -c conda-forge --yes --quiet \
    tensorflow-gpu \
    cudatoolkit=9.0 && \
    conda clean -tipsy && \
    fix-permissions $CONDA_DIR && \
    fix-permissions /home/$NB_USER
# Allow drivers installed by the nvidia-driver-installer to be located
ENV LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/nvidia/lib64
# Also, utilities like `nvidia-smi` are installed here
ENV PATH=${PATH}:/usr/local/nvidia/bin
# To build and push a Dockerfile (in current working directory) to dockerhub under your username
DOCKER_USERNAME=my-docker-username
TAG=v0.3.0
docker build --tag ${DOCKER_USERNAME}/singleuser-gpu:${TAG} . && docker push ${DOCKER_USERNAME}/singleuser-gpu:${TAG}
3B. Create an image using repo2docker (WIP)
https://github.com/jupyterhub/team-compass/issues/96#issuecomment-447033166
4. Create a GPU node pool
Create a new node pool for your Kubernetes cluster. I chose an n1-highmem-2 node with a Tesla K80 GPU. These instructions are written and tested for the K80 and P100.
Note that there is an issue with autoscaling from 0 nodes, and that scaling up a GPU node is slow: it needs to start, install drivers, and pull the image, and each step takes quite a while. I'm expecting 5-10 minutes of startup time for this. I recommend you start out with a single fixed node while setting this up initially.
For details on how to setup a node pool with attached GPUs on the nodes, see: https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#create
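For reference, creating such a node pool with gcloud could look something like this; the cluster name, zone and pool name below are placeholders, so adapt them to your setup.
# Create a GPU node pool with a single n1-highmem-2 node and one Tesla K80
# attached (my-cluster, europe-west1-b and user-k80 are placeholders)
gcloud container node-pools create user-k80 \
  --cluster my-cluster \
  --zone europe-west1-b \
  --machine-type n1-highmem-2 \
  --accelerator type=nvidia-tesla-k80,count=1 \
  --num-nodes 1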
5. Daemonset: nvidia-driver-installer
You need to make sure the GPU nodes get appropriate drivers installed. This is what the nvidia-driver-installer daemonset will do for you! It will install drivers and utilities in /usr/local/nvidia, which is required for, for example, the conda package tensorflow-gpu to function properly.
NOTE: TensorFlow has a pinned dependency on cudatoolkit, and a given cudatoolkit requires a minimum NVIDIA driver version. For example, tensorflow=1.11 and tensorflow=1.12 require cudatoolkit=9.0 while tensorflow=1.13 will require cudatoolkit=10.0; cudatoolkit=9.0 requires an NVIDIA driver of at least version 384.81 and cudatoolkit=10.0 requires an NVIDIA driver of at least version 410.48.
# To install the daemonset:
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/daemonset.yaml
# Verify that it is installed; you should get output from:
kubectl get -n kube-system ds/nvidia-driver-installer
# To verify the daemonset pods are successfully installing drivers on the node(s)
# 1. See that there is one pod per GPU node running or attempting to run
kubectl get pods -n kube-system | grep nvidia-driver-installer
# 2. Print the relevant logs of the nvidia-driver-installer pods
kubectl logs -n kube-system ds/nvidia-driver-installer -c nvidia-driver-installer
# 3. Verify it ends with something like this:
WARNING: nvidia-installer was forced to guess the X library path '/usr/lib'
         and X module path '/usr/lib/xorg/modules'; these paths were not
         queryable from the system.  If X fails to find the NVIDIA X driver
         module, please install the `pkg-config` utility and the X.Org
         SDK/development package for your distribution and reinstall the
         driver.
/
[INFO    2018-11-12 08:05:35 UTC] Updated cached version as:
CACHE_BUILD_ID=10895.52.0
CACHE_NVIDIA_DRIVER_VERSION=396.26
[INFO    2018-11-12 08:05:35 UTC] Verifying Nvidia installation
Mon Nov 12 08:05:37 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   56C    P0    86W / 149W |      0MiB / 11441MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[INFO    2018-11-12 08:05:38 UTC] Finished installing the drivers.
[INFO    2018-11-12 08:05:38 UTC] Updating host's ld cache
Set a driver version for the nvidia-driver-installer daemonset to install
As of writing, the default driver for the daemonset above is 396.26. I struggled with installing that version without this daemonset, so I ended up using 384.145 instead.
Option 1: Use a one liner
kubectl patch daemonset -n kube-system nvidia-driver-installer --patch '{"spec":{"template":{"spec":{"initContainers":[{"name":"nvidia-driver-installer","env":[{"name":"NVIDIA_DRIVER_VERSION","value":"384.145"}]}]}}}}'
Option 2: manually edit the daemonset manifest...
kubectl edit daemonset -n kube-system nvidia-driver-installer
# ... and then add the following entries to the init container's `env` (`spec.template.spec.initContainers[0].env`)
# - name: NVIDIA_DRIVER_VERSION
#   value: "384.145"
Reference: https://github.com/GoogleCloudPlatform/container-engine-accelerators/tree/master/cmd/nvidia_gpu
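Either way, you can check that the override actually ended up in the daemonset spec. This is a quick sanity check; note that existing daemonset pods may need to be recreated to pick up the change.
# Print the env of the init container to verify NVIDIA_DRIVER_VERSION is set
kubectl get daemonset -n kube-system nvidia-driver-installer \
  -o jsonpath='{.spec.template.spec.initContainers[0].env}'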
6. Configure some spawn options
Perhaps the user does not always need a GPU, so it is good to let the user choose. This can be done with the following configuration.
WARNING: The most recent z2jh Helm chart uses KubeSpawner 0.10.1, which has deprecated image_spec in favor of image, and this configuration has been updated to match. Tweak this configuration to use image_spec if you want to deploy with an older version of the z2jh chart or KubeSpawner.
singleuser:
  profileList:
    - display_name: "Default: Shared, 8 CPU cores"
      description: "By selecting this choice, you will be assigned a environment that will run on a shared machine with CPU only."
      default: True
    - display_name: "Dedicated, 2 CPU cores & 13GB RAM, 1 NVIDIA Tesla K80 GPU"
      description: "By selecting this choice, you will be assigned a environment that will run on a dedicated machine with a single GPU, just for you."
      kubespawner_override:
        image: consideratio/singleuser-gpu:v0.3.0
        extra_resource_limits:
          nvidia.com/gpu: "1"
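Once a user has picked the GPU profile, you can verify that the spawned pod actually requested a GPU. This assumes the jhub namespace used elsewhere in this post, and jupyter-someuser is a made-up pod name.
# Verify the user pod got the nvidia.com/gpu limit (replace the pod name)
kubectl get pod -n jhub jupyter-someuser \
  -o jsonpath='{.spec.containers[0].resources.limits}'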
Result
Note that this displays a screenshot of the configuration I've utilized, which differs slightly from the example configuration and setup documented in this post.

7. Verify GPU functionality
After you have a GPU-enabled Jupyter pod launched and running, you can verify that your GPU works as intended by doing the following...
- Open a terminal and run:
# Verify that the following command ...
nvidia-smi
# has an output like below:
Mon Nov 12 10:38:07 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   61C    P0    72W / 149W |  10877MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
Could not find nvidia-smi?
- Perhaps the PATH variable was not configured properly? Try inspecting it and looking inside the folder where the nvidia-smi binary was supposed to be installed (see also the terminal checks after this list).
- Perhaps the nvidia-driver-installer failed to install the driver?
- Clone this repo
git clone git@github.com:aymericdamien/TensorFlow-Examples.git
- Open a demonstration notebook, for example TensorFlow-Examples/notebooks/convolutional_network.ipynb, and run all cells.
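For a quicker check from a terminal before running a full notebook, something like the following should work. This is a small sketch that assumes the image from this post, with a 1.x tensorflow-gpu installed.
# Check that the driver utilities and libraries are on the expected paths
echo $PATH | tr ':' '\n' | grep nvidia
ls /usr/local/nvidia/bin /usr/local/nvidia/lib64
# Ask TensorFlow 1.x whether it can see the GPU
python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"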
Previous issues
Autoscaling - no longer an issue?
UPDATE: I'm not sure why this happened, but it doesn't happen any more for me.
I've had massive trouble with autoscaling. I managed to autoscale from 1 to 2 nodes, but it took 37 minutes... Scaling down worked as it should, with the unused GPU node being scaled down after 10 minutes.
To handle the long scale up time, you can configure a long timeout for kubespawner's spawning procedure like this:
singleuser:
  startTimeout: 3600
Latest update (2018-11-15)
I got autoscaling to work, but it is still slow: it takes about 9 minutes plus the time for your image to be pulled to the new node. Some lessons learned:
- The cluster autoscaler runs simulations using a hardcoded copy of the kube-scheduler default configuration logic, so utilizing a custom kube-scheduler configuration with different predicates could cause issues. See https://github.com/kubernetes/autoscaler/issues/1406 for more info.
- I stopped using a dynamically applied label as a label selector (cloud.google.com/gke-accelerator=nvidia-tesla-k80). I don't remember if this worked at all with the cluster autoscaler, or whether it scaled both from 0->1 nodes and from 1->2 nodes. If you want to select a specific GPU from multiple node pools, I'd recommend adding your own pre-defined labels like gpu: k80 and selecting on them with a nodeSelector (see the sketch after this list).
- I started using the default-scheduler instead of the jupyterhub user-scheduler, as I figured it would be safer to not risk a difference in which predicates they use, even though they may have the exact same predicates configured. NOTE: a predicate is, in this case, a function that takes information about a node and returns true or false depending on whether the node is a candidate to be scheduled on.
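A hedged sketch of the pre-defined label approach; gpu=k80 is a made-up label and the node name is a placeholder, so adapt to your setup.
# Label existing GPU nodes with a pre-defined label (gpu=k80 is made up)
kubectl label nodes the-name-of-the-gpu-node gpu=k80
# For node pools, the label is better set at creation time so new nodes get it too:
# gcloud container node-pools create ... --node-labels gpu=k80
# The user pods can then select these nodes via a nodeSelector, for example by
# setting singleuser.nodeSelector in the z2jh Helm chart values, or node_selector
# in a kubespawner_override of a profileList entry.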
To debug the autoscaler:
- Inspect the spawning pods events using kubectl describe pod -n jhub jupyter-erik-2esundell
- Inspect the cluster autoscalers status configmap by running:
kubectl get cm -n kube-system cluster-autoscaler-status -o yaml
- Look for the node pool in the output, mine was named user-k80
      Name:        https://content.googleapis.com/compute/v1/projects/ds-platform/zones/europe-west1-b/instanceGroups/gke-floss-user-k80-dd296e90-grp
- Inspect the status of your node pool regarding cloudProviderTarget, registered and ready.
      Name:        https://content.googleapis.com/compute/v1/projects/ds-platform/zones/europe-west1-b/instanceGroups/gke-floss-user-k80-dd296e90-grp
      Health:      Healthy (ready=2 unready=0 notStarted=0 longNotStarted=0 registered=2 longUnregistered=0 cloudProviderTarget=2 (minSize=1, maxSize=2))
- You want all of them to become ready.
- You can also inspect the node events with kubectl describe node the-name-of-the-node:
Events:
  Type    Reason                   Age   From                                              Message
  ----    ------                   ----  ----                                              -------
  Normal  Starting                 20m   kubelet, gke-floss-user-k80-dd296e90-99fw         Starting kubelet.
  Normal  NodeHasSufficientDisk    20m   kubelet, gke-floss-user-k80-dd296e90-99fw         Node gke-floss-user-k80-dd296e90-99fw status is now: NodeHasSufficientDisk
  Normal  NodeHasSufficientMemory  20m   kubelet, gke-floss-user-k80-dd296e90-99fw         Node gke-floss-user-k80-dd296e90-99fw status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    20m   kubelet, gke-floss-user-k80-dd296e90-99fw         Node gke-floss-user-k80-dd296e90-99fw status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     20m   kubelet, gke-floss-user-k80-dd296e90-99fw         Node gke-floss-user-k80-dd296e90-99fw status is now: NodeHasSufficientPID
  Normal  NodeAllocatableEnforced  20m   kubelet, gke-floss-user-k80-dd296e90-99fw         Updated Node Allocatable limit across pods
  Normal  NodeReady                19m   kubelet, gke-floss-user-k80-dd296e90-99fw         Node gke-floss-user-k80-dd296e90-99fw status is now: NodeReady
  Normal  Starting                 19m   kube-proxy, gke-floss-user-k80-dd296e90-99fw      Starting kube-proxy.
  Normal  UnregisterNetDevice      14m   kernel-monitor, gke-floss-user-k80-dd296e90-99fw  Node condition FrequentUnregisterNetDevice is now: False, reason: UnregisterNetDevice
Potentially related:
I'm using Kubernetes 1.11.2-gke.9, but my GPU nodes apparently have 1.11.2-gke.15.
Autoscaling from 0 nodes: https://github.com/kubernetes/autoscaler/issues/903
User placeholders for GPU nodes
Currently the user placeholders can only go to one kind of node pool, and it would make sense to let the admin configure how many placeholders go to a normal pool and how many to a GPU pool. Placeholders are needed to autoscale ahead of arriving users so they are not forced to wait for a new node, and this could be extra relevant for GPU nodes: without placeholders, a GPU node may need to be created on the fly every time a real user arrives.
We could perhaps instantiate multiple placeholder deployment/statefulsets based on a template and some extra specifications.
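As a rough sketch of the idea, and not something the z2jh chart supports today: a separate placeholder deployment that requests a GPU and runs at a low priority could keep a GPU node warm and get preempted when a real user arrives. All names, the jhub namespace, and the priority value below are made up for illustration.
# A hypothetical GPU user-placeholder: a low-priority pod that holds one GPU so
# a node stays available, and gets preempted when a real user pod (default
# priority 0) needs the GPU. All names below are made up.
kubectl apply -f - <<EOF
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: gpu-user-placeholder-priority
value: -10
globalDefault: false
description: "Lower than the default priority (0), so real user pods preempt placeholders"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-user-placeholder
  namespace: jhub
spec:
  replicas: 1
  selector:
    matchLabels:
      component: gpu-user-placeholder
  template:
    metadata:
      labels:
        component: gpu-user-placeholder
    spec:
      priorityClassName: gpu-user-placeholder-priority
      terminationGracePeriodSeconds: 0
      containers:
        - name: pause
          image: k8s.gcr.io/pause:3.1
          resources:
            limits:
              nvidia.com/gpu: "1"
EOF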
Pre pulling images specifically for GPU nodes
Currently we can only specify one kind of image puller, pulling all kinds of images to a single type of node. It is pointless to pull, and especially to wait for, unneeded images, so it would be nice to optimize this somehow.
This is tracked in #992 (thanks @jzf2101!)
The future - Shared GPUs
Users cannot share GPUs like they can share CPU; this is an issue. But in the future, perhaps? From what I've heard, this is something that is progressing right now.
Awesome! 🍰
@consideRatio Hi, do you think swapping in pytorch with tensorflow in the dockerfile will work? (changing conda channel and pytorch)
@koustuvsinha yepp, installing both would also work i think.
Cool. It sure will be fun to try to use GPUs on Azure AKS. Will report after having a chance to work on it.
The post is now updated; I think it is easier to read and has a more logical order to the steps taken. It also has some extra verification steps, but still not enough of them, I think.
This is related to https://github.com/jupyterhub/zero-to-jupyterhub-k8s/issues/992 correct?
@jzf2101 This is #994 :D You meant #1021? Yeah, those that get their own GPU etc. could certainly be the kind of users that would appreciate being able to do sudo apt-get install ... I'm glad you raised that issue; it is very relevant for me to get more knowledgeable about as well.
Correction- I meant #992
@jzf2101 ah! yepp thanks for connecting this
Made an update to the text: I added information about autoscaling the GPU nodes. Something resolved itself, I'm not sure what; now it "only" takes 9 minutes + image pulling to get a GPU node ready.
Which version of Ubuntu is in the Docker Images? I can't find it in the notes.
@jzf2101 the image I provide in this post is built from jupyter/datascience-notebook (1), built on top of scipy-notebook (2), on top of minimal-notebook (3), on top of base-notebook (4), on top of Ubuntu 18.04 aka bionic.
- jupyter/datascience-notebook (https://github.com/jupyter/docker-stacks/blob/master/datascience-notebook/Dockerfile)
- jupyter/scipy-notebook (https://github.com/jupyter/docker-stacks/blob/master/scipy-notebook/Dockerfile)
- jupyter/minimal-notebook (https://github.com/jupyter/docker-stacks/blob/master/minimal-notebook/Dockerfile)
- jupyter/base-notebook (https://github.com/jupyter/docker-stacks/blob/master/base-notebook/Dockerfile)
@consideRatio Thank you for putting this together! I am currently stuck at Step #5. I get an error when I try to run kubectl logs
error: cannot get the logs from *extensions.DaemonSet
kubectl get -n kube-system ds/nvidia-driver-installer gets me this:
NAME                      DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
nvidia-driver-installer   1         1         1         1            1           
Suggestions?
@amanda-tan hmm clueless, but you could do a more explicit command:
kubectl logs -n kube-system nvidia-driver-installer-alskdjf
Where you would enter your actual pod name
also add the container name, as the ds uses init containers:
kubectl logs -n kube-system nvidia-driver-installer-alskdjf -c nvidia-driver-installer
Best,
clkao
@clkao Yes! That worked thank you!
Also, just wanted to add that I got this to work -- the profileList config did not work for me; I probably made an error somewhere, but just whittling it down to:
extraConfig: |-
  c.KubeSpawner.extra_resource_limits = {"nvidia.com/gpu": "1"}
worked like a charm. Thank you so much.
- Has anyone tried provisioning pre-emptible GPU instances with this setup? I am having a hard time getting beyond one instance of pre-emptible GPU.
- Also, I am trying to use this setup for a classroom and it seems extremely cost-ineffective; are there suggestions on how to lower the overall costs?
ETA: I guess there is also a Pre-emptible GPU quota which must be increased! That solved #1.
@amanda-tan yepp this will cost a lot. I don't know how to reduce the cost much, but the experience for the users can be improved greatly with user placeholders, as found in the already available z2jh 0.8-dev releases. In the best case, users would not have to wait for the scale-up at all. See the "optimizations" section of z2jh.jupyter.org for more info about such autoscaling optimizations. Requires k8s 1.11+ and Helm 2.11+.
Having multiple GPUs per node is also a reasonable idea; then the users could share some CPU even though they don't share the GPUs.
I ran a short course using Jupyterhub and Kubernetes with pre-emptible GPUs and scaled up to about 50 users. I ran the nodes for 8 hours with a total cost of about $75 on Google Cloud. Using 10 CPU/8 GPU clusters worked well for me so that each user had 1 CPU and 1 GPU available. You do need an extra 2 CPUs per node to manage the sub-cluster, otherwise you will have 1 GPU sitting idle per cluster. Use K80 GPUs to keep costs minimized and make sure you are running in a region and zone that has them available. Adding extra RAM to a node is really cheap, so don't be afraid to do that beyond the 6.5 GB per CPU standard for the highmem instances.
Make sure you have your quota increase requests in well before you need the nodes for the course because that was one of the more challenging parts for me to get through. You will need the GPUs (all regions) and regional GPU quotas increased. There are also separate quotas for preemptible GPUs versus regular GPUs, so be aware of those. You may also run into issues with quotas on the number of CPUs and the number of IP addresses you can have, so check on all those.
I'm looking forward to GPU integration! How can I apply this without GKE, on my own Kubernetes cluster? How should I handle nvidia-driver-installer?
@FCtj I don't know, but you would need to redo some of the work GKE has done if you want to do this. I would consider this a very advanced topic.
GKE had one daemonset registering GPU nodes etc. to Kubernetes before the nvidia-driver-installer came into play; I think it was called nvidia-device-plugin. Also note that this specifically regards NVIDIA graphics cards.
Thanks for this wonderful doc! This worked great for getting a GPU set up on our k8s cluster on google cloud. However, I have been having trouble getting it to work with tensorflow. I used your image with a Tesla P4 but got an issue:
In [2]: tf.Session()
2019-03-29 01:35:34.339396: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-03-29 01:35:34.439991: E tensorflow/stream_executor/cuda/cuda_driver.cc:300] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2019-03-29 01:35:34.440053: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:163] retrieving CUDA diagnostic information for host: jupyter-...
2019-03-29 01:35:34.440063: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:170] hostname: jupyter-...
2019-03-29 01:35:34.440136: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:194] libcuda reported version is: 410.79.0
2019-03-29 01:35:34.440166: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:198] kernel reported version is: 410.79.0
2019-03-29 01:35:34.440174: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:305] kernel version seems to match DSO: 410.79.0
Any idea why this could happen?
@jrmlhermitte thanks for the feedback! Hmmm, sadly that error didn't help me much. What happens if you run nvidia-smi, the terminal command line tool? It is supposed to describe some info about the GPU situation in general. I found that getting that functional was an essential first step before TF.
@jrmlhermitte What version of tensorflow-gpu did you install? The TensorFlow 1.13 binary is built with CUDA 10, while 1.12 and earlier are built with CUDA 9. Try pip install tensorflow-gpu==1.12 and see if that fixes the problem.
Thanks for the quick response @consideRatio . Here is my nvidia-smi output:
Sat Mar 30 03:02:44 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   35C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
pip install tensorflow-gpu==1.12 didn't work either.
Could it have something to do with the GPU? I may try to switch to a k80.
Alright, never mind, I got it to work!!!!!! I had to make sure to patch the daemonset before starting the node. That did the trick for me. (I also needed the patch you mentioned.) I need to do some more reading myself about all this, but I'm also interested to hear more about your developments.
I would suggest mentioning this detail in the README. Oh, and here is nvidia-smi after a successful attempt:
jovyan@jupyter-jrmlhermitte:~$ nvidia-smi
Sat Mar 30 03:43:35 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.145                Driver Version: 384.145                   |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   55C    P0    24W /  75W |   7395MiB /  7606MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
Thanks for the feedback!
I'm a bit confused about the cudatoolkit==9.0.0 pinning. According to https://github.com/NVIDIA/nvidia-docker/wiki/CUDA, 10.1 should be compatible with the Keplers.
(I am new to this so perhaps missing something)
Like @beniz (https://github.com/jolibrain/docker-stacks/tree/master/jupyter-dd-notebook-gpu), I am building off the NVIDIA images (ARG BASE_CONTAINER=nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04), but I notice he pins to 9.0.0 in the Dockerfile as well.
Finally, if one builds off the NVIDIA base image, should one install tensorflow-gpu or just plain tensorflow? (The recipe here suggests tensorflow-gpu, @beniz suggests tensorflow.)
@rahuldave depending on what graphics card and drivers you have, you can use various versions of CUDA. This was done for NVIDIA K80's on Google Cloud a while back; the circumstances may have changed and may not apply to you.
I'm planning to run on GKE with K80's. I was just confused by https://github.com/NVIDIA/nvidia-docker/wiki/CUDA saying that toolkit version 10.1 is compatible with Keplers.
I'll try the standard NVIDIA image (though I have this feeling that the conda tensorflow-gpu and pytorch packages may override my 10.1 install). We'll see.
EDIT: the conda build installs cudatoolkit 10.0 based on the conda dependencies. Wondering then if I needed to start the base-notebook from NVIDIA's 18.04 image (as @beniz does) rather than the jupyter one...
@rahuldave hmmm, this was the most troublesome part of the setup for me to get right. Perhaps the K80 does not support the driver version that is required to use that version of CUDA, then?