
Spec for a GKE Kubernetes GPU Cluster w/ Cloud Run

Open · minimaxir opened this issue 6 years ago · 6 comments

Create a k8s .yaml spec that stands up a cluster able to support GPT-2 APIs w/ GPUs for faster serving.

Goal

  • Each Node has as much GPU utilization as possible.
  • Able to scale down to zero (for real, GKE is picky about that)

Proposal

  • A single f1-micro Node so the GPU Pods can scale to 0 (a single f1-micro is free)
  • The other Node is an n1-highcpu-16 (16 vCPU / 14.4 GB RAM).
  • Each Pod uses 4 vCPU, 1 K80 GPU, and has a Cloud Run concurrency of 4.

Therefore, a single Node can accommodate up to 4 different GPT-2 APIs or the same API scaled up, which is neat.
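
As a rough sketch of the per-Pod shape this implies (written with the Python kubernetes client rather than raw YAML; the image, memory request, and label values are placeholders, not settled choices):

```python
# Sketch of the per-Pod resources the proposal implies, via the Python
# kubernetes client; image, memory request, and labels are placeholders.
from kubernetes import client

container = client.V1Container(
    name="gpt-2-api",
    image="gcr.io/my-project/gpt-2-api:latest",  # placeholder image
    resources=client.V1ResourceRequirements(
        requests={"cpu": "4", "memory": "3Gi"},  # 4 of the Node's 16 vCPUs
        limits={"nvidia.com/gpu": "1"},          # 1 K80 per Pod
    ),
)

pod_spec = client.V1PodSpec(
    containers=[container],
    # keep GPU Pods off the free f1-micro and on the GPU node pool
    node_selector={"cloud.google.com/gke-accelerator": "nvidia-tesla-k80"},
)
```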

In testing, a single K80 can generate about 20 texts at a time before going OOM, so setting a maximum of 16 should leave enough of a buffer for storing the model. If not, using T4 GPUs should give a vRAM boost.
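
For reference, the kind of check behind that K80 number is plain batch generation with gpt-2-simple; a sketch, with illustrative model name, length, and batch sizes:

```python
# Rough vRAM-capacity probe, assuming a gpt-2-simple setup; the model
# name, length, and batch sizes are illustrative, not measured settings.
import gpt_2_simple as gpt2

gpt2.download_gpt2(model_name="117M")
sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess, model_name="117M")

# Step the batch size up until the GPU OOMs; ~20 was the observed
# ceiling on a K80, hence the cap of 16 above.
for batch_size in (4, 8, 16, 20):
    texts = gpt2.generate(sess, model_name="117M",
                          nsamples=batch_size, batch_size=batch_size,
                          length=200, return_as_list=True)
    print(f"batch_size={batch_size}: generated {len(texts)} texts")
```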

minimaxir · May 18 '19 23:05

Cloud Run may not work well here because it does not allow you to configure the number of vCPUs per service.

It may be better to use raw Knative for it until Google adds that feature.

minimaxir · May 18 '19 23:05

Interesting issue with trying to put K80s on an n1-highcpu-16:

"The number of GPU dies is linked to the number of CPU cores and memory selected for this instance. For the current configuration, you can select no fewer than 2 GPU dies of this type."

So T4 it is.

minimaxir · May 19 '19 01:05

Better solution: leverage Python's async to minimize the dedicated resources needed, so we can still use K80s.

With gpt-2-simple, generation happens entirely on the GPU, so that might work. We might be able to get away with a 4 vCPU n1-standard-4 system (1 vCPU per Pod) and use a K80 (but still 4 concurrent users per Pod, 16 users per Node). The total cost is less than half of the original proposal.

And since each Pod would use only 1 vCPU, we could set it up on Cloud Run, which might be easier than working with Knative.
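
One way to read "leverage Python's async" is to keep the single vCPU free for request handling while a lone worker thread owns all GPU calls; a minimal sketch, assuming aiohttp and gpt-2-simple (route and handler names are illustrative, not the repo's code):

```python
# One possible shape for the async idea; illustrative only.
import asyncio
from concurrent.futures import ThreadPoolExecutor

import gpt_2_simple as gpt2
from aiohttp import web

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess)  # assumes a finetuned checkpoint is present

# A single worker thread owns every sess.run() call, so the event loop
# (and the 1 vCPU) stays free to accept and queue requests.
gpu_executor = ThreadPoolExecutor(max_workers=1)

async def generate(request):
    params = await request.post()
    loop = asyncio.get_event_loop()
    text = await loop.run_in_executor(
        gpu_executor,
        lambda: gpt2.generate(sess, prefix=params.get("prefix", ""),
                              length=100, return_as_list=True)[0],
    )
    return web.json_response({"text": text})

app = web.Application()
app.router.add_post("/", generate)

if __name__ == "__main__":
    web.run_app(app, port=8080)
```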

minimaxir · May 19 '19 17:05

Unfortunately, this is not as easy as expected, since a tf.Session cannot be shared between threads or processes, which dramatically reduces the async possibilities.

For the initial release I might be OK without it, especially if the GPU has high enough throughput.

minimaxir · May 23 '19 03:05

Update: you can share a tf.Session, but it's not easy and might not necessarily result in a performance gain. It does, however, save GPU vRAM, which is a necessary precondition (estimated 2.5 GB ceiling when generating 4 predictions at a time, so 4 containers will fit in a 12 GB vRAM GPU).

The best architecture is still 4 vCPU + 1 GPU w/ 4 containers, but it may be better to see if Cloud Run can assign each container 4 vCPUs and share threads (Flask's native server is threaded by default) and route accordingly, and then see if that causes any deadlocks.
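
A minimal sketch of that shared-Session-plus-threads idea, assuming a gpt-2-simple setup (the route, lock discipline, and settings are illustrative, not the repo's actual code):

```python
# Shared tf.Session across Flask's request threads; illustrative only.
import threading

import gpt_2_simple as gpt2
from flask import Flask, jsonify, request

app = Flask(__name__)

sess = gpt2.start_tf_sess()
gpt2.load_gpt2(sess)          # one Session, shared by all request threads
sess_lock = threading.Lock()  # serialize generation so peak vRAM stays
                              # within the ~2.5 GB per-container estimate

@app.route("/", methods=["POST"])
def generate():
    prefix = request.form.get("prefix", "")
    with sess_lock:
        text = gpt2.generate(sess, prefix=prefix, length=100,
                             return_as_list=True)[0]
    return jsonify({"text": text})

if __name__ == "__main__":
    # Flask's built-in server is threaded by default since 1.0, which is
    # the "share threads" idea above; Cloud Run would send each container
    # up to 4 concurrent requests.
    app.run(host="0.0.0.0", port=8080)
```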

minimaxir · May 24 '19 20:05

Hi Max! Thank you so much for creating gpt-2-cloud-run. It's been really useful and inspiring for my GPT-2 webapp. For this webapp I'm trying to deploy a fine-tuned 345M GPT-2 model (~1.4 GB) through Cloud Run on GKE, but I am unsure about the spec of the GKE cluster as well as what concurrency I should set.

Can you please advise on the number of nodes, machine type, and concurrency I should be using for maximum cost-effectiveness? Currently, I have a concurrency of 1 along with just 1 node (n1-standard-2; 7.5 GB; 2 vCPU) and a K80 attached to that node, but I'm not sure if this is the most cost-effective GKE spec.

I would really appreciate any insights on this! If it helps, I intend to deploy only this model and don't plan on having any more GPT-2 webapps.

kshtzgupta1 · Sep 05 '19 05:09