mybinder.org-deploy
Re-evaluate machine types
We have updated our resource requests/limits in #636. Our current requests and limits (sketched as a values snippet below the list) are:
- memory:
  - request: 1G
  - limit: 2G
- cpu:
  - request: 0.1
  - limit: 1
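For reference, a rough sketch of how numbers like these might be expressed in a Zero to JupyterHub-style values snippet and applied with helm; the file name, release name, and chart path below are placeholders, not the actual mybinder.org-deploy layout:

```bash
# Hypothetical values snippet mirroring the requests/limits above
# (the nesting assumes the Zero to JupyterHub "singleuser" block).
cat > user-resources.yaml <<'EOF'
jupyterhub:
  singleuser:
    memory:
      guarantee: 1G
      limit: 2G
    cpu:
      guarantee: 0.1
      limit: 1
EOF

# Apply with helm; "mybinder" / "./mybinder" are placeholder release and chart names.
helm upgrade mybinder ./mybinder -f user-resources.yaml
```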
With this update, the trigger for our latest autoscale event was "insufficient pod", because Kubernetes imposes a 110-pod limit per node. Since our current n1-highmem-32 nodes have 32 CPUs and 208 GB of RAM, we will never trigger autoscaling via CPU (~320 pods) or memory (~200 pods); we will always hit the absolute pod limit of 110 first. The upside is that even if every pod uses its full 2G limit, we won't run out of memory on the machine.
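As a sanity check on those ceilings, here's one way to read the per-node pod cap straight from the cluster and redo the arithmetic; the 32 CPU / 208 GB figures are the n1-highmem-32 specs quoted above:

```bash
# Each node reports its pod capacity (GKE defaults to 110).
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.pods}{"\n"}{end}'

# Pods-per-node ceilings implied by the current requests (0.1 CPU, 1G memory).
awk 'BEGIN {
  cpus = 32; mem_gb = 208
  printf "by cpu request:    %d pods\n", cpus / 0.1
  printf "by memory request: %d pods\n", mem_gb / 1
  printf "by pod cap:        110 pods\n"
}'
```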
We could switch to n1-standard-32 nodes, which have 120 GB of RAM, so the memory requests would fill up at about the same point as the per-node pod limit. That would make it possible to oversubscribe memory by ~2x if a significant fraction of pods actually used a lot of it, but the reason we dropped the request is that this doesn't generally happen (`kubectl top pod` now shows 1/202 pods over 1G and 4/202 over 300M). We could also switch to n1-highmem-16 nodes, but then it's 16 CPUs for ~100 user pods, which is a lot if a significant fraction are really using them.
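For anyone who wants to reproduce counts like "1/202 pods over 1G", something along these lines works; the `prod` namespace is an assumption, and `kubectl top` reports memory in Mi:

```bash
# Count user pods whose working-set memory exceeds 1 Gi (namespace assumed).
kubectl top pod -n prod --no-headers \
  | awk '{ mem = $3; gsub(/Mi/, "", mem); if (mem + 0 > 1024) over++; total++ }
         END { printf "%d/%d pods over 1Gi\n", over, total }'
```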
Price choices (per hour):
- n1-highmem-32 $1.8944 (current)
- n1-standard-32 $1.5200
- n1-highmem-16 $0.9472
Dropping to n1-highmem-16 has bigger savings, since it cuts our per-node price in half and our memory requests come close to coinciding with the 110 pods-per-node limit. We would likely spend more time with 3-4 nodes than we do now, where we spend most of the time on the 2/3-node borderline.
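Back-of-the-envelope on the hourly list prices above for the node counts we'd likely run (ignoring sustained-use discounts):

```bash
# Hourly compute cost at the list prices above, for 2-3 vs 3-4 nodes.
awk 'BEGIN {
  printf "2-3 x n1-highmem-32: $%.2f - $%.2f per hour\n", 2 * 1.8944, 3 * 1.8944
  printf "3-4 x n1-highmem-16: $%.2f - $%.2f per hour\n", 3 * 0.9472, 4 * 0.9472
}'
```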
An additional issue with n1-highmem-16 is that we could heavily oversubscribe CPUs if a bunch of folks showed up and actually burned CPU: 100 processes pegged at 1 CPU each on 16 cores would slow things down a lot. Of course, this is also something that doesn't happen under typical load (currently only 3/231 pods are using more than 0.1 CPU).
Nice work finding the 110 pod limit.
I am feeling adventurous, so why don't we try out n1-highmem-16 and see what happens? Most pods also don't seem to be using all that much CPU. We can switch to n1-standard-32 without too much notice if we see things going wrong/being super slow.
Sounds good. Let's add that to the docket for tomorrow's scheduled deploy. We learned from the recent Kubernetes cluster version upgrade that we can change machine types with close to zero downtime by allocating a new node pool and cordoning the old ones.
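For the record, the node-pool swap could look roughly like this; the cluster, zone, and pool names are placeholders, and the old nodes are simply cordoned and left to empty out before deleting the pool:

```bash
# Create the replacement pool on the new machine type (names/zone are placeholders).
gcloud container node-pools create user-pool-highmem16 \
  --cluster=prod-cluster --zone=us-central1-a \
  --machine-type=n1-highmem-16 --num-nodes=3

# Cordon every node in the old pool so no new user pods land there.
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=old-pool -o name); do
  kubectl cordon "$node"
done

# Once the old nodes have drained, delete the old pool.
gcloud container node-pools delete old-pool --cluster=prod-cluster --zone=us-central1-a
```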
So if I'm reading this correctly:
- We may be using fancier computers on GCP than we really need, because we'll hit the pod limit on a node way before we'll hit the memory/cpu limit
- One option is to cut our CPUs in half, which increases the chance that we're oversubscribed on CPUs, but that chance is still quite low
- In this case, we'd have 3-4 machines more often, instead of the 2-3 we have now.
- Since a 16-CPU machine costs 50% as much as a 32-CPU one, we'd get a decent cost savings.
- The plan is to make this switch Tuesday AM Europe time
yeah?
> With this update, the trigger for our latest autoscale event was "insufficient pod", because Kubernetes imposes a 110-pod limit per node.
Great find. Based on this additional info and further research by all of you, I'm totally cool with the switch. Let's make sure we document the fun facts (i.e. 110 pods).
Note: I won't be available for the deployment tomorrow (Tuesday).
"A bunch of users who actually use CPU at the same time" sounds like that would happen to us in a workshop/classroom setting. The typical size there seems to be around 20-30ish. I think it is something we should keep an eye on, but with 3-4 machines running they'd hopefully distribute themselves across the cluster.
One thing I wanted to ask: are the CPU requests we make for the pods that run and monitor the service high enough, or should we increase them a bit? The reason I'm asking is to make sure those pods don't get slowed down too much by a bunch of users who actually use CPU. I don't know the answer; I can try and check tomorrow morning.
> "A bunch of users who actually use CPU at the same time" sounds like that would happen to us in a workshop/classroom setting.
Potentially, though most workshops spend the bulk of their time reading/typing/learning, not running cpu-burning code all at the same time. A scikit-learn model-training workshop is the sort of thing that could be an exception here.
The nice thing about oversubscribing CPUs vs memory is that the performance cost is much lower. Running 200 max-burn threads on 16 CPUs makes everything slow (and could cause an increase in spawn timeouts), but requesting 250 GB of RAM when only 200 GB are available causes big problems.
> are the CPU requests we make for the pods that run and monitor the service high enough, or should we increase them a bit?
I believe so. All of our services have fixed (request == limit) resources, and they tend to fit well within their requirements:
- hub: 2 cpu
- chp: 0.5 cpu
- nginx: 1 cpu
- grafana: 0.2 cpu
- prometheus: 4 cpu
According to our Component Resource Metrics (yay for charts!) we are well within all of these limits.
However, the requests and limits don't put up walls that protect them from user pods. If user pods are collectively allowed to use more CPU than the node has spare, they can take CPU time away from our services. So if user pods really start burning threads (e.g. mining cryptocurrency), we could see an impact, but the OS scheduler is pretty good at handling lots of threads.
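A quick way to confirm the "request == limit" setup on any of these pods is the QoS class Kubernetes assigns; pods with matching requests and limits for every container show up as Guaranteed. The pod name and namespace below are placeholders:

```bash
# Pods with request == limit for every container get the "Guaranteed" QoS class.
kubectl get pod hub-123abc -n prod -o jsonpath='{.status.qosClass}{"\n"}'
```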
I'm going to try creating the new node pool this morning.
@choldgraf great summary.
@willingc good idea, I'll look for a place in the docs to put something like "picking a node type"
thanks for making the switch this AM - I'll let folks know if it causes a change in our billing rates!
In one 24-hour cycle with a peak of 263 pods, we stayed on 3 nodes (just shy of requesting a new node). Since we were usually running on 3 nodes before, this should cut our compute bill almost in half.
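If anyone wants to watch how close each node gets to the 110-pod cap during a peak like that, a tally along these lines should do it (it counts pods in all namespaces, since the cap applies to everything on the node):

```bash
# Tally pods per node; with --all-namespaces the NODE column is field 8.
kubectl get pods --all-namespaces -o wide --no-headers \
  | awk '{ count[$8]++ } END { for (n in count) printf "%-45s %d pods\n", n, count[n] }'
```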
here's the billing for the last 10 days - you can see the dip at the end
Nice!!!!!

Should we experiment with further lowering the guaranteed amount of memory per pod to drive up that memory-used chart? I believe this chart shows how much memory is actually used, not how much we allocated via the scheduler. Especially now that our core services are all in their cushy gated community.
We could also experiment with custom node types to get a more favourable ratio of CPU to RAM (if that saves us money)?
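If we try the custom route, GKE node pools accept GCE's custom-<vCPUs>-<memoryMB> machine-type form; the numbers below are purely illustrative (16 vCPUs with 104 GB, i.e. the same shape as n1-highmem-16), and the cluster/zone/pool names are placeholders:

```bash
# Example node pool on a custom machine type: 16 vCPUs, 104 GB (106496 MB) of RAM.
gcloud container node-pools create user-pool-custom \
  --cluster=prod-cluster --zone=us-central1-a \
  --machine-type=custom-16-106496 --num-nodes=3
```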
Another thing I'd be interested in hearing opinions on is how we could experiment with preemptible nodes. Does anyone have any experience with them? One thing that might work is to send all the instances that come from try.jupyter.org to those nodes. Would require checking that indeed those are the shortest lived binders we have.
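Mechanically, a preemptible pool is just a flag at node-pool creation time; steering only the try.jupyter.org launches onto it would additionally need a node label like the one below plus a matching nodeSelector/affinity on those pods. Names here are made up:

```bash
# Preemptible node pool with a label that pod scheduling could target.
gcloud container node-pools create user-pool-preemptible \
  --cluster=prod-cluster --zone=us-central1-a \
  --machine-type=n1-highmem-16 --num-nodes=1 \
  --preemptible \
  --node-labels=mybinder.org/pool-type=preemptible
```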
Is there any recommended resource to study regarding changing the machine type of an existing JupyterHub on Kubernetes? I am currently running a cluster with a default-pool and a user-pool, both n1-standard-2, and I'd like to migrate both of them to n1-highmem-4 for a larger class. I've tried following https://cloud.google.com/kubernetes-engine/docs/tutorials/migrating-node-pool but did not succeed. I also tried the clusters resize command and got a quota-limit response. For now, I can only downgrade the per-user RAM guarantee to 64M in order to accommodate all students logging in during lecture.
@yaojenkuo that's a good question, but it's unrelated to this issue. Would you mind asking your question on the community forum instead where more people hang out, and the rest of the community can benefit from the discussions?