mybinder.org-deploy
Re-evaluate machine types
We have updated our resource requests/limits in #636. Our current requests and limits (sketched as a values snippet below the list) are:
- memory:
  - request: 1G
  - limit: 2G
- cpu:
  - request: 0.1
  - limit: 1
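For reference, a rough sketch of how numbers like these might be expressed in a Zero to JupyterHub-style values snippet and applied with helm; the file name, release name, and chart path below are placeholders, not the actual mybinder.org-deploy layout:

```bash
# Hypothetical values snippet mirroring the requests/limits above
# (the nesting assumes the Zero to JupyterHub "singleuser" block).
cat > user-resources.yaml <<'EOF'
jupyterhub:
  singleuser:
    memory:
      guarantee: 1G
      limit: 2G
    cpu:
      guarantee: 0.1
      limit: 1
EOF

# Apply with helm; "mybinder" / "./mybinder" are placeholder release and chart names.
helm upgrade mybinder ./mybinder -f user-resources.yaml
```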
With this update, the trigger for our latest autoscale event was "insufficient pod", because Kubernetes imposes a 110-pod limit per node. Since our current n1-highmem-32 nodes have 32 CPUs and 208 GB of RAM, we will never trigger autoscaling via CPU (~320 pods) or memory (~200 pods); we will always hit the absolute pod limit of 110 first. The upside is that even if every pod uses its full 2G limit, we won't run out of memory on the machine.
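As a sanity check on those ceilings, here's one way to read the per-node pod cap straight from the cluster and redo the arithmetic; the 32 CPU / 208 GB figures are the n1-highmem-32 specs quoted above:

```bash
# Each node reports its pod capacity (GKE defaults to 110).
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.capacity.pods}{"\n"}{end}'

# Pods-per-node ceilings implied by the current requests (0.1 CPU, 1G memory).
awk 'BEGIN {
  cpus = 32; mem_gb = 208
  printf "by cpu request:    %d pods\n", cpus / 0.1
  printf "by memory request: %d pods\n", mem_gb / 1
  printf "by pod cap:        110 pods\n"
}'
```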
We could switch to n1-standard-32 nodes, which have 120 GB of RAM, so the memory requests would fill up at about the same point as the per-node pod limit. That would make it possible to oversubscribe memory by ~2x if a significant fraction of pods actually used a lot of it, but the reason we dropped the request is that this doesn't generally happen (`kubectl top pod` now shows 1/202 pods over 1G and 4/202 over 300M). We could also switch to n1-highmem-16 nodes, but then it's 16 CPUs for ~100 user pods, which is a lot if a significant fraction are really using them.
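For anyone who wants to reproduce counts like "1/202 pods over 1G", something along these lines works; the `prod` namespace is an assumption, and `kubectl top` reports memory in Mi:

```bash
# Count user pods whose working-set memory exceeds 1 Gi (namespace assumed).
kubectl top pod -n prod --no-headers \
  | awk '{ mem = $3; gsub(/Mi/, "", mem); if (mem + 0 > 1024) over++; total++ }
         END { printf "%d/%d pods over 1Gi\n", over, total }'
```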
Price choices (per hour):
- n1-highmem-32 $1.8944 (current)
- n1-standard-32 $1.5200
- n1-highmem-16 $0.9472
Dropping to n1-highmem-16 has bigger savings, since it cuts our per-node price in half and our memory requests come close to coinciding with the 110 pods-per-node limit. We would likely spend more time with 3-4 nodes than we do now, where we spend most of the time on the 2/3-node borderline.
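Back-of-the-envelope on the hourly list prices above for the node counts we'd likely run (ignoring sustained-use discounts):

```bash
# Hourly compute cost at the list prices above, for 2-3 vs 3-4 nodes.
awk 'BEGIN {
  printf "2-3 x n1-highmem-32: $%.2f - $%.2f per hour\n", 2 * 1.8944, 3 * 1.8944
  printf "3-4 x n1-highmem-16: $%.2f - $%.2f per hour\n", 3 * 0.9472, 4 * 0.9472
}'
```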
An additional issue with n1-highmem-16 is that we could heavily oversubscribe CPUs if a bunch of folks showed up and actually burned CPU: 100 processes pegged at 1 CPU each on 16 cores would slow things down a lot. Of course, this is also something that doesn't happen under typical load (currently only 3/231 pods are using more than 0.1 CPU).
Nice work finding the 110 pod limit.
I am feeling adventurous, so why don't we try out n1-highmem-16 and see what happens? Most pods also don't seem to be using all that much CPU. We can switch to n1-standard-32 without too much notice if we see things going wrong/being super slow.
Sounds good. Let's add that to the docket for tomorrow's scheduled deploy. We learned from the recent Kubernetes cluster version upgrade that we can change machine types with close to zero downtime by allocating a new node pool and cordoning the old ones.
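For the record, the node-pool swap could look roughly like this; the cluster, zone, and pool names are placeholders, and the old nodes are simply cordoned and left to empty out before deleting the pool:

```bash
# Create the replacement pool on the new machine type (names/zone are placeholders).
gcloud container node-pools create user-pool-highmem16 \
  --cluster=prod-cluster --zone=us-central1-a \
  --machine-type=n1-highmem-16 --num-nodes=3

# Cordon every node in the old pool so no new user pods land there.
for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=old-pool -o name); do
  kubectl cordon "$node"
done

# Once the old nodes have drained, delete the old pool.
gcloud container node-pools delete old-pool --cluster=prod-cluster --zone=us-central1-a
```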
So if I'm reading this correctly:
- We may be using fancier computers on GCP than we really need, because we'll hit the pod limit on a node way before we'll hit the memory/cpu limit
- One option is to cut our CPUs in half, which increases the chance that we're oversubscribed on CPUs, but that chance is still quite low
- In this case, we'd have 3-4 machines more often, instead of the 2-3 we have now.
- Since a 16-CPU machine costs 50% as much as a 32-CPU one, we'd get a decent cost savings.
- The plan is to make this switch Tuesday AM Europe time
yeah?
> With this update, the trigger for our latest autoscale event was "insufficient pod", because Kubernetes imposes a 110-pod limit per node.
Great find. Based on this additional info and further research by all of you, I'm totally cool with the switch. Let's make sure we document the fun facts (i.e. 110 pods).
Note: I won't be available for the deployment tomorrow (Tuesday).
"A bunch of users who actually use CPU at the same time" sounds like that would happen to us in a workshop/classroom setting. The typical size there seems to be around 20-30ish. I think it is something we should keep an eye on, but with 3-4 machines running they'd hopefully distribute themselves across the cluster.
One thing I wanted to ask: are the CPU requests we make for the pods that run and monitor the service high enough, or should we increase them a bit? The reason I'm asking is to make sure those pods don't get slowed down too much by a bunch of users who actually use CPU. I don't know the answer; I can try and check tomorrow morning.
> "A bunch of users who actually use CPU at the same time" sounds like that would happen to us in a workshop/classroom setting.
Potentially, though most workshops spend the bulk of their time reading/typing/learning, not running cpu-burning code all at the same time. A scikit-learn model-training workshop is the sort of thing that could be an exception here.
The nice thing about oversubscribing CPUs vs memory is that the performance cost is much lower. Running 200 max-burn threads on 16 CPUs makes everything slow (and could cause an increase in spawn timeouts), but requesting 250 GB of RAM when only 200 GB are available causes big problems.
> are the CPU requests we make for the pods that run and monitor the service high enough, or should we increase them a bit?
I believe so. All of our services have fixed (request == limit) resources, and they tend to fit well within their requirements:
- hub: 2 cpu
- chp: 0.5 cpu
- nginx: 1 cpu
- grafana: 0.2 cpu
- prometheus: 4 cpu
According to our Component Resource Metrics (yay for charts!) we are well within all of these limits.
However, the requests and limits don't put up walls that protect them from user pods. If user pods are collectively allowed to use more CPU than the node has spare, they can take CPU time away from our services. So if user pods really start burning threads (e.g. mining cryptocurrency), we could see an impact, but the OS scheduler is pretty good at handling lots of threads.
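A quick way to confirm the "request == limit" setup on any of these pods is the QoS class Kubernetes assigns; pods with matching requests and limits for every container show up as Guaranteed. The pod name and namespace below are placeholders:

```bash
# Pods with request == limit for every container get the "Guaranteed" QoS class.
kubectl get pod hub-123abc -n prod -o jsonpath='{.status.qosClass}{"\n"}'
```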
I'm going to try creating the new node pool this morning.
@choldgraf great summary.
@willingc good idea, I'll look for a place in the docs to put something like "picking a node type"
thanks for making the switch this AM - I'll let folks know if it causes a change in our billing rates!
In one 24-hour cycle with a peak of 263 pods, we stayed on 3 nodes (just shy of requesting a new node). Since we were usually running on 3 nodes before, this should cut our compute bill almost in half.
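If anyone wants to watch how close each node gets to the 110-pod cap during a peak like that, a tally along these lines should do it (it counts pods in all namespaces, since the cap applies to everything on the node):

```bash
# Tally pods per node; with --all-namespaces the NODE column is field 8.
kubectl get pods --all-namespaces -o wide --no-headers \
  | awk '{ count[$8]++ } END { for (n in count) printf "%-45s %d pods\n", n, count[n] }'
```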
here's the billing for the last 10 days - you can see the dip at the end
Nice!!!!!

Should we experiment with further lowering the guaranteed amount of memory per pod to drive up that memory-used chart? I believe this chart shows how much memory is actually used, not how much we allocated via the scheduler. Especially now that our core services are all in their cushy gated community.
We could also experiment with custom node types to get a more favourable ratio of CPU to RAM (if that saves us money)?
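If we try the custom route, GKE node pools accept GCE's custom-<vCPUs>-<memoryMB> machine-type form; the numbers below are purely illustrative (16 vCPUs with 104 GB, i.e. the same shape as n1-highmem-16), and the cluster/zone/pool names are placeholders:

```bash
# Example node pool on a custom machine type: 16 vCPUs, 104 GB (106496 MB) of RAM.
gcloud container node-pools create user-pool-custom \
  --cluster=prod-cluster --zone=us-central1-a \
  --machine-type=custom-16-106496 --num-nodes=3
```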
Another thing I'd be interested in hearing opinions on is how we could experiment with preemptible nodes. Does anyone have any experience with them? One thing that might work is to send all the instances that come from try.jupyter.org to those nodes. Would require checking that indeed those are the shortest lived binders we have.
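Mechanically, a preemptible pool is just a flag at node-pool creation time; steering only the try.jupyter.org launches onto it would additionally need a node label like the one below plus a matching nodeSelector/affinity on those pods. Names here are made up:

```bash
# Preemptible node pool with a label that pod scheduling could target.
gcloud container node-pools create user-pool-preemptible \
  --cluster=prod-cluster --zone=us-central1-a \
  --machine-type=n1-highmem-16 --num-nodes=1 \
  --preemptible \
  --node-labels=mybinder.org/pool-type=preemptible
```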
Is there any recommended resource to study regarding changing the machine type of an existing JupyterHub on Kubernetes? I am currently running a cluster with a default-pool and a user-pool, both n1-standard-2, and I'd like to migrate both of them to n1-highmem-4 for a larger class. I've tried following https://cloud.google.com/kubernetes-engine/docs/tutorials/migrating-node-pool but did not succeed. I also tried the clusters resize command and got a quota-limit response. For now, I can only downgrade the per-user RAM guarantee to 64M in order to accommodate all students logging in during lecture.
@yaojenkuo that's a good question, but it's unrelated to this issue. Would you mind asking your question on the community forum instead where more people hang out, and the rest of the community can benefit from the discussions?