pangeo-cloud-federation
hub resource requests not being overridden by config/common.yaml
We are squeezing all hub and core pods onto a single m5.large instance for the nasa deployment, but some pods aren't scheduling because we're requesting over 100% of the node's CPU. It seems like there might be a bug in how the hub resource requests are overridden here:
https://github.com/pangeo-data/pangeo-cloud-federation/blob/074777d914baaf61669d935967cd6647625ae8bb/deployments/nasa/config/common.yaml#L88-L95
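For context, those lines set the hub container's resources through the chart values. A minimal sketch of what that override is doing, assuming the usual zero-to-jupyterhub keys are reachable at `jupyterhub.hub.resources` under the parent chart (the exact nesting may differ) and using the 400m figure mentioned later in this thread; the authoritative block is at the link above:

```yaml
# Sketch only -- the real block lives at the linked lines of
# deployments/nasa/config/common.yaml, and the key nesting under the
# pangeo-deploy parent chart is assumed here.
jupyterhub:
  hub:
    resources:
      requests:
        cpu: 0.4       # intended request (400m), below the 500m the pod reports
        memory: 1Gi
      limits:
        cpu: 1.25
        memory: 1Gi
```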
Because the pod has the following config:

```
Containers:
  hub:
    Container ID:  docker://d166f73701a1a3e46860bf8f54d969a1a3ad6d3c6225177347a649e856596bb1
    Image:         jupyterhub/k8s-hub:0.9-445a953
    Image ID:      docker-pullable://jupyterhub/k8s-hub@sha256:0d6412029ad485fc704393d6bf7cea7c29ea5d08aa1a77de5ee65173afe5cd1a
    Port:          8081/TCP
    Host Port:     0/TCP
    Command:
      jupyterhub
      --config
      /srv/jupyterhub_config.py
      --upgrade-db
    State:          Running
      Started:      Mon, 29 Jul 2019 17:47:43 -0700
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1250m
      memory:  1Gi
    Requests:
      cpu:     500m
      memory:  1Gi
```
Our node currently looks like this:

```
Non-terminated Pods:          (15 in total)
  Namespace     Name                                 CPU Requests  CPU Limits   Memory Requests  Memory Limits  AGE
  ---------     ----                                 ------------  ----------   ---------------  -------------  ---
  kube-system   aws-node-qld9c                       10m (0%)      0 (0%)       0 (0%)           0 (0%)         28h
  kube-system   cluster-autoscaler-848d9c584f-nf5q4  100m (5%)     100m (5%)    300Mi (3%)       300Mi (3%)     28h
  kube-system   coredns-67cdb69b9b-4tvfb             100m (5%)     0 (0%)       70Mi (0%)        170Mi (2%)     28h
  kube-system   coredns-67cdb69b9b-94wln             100m (5%)     0 (0%)       70Mi (0%)        170Mi (2%)     28h
  kube-system   kube-proxy-jsssn                     100m (5%)     0 (0%)       0 (0%)           0 (0%)         28h
  kube-system   tiller-deploy-59d477f79-rjjbj        0 (0%)        0 (0%)       0 (0%)           0 (0%)         28h
  nasa-prod     autohttps-5c66cbb7b-p5q5c            0 (0%)        0 (0%)       0 (0%)           0 (0%)         28h
  nasa-prod     proxy-74869565f5-h7ght               200m (10%)    0 (0%)       512Mi (6%)       0 (0%)         28h
  nasa-prod     user-scheduler-6d66788464-86792      50m (2%)      0 (0%)       256Mi (3%)       0 (0%)         28h
  nasa-prod     user-scheduler-6d66788464-xbjwq      50m (2%)      0 (0%)       256Mi (3%)       0 (0%)         28h
  nasa-staging  autohttps-5c7b9c9b58-6d8r7           0 (0%)        0 (0%)       0 (0%)           0 (0%)         28h
  nasa-staging  hub-696fcc7c6c-m2mm5                 500m (25%)    1250m (62%)  1Gi (13%)        1Gi (13%)      11m
  nasa-staging  proxy-56c6b94475-nccgr               200m (10%)    0 (0%)       512Mi (6%)       0 (0%)         12h
  nasa-staging  user-scheduler-7d46c957bc-cc9rt      50m (2%)      0 (0%)       256Mi (3%)       0 (0%)         12h
  nasa-staging  user-scheduler-7d46c957bc-d8gpk      50m (2%)      0 (0%)       256Mi (3%)       0 (0%)         12h
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         1510m (75%)   1350m (67%)
  memory                      3512Mi (46%)  1664Mi (21%)
  ephemeral-storage           0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
```
And the error when trying to schedule a second hub pod is:

```
Warning  FailedScheduling  13s (x6 over 4m17s)  default-scheduler  0/2 nodes are available: 1 Insufficient cpu, 1 node(s) didn't match node selector.
```
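For the arithmetic behind that message: an m5.large has 2 vCPU, so roughly 2000m allocatable, and with 1510m already requested only about 490m is free, just under the hub's 500m request. The summary above comes from describing the node, roughly (node name is a placeholder):

```bash
# Show the per-node request/limit summary behind the table above;
# 1510m of ~2000m allocatable CPU is already requested, leaving ~490m,
# which is less than the hub's 500m request -- hence "Insufficient cpu".
kubectl describe node <node-name> | grep -A 8 "Allocated resources"
```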
cc @jhamman
Looks to me like you are requesting 0.5 CPU and you have 0.49 left on this node, nominally. (I don't think it will give you exactly 100%, possibly because of some overhead requirements.) Anyway, this is the behavior I would expect. Am I not understanding something?
The main issue is that I requested 400m CPU in `common.yaml` so that it would fit, but the pod still requests 500m.
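A quick way to confirm the mismatch (a sketch, assuming the hub Deployment is named `hub` and the hub container is first in the pod spec) is to ask the live Deployment what it is requesting:

```bash
# Print the resources block of the running hub Deployment in staging;
# after a successful upgrade this should show the 400m request, but it
# currently reports 500m.
kubectl -n nasa-staging get deployment hub \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'
```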
So, looking into this a little more, it seems related to the fact that `helm upgrade` can end up in an inconsistent state if a deployment fails at some point and other upgrades are then done on top. In brief, I think what happens is that helm only compares against "config - 1" (the previously recorded release config) and has some trouble with the deployments and replicasets that define the hub resources. So if the configuration with the new resource limits fails to deploy for some reason, subsequent `helm upgrade`s don't see any difference in the chart's resource limits and don't modify the deployment. See:
https://github.com/helm/helm/issues/1873#issuecomment-429460214
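One way to see what Tiller actually has on record for the release (release name taken from the failure messages below) is roughly:

```bash
# Values Tiller has stored for the current revision of the release
helm get values nasa-staging

# Revision history for the release (output shown below)
helm history nasa-staging
```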
Here is our helm history:

```
149  Sun Jul 28 15:24:22 2019  SUPERSEDED  pangeo-deploy-0.1.0  Upgrade complete
150  Mon Jul 29 05:10:41 2019  FAILED      pangeo-deploy-0.1.0  Upgrade "nasa-staging" failed: timed out waiting for the ...
151  Mon Jul 29 06:16:28 2019  FAILED      pangeo-deploy-0.1.0  Upgrade "nasa-staging" failed: kind DaemonSet with the na...
152  Mon Jul 29 17:32:45 2019  FAILED      pangeo-deploy-0.1.0  Upgrade "nasa-staging" failed: kind DaemonSet with the na...
153  Mon Jul 29 17:44:43 2019  SUPERSEDED  pangeo-deploy-0.1.0  Upgrade complete
154  Mon Jul 29 21:54:58 2019  SUPERSEDED  pangeo-deploy-0.1.0  Upgrade complete
155  Mon Jul 29 22:23:11 2019  DEPLOYED    pangeo-deploy-0.1.0  Upgrade complete
```
So we could roll back to revision 149 and redeploy with a new commit to staging, or probably just slightly tweak the resource requests and redeploy, which should also work (but might not be the cleanest approach).
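If we go the rollback route, it would be roughly the following (revision number from the history above), followed by a fresh deploy from CI:

```bash
# Roll the release back to the last revision that deployed cleanly,
# then push a new commit so CircleCI/hubploy upgrades on top of it
helm rollback nasa-staging 149
```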
There are all sorts of helpful flags for `helm upgrade`, such as `--reset-values`, `--recreate-pods`, and `--force`, to name a few that I use fairly often. I think `--reset-values` might solve the problem you are having. More on the various flags here. But by using hubploy in CircleCI we don't really have an easy way to pass additional flags without modifying the source here, which is pretty kludgy. I wonder if @yuvipanda has thought about allowing the use of some of these flags in the hubploy code.