calendar scaler does not work for data100 - also I suspect it caused a hub outage
Bug description
My limited understanding of this tool is that it polls a calendar every minute, checks whether there are events scheduled, and if there are, it provisions placeholder pods that request a large amount of resources in order to get the autoscaler to scale up more nodes. Unfortunately for data100, we increased the size of the nodes, which breaks this mechanism.
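For concreteness, this is roughly how I picture that loop; it is just a sketch with made-up event data and a stand-in for the Kubernetes call, not the scaler's actual code:

```python
import time
from datetime import datetime, timedelta, timezone

# Stand-in for the calendar feed: (start, end, node_pool, replicas) tuples.
# In the real scaler these come from a shared calendar, not a hard-coded list.
EVENTS = [
    (datetime.now(timezone.utc),
     datetime.now(timezone.utc) + timedelta(minutes=15),
     "nb-data100", 6),
]

def active_events(now):
    return [e for e in EVENTS if e[0] <= now <= e[1]]

def set_placeholder_replicas(node_pool, count):
    # Stand-in for patching the placeholder Deployment's replica count.
    print(f"scale placeholders for {node_pool} to {count}")

while True:
    now = datetime.now(timezone.utc)
    for _start, _end, pool, replicas in active_events(now):
        set_placeholder_replicas(pool, replicas)
    time.sleep(60)  # poll roughly once a minute
```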
Basically, the placeholder pods request ~48GB of RAM, which with our typical node configuration forces a new node to come up. However, the data100 nodes have ~200GB of RAM, which means multiple placeholder pods can now be scheduled onto a single node. As a result, a scaling event may or may not cause additional nodes to come up.
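Back-of-the-envelope, the packing looks like this (the real numbers will be a little lower once kubelet/system reservations are subtracted):

```python
# How many ~48 GB placeholders fit on one of the new data100 nodes?
placeholder_request_gb = 48
data100_node_gb = 200    # approximate RAM on the new data100 nodes

print(data100_node_gb // placeholder_request_gb)
# -> 4: the scheduler can pack four placeholders onto a single data100 node,
#    so an event may bring up far fewer new nodes than intended, or none.
```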
Additionally, something bad happened while I was testing scaling over a 15-minute period. Once the event was over, the node-scaler crashed and all of the hubs went down. I suspect the node-scaler was not happy about having multiple placeholder pods scheduled on the same node, though how that translates into an outage I'm not sure.
During this period, the hub pods terminated and then got stuck on startup with the following log:
$ kubectl -n datahub-staging logs hub-75fb67c49d-nkgsw
Defaulted container "templates-sync" out of: templates-sync, hub, templates-clone (init)
Error from server (BadRequest): container "templates-sync" in pod "hub-75fb67c49d-nkgsw" is waiting to start: PodInitializing
After several minutes this issue resolved itself.
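If this happens again, checking the init container directly would probably be more informative than the defaulted templates-sync container. A quick sketch using the kubernetes Python client, with the pod and namespace from the log above:

```python
from kubernetes import client, config

# Assumes local kubeconfig access to the cluster.
config.load_kube_config()
v1 = client.CoreV1Api()

# Show the state of the init container (templates-clone) the pod is stuck on.
pod = v1.read_namespaced_pod("hub-75fb67c49d-nkgsw", "datahub-staging")
for status in pod.status.init_container_statuses or []:
    print(status.name, status.state)
```

kubectl describe pod and kubectl logs -c templates-clone on the stuck pod get at the same information from the command line.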
Environment & setup
data100 specifically, but all hubs were briefly impacted by an outage
How to reproduce
I suspect scheduling additional node scaling events for data100 will cause this, but only if bad luck results in multiple placeholders ending up on the same node. You could probably get it to happen reliably by increasing the node size requested for data100 by 2-3x over the current value.
I suspect the node scaler needs to be made smarter: it should probably look up the configuration of the node pool it receives events for and use that to decide how large the RAM requests for the placeholder pods need to be.
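As a rough sketch of what that could look like (the pool sizes, overhead allowance, and more-than-half heuristic here are all assumptions on my part, not anything the scaler currently does):

```python
# Hypothetical sizing helper: pick a placeholder memory request big enough
# that a second placeholder cannot fit on the same node, so each placeholder
# forces its own scale-up. Pool sizes and the overhead allowance below are
# illustrative assumptions, not values from our configs.
NODE_POOL_RAM_GB = {
    "nb-data100": 200,   # the big data100 nodes
    "nb-default": 52,    # a typical smaller pool
}

SYSTEM_OVERHEAD_GB = 8   # rough allowance for kubelet/system reservations

def placeholder_request_gb(node_pool: str) -> int:
    """RAM request sized so two placeholders can't share one node."""
    usable = NODE_POOL_RAM_GB[node_pool] - SYSTEM_OVERHEAD_GB
    # Anything over half the usable RAM prevents two placeholders from being
    # packed onto the same node.
    return usable // 2 + 1

print(placeholder_request_gb("nb-data100"))  # 97
print(placeholder_request_gb("nb-default"))  # 23
```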
As for the outage, I'm not sure what caused that.
Looking at this now, it looks like two of the three core nodes went away at the same time and were replaced by new nodes, causing this outage:
k get pod -A | rg ' hub-'
W1110 16:51:27.449959 87156 gcp.go:119] WARNING: the gcp auth plugin is deprecated in v1.22+, unavailable in v1.26+; use gcloud instead.
To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
a11y-prod hub-6fb85c85f-v5cg8 1/1 Running 0 2d1h
a11y-staging hub-77ddfbbd64-bdnrb 1/1 Running 0 2d1h
astro-prod hub-69495c6d4-5n4h2 1/1 Running 0 3h35m
astro-staging hub-856f68457-nqh8j 1/1 Running 0 3h35m
biology-prod hub-6586657f6b-wpvj5 1/1 Running 0 3h35m
biology-staging hub-77b4d4446c-vgqn8 1/1 Running 0 3h35m
cee-prod hub-6845fc9cc7-lwf4t 1/1 Running 0 3h35m
cee-staging hub-5f8f9ddd48-94crx 1/1 Running 0 3h35m
data100-prod hub-5d6575f7df-z9rvf 1/1 Running 0 2d23h
data100-staging hub-76c4545874-zsvnm 1/1 Running 0 3h35m
data101-prod hub-7dc97c765d-wqsz4 1/1 Running 0 3h35m
data101-staging hub-549b6f6767-7wp4r 1/1 Running 0 3h35m
data102-prod hub-84b95ffb94-zf744 1/1 Running 0 3h35m
data102-staging hub-6cdcdb6cdd-glxhh 1/1 Running 0 3h35m
data8-prod hub-6959c77766-52g24 1/1 Running 0 3h35m
data8-staging hub-65fc55fb5-xghzl 1/1 Running 0 3h35m
datahub-prod hub-66c49b496-22njc 2/2 Running 0 2d1h
datahub-staging hub-75fb67c49d-nkgsw 2/2 Running 0 3h35m
dlab-prod hub-55c44b9d74-nm6ls 1/1 Running 0 3h35m
dlab-staging hub-5d8c9646bb-xghfg 1/1 Running 0 3h35m
eecs-prod hub-7b56548979-rdqxc 1/1 Running 0 3h35m
eecs-staging hub-6bf7b87c87-nbxvv 1/1 Running 0 3h35m
highschool-prod hub-6586fb77cd-txhgv 1/1 Running 0 3h35m
highschool-staging hub-ddb644f87-mk6nm 1/1 Running 0 3h35m
ischool-prod hub-579d949669-97h7f 1/1 Running 0 3h35m
ischool-staging hub-5b7ddb4677-hf9fw 1/1 Running 0 3h35m
julia-prod hub-54d5c54b4-mj774 1/1 Running 0 3h35m
julia-staging hub-59844f76d6-db69z 1/1 Running 0 3h35m
prob140-prod hub-76658fbb6c-vvxw8 1/1 Running 0 3h35m
prob140-staging hub-58bc5fcbb7-mlwxb 1/1 Running 0 3h35m
publichealth-prod hub-6bd4fd6fd9-dqs6w 1/1 Running 0 3h35m
publichealth-staging hub-66bf76768d-kshx4 1/1 Running 0 3h35m
r-prod hub-754c4787d5-tssvw 1/1 Running 0 3h35m
r-staging hub-864468b54c-z6bnf 1/1 Running 0 3h35m
shiny-prod hub-7cf969d6c5-mlw2z 1/1 Running 0 3h35m
shiny-staging hub-55b67c66f-tt559 1/1 Running 0 3h35m
stat159-prod hub-67ffd8f74b-4rqwn 1/1 Running 0 3h35m
stat159-staging hub-7fd674b586-4hxgf 1/1 Running 0 3h35m
stat20-prod hub-86b8d7cc98-d27sh 1/1 Running 0 3h35m
stat20-staging hub-68b5656755-rjhjq 1/1 Running 0 3h35m
workshop-prod hub-74bb87cbcf-gtq92 1/1 Running 0 3h35m
workshop-staging hub-549b5bf99-6hqqj 1/1 Running 0 3h35m
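One way to double-check the core-node turnover is to list node creation timestamps for the core pool. A sketch with the kubernetes Python client; the node-purpose label is the usual zero-to-jupyterhub convention and is an assumption about how this cluster labels its pools:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# "hub.jupyter.org/node-purpose=core" is the usual zero-to-jupyterhub label
# for the core pool; adjust if this cluster labels its pools differently.
core_nodes = v1.list_node(label_selector="hub.jupyter.org/node-purpose=core").items
for node in sorted(core_nodes, key=lambda n: n.metadata.creation_timestamp):
    print(node.metadata.name, node.metadata.creation_timestamp)
```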
The cluster autoscaler logs might also have useful info: https://console.cloud.google.com/logs/query;query=resource.type%3D%22k8s_cluster%22%0Aresource.labels.project_id%3D%22ucb-datahub-2018%22%0Aresource.labels.location%3D%22us-central1%22%0Aresource.labels.cluster_name%3D%22fall-2019%22%0AlogName%3D%22projects%2Fucb-datahub-2018%2Flogs%2Fcontainer.googleapis.com%252Fcluster-autoscaler-visibility%22%20severity%3E%3DDEFAULT;timeRange=2022-11-10T21:14:51.526Z%2F2022-11-11T00:51:38.473Z;cursorTimestamp=2022-11-10T21:35:31.951865557Z?project=ucb-datahub-2018
The node-autoscaler doesn't actually touch the core pool itself, though, so this is possibly not directly related? I'm not 100% sure.