calendar scaler does not work for data100 - also I suspect it caused a hub outage
Bug description
My limited understanding of this tool is that it polls a calendar every minute, checks whether there are events scheduled, and if there are, it provisions placeholder pods that request a large amount of resources in order to get the autoscaler to scale up more nodes. Unfortunately for data100, we increased the size of the nodes, which breaks this mechanism.
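For concreteness, this is roughly how I picture that loop; it is just a sketch with made-up event data and a stand-in for the Kubernetes call, not the scaler's actual code:

```python
import time
from datetime import datetime, timedelta, timezone

# Stand-in for the calendar feed: (start, end, node_pool, replicas) tuples.
# In the real scaler these come from a shared calendar, not a hard-coded list.
EVENTS = [
    (datetime.now(timezone.utc),
     datetime.now(timezone.utc) + timedelta(minutes=15),
     "nb-data100", 6),
]

def active_events(now):
    return [e for e in EVENTS if e[0] <= now <= e[1]]

def set_placeholder_replicas(node_pool, count):
    # Stand-in for patching the placeholder Deployment's replica count.
    print(f"scale placeholders for {node_pool} to {count}")

while True:
    now = datetime.now(timezone.utc)
    for _start, _end, pool, replicas in active_events(now):
        set_placeholder_replicas(pool, replicas)
    time.sleep(60)  # poll roughly once a minute
```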
Basically, the placeholder pods request ~48GB of RAM, which with our typical node configuration forces a new node to come up. However, the data100 nodes have ~200GB of RAM, which means multiple placeholder pods can now be scheduled onto a single node. As a result, a scaling event may or may not cause additional nodes to come up.
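Back-of-the-envelope, the packing looks like this (the real numbers will be a little lower once kubelet/system reservations are subtracted):

```python
# How many ~48 GB placeholders fit on one of the new data100 nodes?
placeholder_request_gb = 48
data100_node_gb = 200    # approximate RAM on the new data100 nodes

print(data100_node_gb // placeholder_request_gb)
# -> 4: the scheduler can pack four placeholders onto a single data100 node,
#    so an event may bring up far fewer new nodes than intended, or none.
```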
Additionally, something bad happened while I was testing scaling over a 15-minute period. Once the event was over, the node-scaler crashed and all of the hubs went down. I suspect the node-scaler was not happy about having multiple placeholder pods scheduled on the same node, though how that translates into an outage I'm not sure.
During this period, the hub pods terminated and then got stuck on startup with the following log:
$ kubectl -n datahub-staging logs hub-75fb67c49d-nkgsw
Defaulted container "templates-sync" out of: templates-sync, hub, templates-clone (init)
Error from server (BadRequest): container "templates-sync" in pod "hub-75fb67c49d-nkgsw" is waiting to start: PodInitializing
After several minutes this issue resolved itself.
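If this happens again, checking the init container directly would probably be more informative than the defaulted templates-sync container. A quick sketch using the kubernetes Python client, with the pod and namespace from the log above:

```python
from kubernetes import client, config

# Assumes local kubeconfig access to the cluster.
config.load_kube_config()
v1 = client.CoreV1Api()

# Show the state of the init container (templates-clone) the pod is stuck on.
pod = v1.read_namespaced_pod("hub-75fb67c49d-nkgsw", "datahub-staging")
for status in pod.status.init_container_statuses or []:
    print(status.name, status.state)
```

kubectl describe pod and kubectl logs -c templates-clone on the stuck pod get at the same information from the command line.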
Environment & setup
data100 specifically, but all hubs were briefly impacted by an outage
How to reproduce
I suspect scheduling additional node scaling events for data100 will cause this, but only if bad luck results in multiple placeholders ending up on the same node. You could probably get it to happen reliably by increasing the node size requested for data100 by 2-3x over the current value.
I suspect the node scaler needs to be made smarter: it should probably look up the configuration of the node pool it receives events for and use that to decide how large the RAM requests for the placeholder pods need to be.
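As a rough sketch of what that could look like (the pool sizes, overhead allowance, and more-than-half heuristic here are all assumptions on my part, not anything the scaler currently does):

```python
# Hypothetical sizing helper: pick a placeholder memory request big enough
# that a second placeholder cannot fit on the same node, so each placeholder
# forces its own scale-up. Pool sizes and the overhead allowance below are
# illustrative assumptions, not values from our configs.
NODE_POOL_RAM_GB = {
    "nb-data100": 200,   # the big data100 nodes
    "nb-default": 52,    # a typical smaller pool
}

SYSTEM_OVERHEAD_GB = 8   # rough allowance for kubelet/system reservations

def placeholder_request_gb(node_pool: str) -> int:
    """RAM request sized so two placeholders can't share one node."""
    usable = NODE_POOL_RAM_GB[node_pool] - SYSTEM_OVERHEAD_GB
    # Anything over half the usable RAM prevents two placeholders from being
    # packed onto the same node.
    return usable // 2 + 1

print(placeholder_request_gb("nb-data100"))  # 97
print(placeholder_request_gb("nb-default"))  # 23
```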
As for the outage, I'm not sure what caused that.
Looking at this now, it looks like two of the three core nodes went away at the same time and were replaced by new nodes, causing this outage:
k get pod -A | rg ' hub-'
W1110 16:51:27.449959 87156 gcp.go:119] WARNING: the gcp auth plugin is deprecated in v1.22+, unavailable in v1.26+; use gcloud instead.
To learn more, consult https://cloud.google.com/blog/products/containers-kubernetes/kubectl-auth-changes-in-gke
a11y-prod hub-6fb85c85f-v5cg8 1/1 Running 0 2d1h
a11y-staging hub-77ddfbbd64-bdnrb 1/1 Running 0 2d1h
astro-prod hub-69495c6d4-5n4h2 1/1 Running 0 3h35m
astro-staging hub-856f68457-nqh8j 1/1 Running 0 3h35m
biology-prod hub-6586657f6b-wpvj5 1/1 Running 0 3h35m
biology-staging hub-77b4d4446c-vgqn8 1/1 Running 0 3h35m
cee-prod hub-6845fc9cc7-lwf4t 1/1 Running 0 3h35m
cee-staging hub-5f8f9ddd48-94crx 1/1 Running 0 3h35m
data100-prod hub-5d6575f7df-z9rvf 1/1 Running 0 2d23h
data100-staging hub-76c4545874-zsvnm 1/1 Running 0 3h35m
data101-prod hub-7dc97c765d-wqsz4 1/1 Running 0 3h35m
data101-staging hub-549b6f6767-7wp4r 1/1 Running 0 3h35m
data102-prod hub-84b95ffb94-zf744 1/1 Running 0 3h35m
data102-staging hub-6cdcdb6cdd-glxhh 1/1 Running 0 3h35m
data8-prod hub-6959c77766-52g24 1/1 Running 0 3h35m
data8-staging hub-65fc55fb5-xghzl 1/1 Running 0 3h35m
datahub-prod hub-66c49b496-22njc 2/2 Running 0 2d1h
datahub-staging hub-75fb67c49d-nkgsw 2/2 Running 0 3h35m
dlab-prod hub-55c44b9d74-nm6ls 1/1 Running 0 3h35m
dlab-staging hub-5d8c9646bb-xghfg 1/1 Running 0 3h35m
eecs-prod hub-7b56548979-rdqxc 1/1 Running 0 3h35m
eecs-staging hub-6bf7b87c87-nbxvv 1/1 Running 0 3h35m
highschool-prod hub-6586fb77cd-txhgv 1/1 Running 0 3h35m
highschool-staging hub-ddb644f87-mk6nm 1/1 Running 0 3h35m
ischool-prod hub-579d949669-97h7f 1/1 Running 0 3h35m
ischool-staging hub-5b7ddb4677-hf9fw 1/1 Running 0 3h35m
julia-prod hub-54d5c54b4-mj774 1/1 Running 0 3h35m
julia-staging hub-59844f76d6-db69z 1/1 Running 0 3h35m
prob140-prod hub-76658fbb6c-vvxw8 1/1 Running 0 3h35m
prob140-staging hub-58bc5fcbb7-mlwxb 1/1 Running 0 3h35m
publichealth-prod hub-6bd4fd6fd9-dqs6w 1/1 Running 0 3h35m
publichealth-staging hub-66bf76768d-kshx4 1/1 Running 0 3h35m
r-prod hub-754c4787d5-tssvw 1/1 Running 0 3h35m
r-staging hub-864468b54c-z6bnf 1/1 Running 0 3h35m
shiny-prod hub-7cf969d6c5-mlw2z 1/1 Running 0 3h35m
shiny-staging hub-55b67c66f-tt559 1/1 Running 0 3h35m
stat159-prod hub-67ffd8f74b-4rqwn 1/1 Running 0 3h35m
stat159-staging hub-7fd674b586-4hxgf 1/1 Running 0 3h35m
stat20-prod hub-86b8d7cc98-d27sh 1/1 Running 0 3h35m
stat20-staging hub-68b5656755-rjhjq 1/1 Running 0 3h35m
workshop-prod hub-74bb87cbcf-gtq92 1/1 Running 0 3h35m
workshop-staging hub-549b5bf99-6hqqj 1/1 Running 0 3h35m
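One way to double-check the core-node turnover is to list node creation timestamps for the core pool. A sketch with the kubernetes Python client; the node-purpose label is the usual zero-to-jupyterhub convention and is an assumption about how this cluster labels its pools:

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# "hub.jupyter.org/node-purpose=core" is the usual zero-to-jupyterhub label
# for the core pool; adjust if this cluster labels its pools differently.
core_nodes = v1.list_node(label_selector="hub.jupyter.org/node-purpose=core").items
for node in sorted(core_nodes, key=lambda n: n.metadata.creation_timestamp):
    print(node.metadata.name, node.metadata.creation_timestamp)
```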
The cluster autoscaler logs might also have useful info: https://console.cloud.google.com/logs/query;query=resource.type%3D%22k8s_cluster%22%0Aresource.labels.project_id%3D%22ucb-datahub-2018%22%0Aresource.labels.location%3D%22us-central1%22%0Aresource.labels.cluster_name%3D%22fall-2019%22%0AlogName%3D%22projects%2Fucb-datahub-2018%2Flogs%2Fcontainer.googleapis.com%252Fcluster-autoscaler-visibility%22%20severity%3E%3DDEFAULT;timeRange=2022-11-10T21:14:51.526Z%2F2022-11-11T00:51:38.473Z;cursorTimestamp=2022-11-10T21:35:31.951865557Z?project=ucb-datahub-2018
The node-autoscaler doesn't actually touch the core pool itself, though, so this is possibly not directly related? I'm not 100% sure.