pangeo-cloud-federation icon indicating copy to clipboard operation
pangeo-cloud-federation copied to clipboard

cluster autoscaler was broken on AWS

Open scottyhq opened this issue 3 years ago • 0 comments

tried to log into https://aws-uswest2.pangeo.io today but was not getting a server. turns out our core node (running on spot) was shut down and when the cluster-autoscaler pod was recreated it was crashing due to https://github.com/kubernetes/autoscaler/issues/3506.

just wanted to document here another case of things running on k8s will eventually break due to version incompatibilities or other things. The fix was pretty easy, and configuration files are here https://github.com/pangeo-data/pangeo-cloud-federation/pull/968.

Just had to give the autoscaler pod more memory, so we went from:

        - image: us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.16.5
           name: cluster-autoscaler
           resources:
             limits:
               cpu: 100m
               memory: 300Mi
             requests:
               cpu: 100m
               memory: 300Mi

to:

        - image: us.gcr.io/k8s-artifacts-prod/autoscaling/cluster-autoscaler:v1.16.7
           name: cluster-autoscaler
           resources:
             limits:
               cpu: 100m
               memory: 600Mi
             requests:
               cpu: 100m
               memory: 600Mi

scottyhq avatar Jul 22 '21 23:07 scottyhq