
Autoscaler helm chart fails with WebIdentityErr: failed to retrieve credentials even though AWS full config works with the same parameters

Open maegul opened this issue 2 years ago • 4 comments

Which component are you using?:

  • Cluster autoscaler, helm chart
  • helm chart (I suspect)

What version of the component are you using?:

  • helm chart: 9.16.2
  • cluster-autoscaler: v1.20.0 | v1.21.0 | v1.21.1

Component version: see above

What k8s version are you using (kubectl version)?: 1.20

What environment is this in?: AWS EKS

What did you expect to happen?: for the autoscaling pod to not crash (and then work 😄 )

What happened instead?:

  • Within ~1 minute the autoscaler pod crashes and quickly enters CrashLoopBackOff
  • Inspecting the log of the autoscaler pod reveals the cause to most likely be that the service account role and policy have not been properly associated or used (see my final comments below; a quick check of the IRSA wiring is sketched after this list):
E0406 02:32:59.407595       1 aws_manager.go:265] Failed to regenerate ASG cache: cannot autodiscover ASGs: WebIdentityErr: failed to retrieve credentials
caused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity
  • This appears immediately after the pod has gotten/requested all the EC2 types
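
A quick, hedged check of whether the IRSA wiring was applied at all (the label and namespace come from the chart's rendered output shown further below):

kubectl -n kube-system get pod -l app.kubernetes.io/name=aws-cluster-autoscaler \
  -o jsonpath='{.items[0].spec.containers[0].env}'
# AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE should both be present; if they are,
# the AccessDenied points at the IAM role's trust policy rather than the annotation itself.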

How to reproduce it (as minimally and precisely as possible):

  • create simple default cluster on AWS EKS with eksctl
  • Add an OIDC provider as instructed here on AWS Docs
  • create an IAM policy and role as instructed here on AWS Docs
  • install this helm chart
  • Create a values config file like the one below and deploy a release (an example install sketch follows the values file)
autoDiscovery:
  clusterName: jhubproto2

cloudProvider: aws
awsRegion: REGION

image:
  tag: v1.21.1

# small bump in resources from the chart default, in line with the AWS recommendation (600Mi memory);
# this seemed to be necessary: with the default of 300Mi the pod seemed to crash
resources:
  limits:
    cpu: 100m
    memory: 600Mi
  requests:
    cpu: 100m
    memory: 600Mi


# give the autoscaler the required iam role+policy to scale instances
rbac:
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: ROLE-ARN-FROM-ABOVE-STEP

# to ensure that the autoscaler pod itself is not scaled away?
podAnnotations:
  # annotation values must be strings, so false has to be quoted
  cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

# add arguments to the autoscaler container itself
extraArgs:
  logtostderr: true
  stderrthreshold: info
  v: 4
  # ... evenly distribute pods across nodes and AZs
  balance-similar-node-groups: true
  # ... allows for scaling to zero
  skip-nodes-with-system-pods: false
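
A hedged sketch of the commands behind these steps (assuming eksctl and the standard chart repo; the release name test and the kube-system namespace match the rendered output in the diff further below, and values.yaml is just a placeholder filename):

eksctl create cluster --name jhubproto2 --region REGION
eksctl utils associate-iam-oidc-provider --cluster jhubproto2 --approve
# ... create the IAM policy and role per the AWS docs linked above, then:
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
helm install test autoscaler/cluster-autoscaler --namespace kube-system --values values.yaml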

Anything else we need to know?:

  • Yes ... most importantly ...

  • The autoscaler works just fine, and the error reported in the logs above disappears, when the complete YAML config file recommended by AWS here is used instead of this helm chart, with everything else kept the same and the same parameters used.

  • This complete config can be found here, in this repo under the autoscaler/cluster-autoscaler/cloudprovider/aws section.

  • Below, I've provided a manual diff between the rendered output of the helm chart and this complete config (where I also had to reorder some sections so that they line up).

  • Without understanding all of it, some of the things that stand out to me as possible issues:

    • the AWS full config uses name: cluster-autoscaler while the chart uses an arbitrary, release-derived name (here test-aws-cluster-autoscaler) in many places instead. Given that the AWS role's trust policy is conditional on a service account ID ending in ...cluster-autoscaler, this might be a problem (a trust-policy check is sketched after the diff below).
    • Similarly, the full config specifies the namespace more often than the chart.
    • the full config mounts some cert directories (eg /etc/ssl/certs/ca-certificates.crt) where the chart appears not to be doing the same thing.
  • Without wanting to complain, there does seem to be some disconnect between this chart and AWS that is rather confusing to the non-expert. It's a shame that AWS and this chart seem somewhat incompatible at the moment (going off the other issues I've seen reported regarding AWS), given how much nicer it is to use helm.

  • At the moment though, unless I've done something silly here, I'd say it's best to just do what AWS recommend rather than try to use the autoscaler helm chart on AWS.


  • This diff was made from the AWS complete config file (cluster-autoscaler-autodiscover.yaml) to the rendered output of the helm chart (autoscaler_dry_run.yaml), i.e. the changes shown are those the helm chart makes relative to the full AWS config.
diff --git a/cluster-autoscaler-autodiscover.yaml b/autoscaler_dry_run.yaml
index 61027ce..717f535 100644
--- a/cluster-autoscaler-autodiscover.yaml
+++ b/autoscaler_dry_run.yaml
@@ -1,154 +1,282 @@
+NAME: test
+LAST DEPLOYED: Wed Apr  6 15:32:45 2022
+NAMESPACE: kube-system
+STATUS: pending-install
+REVISION: 1
+TEST SUITE: None
+HOOKS:
+MANIFEST:
 ---
+# Source: cluster-autoscaler/templates/serviceaccount.yaml
 apiVersion: v1
 kind: ServiceAccount
 metadata:
   labels:
-    k8s-addon: cluster-autoscaler.addons.k8s.io
-    k8s-app: cluster-autoscaler
-  name: cluster-autoscaler
-  namespace: kube-system
-  annotations:
+    app.kubernetes.io/instance: "test"
+    app.kubernetes.io/name: "aws-cluster-autoscaler"
+    app.kubernetes.io/managed-by: "Helm"
+    helm.sh/chart: "cluster-autoscaler-9.16.2"
+  name: test-aws-cluster-autoscaler
+  annotations: 
     eks.amazonaws.com/role-arn: ROLE-ARN
+automountServiceAccountToken: true
 ---
+# Source: cluster-autoscaler/templates/clusterrole.yaml
 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRole
 metadata:
-  name: cluster-autoscaler
   labels:
-    k8s-addon: cluster-autoscaler.addons.k8s.io
-    k8s-app: cluster-autoscaler
+    app.kubernetes.io/instance: "test"
+    app.kubernetes.io/name: "aws-cluster-autoscaler"
+    app.kubernetes.io/managed-by: "Helm"
+    helm.sh/chart: "cluster-autoscaler-9.16.2"
+  name: test-aws-cluster-autoscaler
 rules:
-  - apiGroups: [""]
-    resources: ["events", "endpoints"]
-    verbs: ["create", "patch"]
-  - apiGroups: [""]
-    resources: ["pods/eviction"]
-    verbs: ["create"]
-  - apiGroups: [""]
-    resources: ["pods/status"]
-    verbs: ["update"]
-  - apiGroups: [""]
-    resources: ["endpoints"]
-    resourceNames: ["cluster-autoscaler"]
-    verbs: ["get", "update"]
-  - apiGroups: [""]
-    resources: ["nodes"]
-    verbs: ["watch", "list", "get", "update"]
-  - apiGroups: [""]
-    resources:
-      - "namespaces"
-      - "pods"
-      - "services"
-      - "replicationcontrollers"
-      - "persistentvolumeclaims"
-      - "persistentvolumes"
-    verbs: ["watch", "list", "get"]
-  - apiGroups: ["extensions"]
-    resources: ["replicasets", "daemonsets"]
-    verbs: ["watch", "list", "get"]
-  - apiGroups: ["policy"]
-    resources: ["poddisruptionbudgets"]
-    verbs: ["watch", "list"]
-  - apiGroups: ["apps"]
-    resources: ["statefulsets", "replicasets", "daemonsets"]
-    verbs: ["watch", "list", "get"]
-  - apiGroups: ["storage.k8s.io"]
-    resources: ["storageclasses", "csinodes", "csidrivers", "csistoragecapacities"]
-    verbs: ["watch", "list", "get"]
-  - apiGroups: ["batch", "extensions"]
-    resources: ["jobs"]
-    verbs: ["get", "list", "watch", "patch"]
-  - apiGroups: ["coordination.k8s.io"]
-    resources: ["leases"]
-    verbs: ["create"]
-  - apiGroups: ["coordination.k8s.io"]
-    resourceNames: ["cluster-autoscaler"]
-    resources: ["leases"]
-    verbs: ["get", "update"]
+  - apiGroups:
+      - ""
+    resources:
+      - events
+      - endpoints
+    verbs:
+      - create
+      - patch
+  - apiGroups:
+    - ""
+    resources:
+    - pods/eviction
+    verbs:
+    - create
+  - apiGroups:
+      - ""
+    resources:
+      - pods/status
+    verbs:
+      - update
+  - apiGroups:
+      - ""
+    resources:
+      - endpoints
+    resourceNames:
+      - cluster-autoscaler
+    verbs:
+      - get
+      - update
+  - apiGroups:
+      - ""
+    resources:
+      - nodes
+    verbs:
+    - watch
+    - list
+    - get
+    - update
+  - apiGroups:
+    - ""
+    resources:
+      - namespaces
+      - pods
+      - services
+      - replicationcontrollers
+      - persistentvolumeclaims
+      - persistentvolumes
+    verbs:
+      - watch
+      - list
+      - get
+  - apiGroups:
+    - batch
+    resources:
+      - jobs
+      - cronjobs
+    verbs:
+      - watch
+      - list
+      - get
+  - apiGroups:
+    - batch
+    - extensions
+    resources:
+    - jobs
+    verbs:
+    - get
+    - list
+    - patch
+    - watch
+  - apiGroups:
+      - extensions
+    resources:
+      - replicasets
+      - daemonsets
+    verbs:
+      - watch
+      - list
+      - get
+  - apiGroups:
+      - policy
+    resources:
+      - poddisruptionbudgets
+    verbs:
+      - watch
+      - list
+  - apiGroups:
+    - apps
+    resources:
+    - daemonsets
+    - replicasets
+    - statefulsets
+    verbs:
+    - watch
+    - list
+    - get
+  - apiGroups:
+    - storage.k8s.io
+    resources:
+    - storageclasses
+    - csinodes
+    - csidrivers
+    - csistoragecapacities
+    verbs:
+    - watch
+    - list
+    - get
+  - apiGroups:
+      - ""
+    resources:
+      - configmaps
+    verbs:
+      - list
+      - watch
+  - apiGroups:
+    - coordination.k8s.io
+    resources:
+    - leases
+    verbs:
+    - create
+  - apiGroups:
+    - coordination.k8s.io
+    resourceNames:
+    - cluster-autoscaler
+    resources:
+    - leases
+    verbs:
+    - get
+    - update
 ---
+# Source: cluster-autoscaler/templates/role.yaml
 apiVersion: rbac.authorization.k8s.io/v1
 kind: Role
 metadata:
-  name: cluster-autoscaler
-  namespace: kube-system
   labels:
-    k8s-addon: cluster-autoscaler.addons.k8s.io
-    k8s-app: cluster-autoscaler
+    app.kubernetes.io/instance: "test"
+    app.kubernetes.io/name: "aws-cluster-autoscaler"
+    app.kubernetes.io/managed-by: "Helm"
+    helm.sh/chart: "cluster-autoscaler-9.16.2"
+  name: test-aws-cluster-autoscaler
 rules:
-  - apiGroups: [""]
-    resources: ["configmaps"]
-    verbs: ["create","list","watch"]
-  - apiGroups: [""]
-    resources: ["configmaps"]
-    resourceNames: ["cluster-autoscaler-status", "cluster-autoscaler-priority-expander"]
-    verbs: ["delete", "get", "update", "watch"]
-
+  - apiGroups:
+      - ""
+    resources:
+      - configmaps
+    verbs:
+      - create
+  - apiGroups:
+      - ""
+    resources:
+      - configmaps
+    resourceNames:
+      - cluster-autoscaler-status
+    verbs:
+      - delete
+      - get
+      - update
 ---
+# Source: cluster-autoscaler/templates/clusterrolebinding.yaml
 apiVersion: rbac.authorization.k8s.io/v1
 kind: ClusterRoleBinding
 metadata:
-  name: cluster-autoscaler
   labels:
-    k8s-addon: cluster-autoscaler.addons.k8s.io
-    k8s-app: cluster-autoscaler
+    app.kubernetes.io/instance: "test"
+    app.kubernetes.io/name: "aws-cluster-autoscaler"
+    app.kubernetes.io/managed-by: "Helm"
+    helm.sh/chart: "cluster-autoscaler-9.16.2"
+  name: test-aws-cluster-autoscaler
 roleRef:
   apiGroup: rbac.authorization.k8s.io
   kind: ClusterRole
-  name: cluster-autoscaler
+  name: test-aws-cluster-autoscaler
 subjects:
   - kind: ServiceAccount
-    name: cluster-autoscaler
+    name: test-aws-cluster-autoscaler
     namespace: kube-system
-
 ---
+# Source: cluster-autoscaler/templates/rolebinding.yaml
 apiVersion: rbac.authorization.k8s.io/v1
 kind: RoleBinding
 metadata:
-  name: cluster-autoscaler
-  namespace: kube-system
   labels:
-    k8s-addon: cluster-autoscaler.addons.k8s.io
-    k8s-app: cluster-autoscaler
+    app.kubernetes.io/instance: "test"
+    app.kubernetes.io/name: "aws-cluster-autoscaler"
+    app.kubernetes.io/managed-by: "Helm"
+    helm.sh/chart: "cluster-autoscaler-9.16.2"
+  name: test-aws-cluster-autoscaler
 roleRef:
   apiGroup: rbac.authorization.k8s.io
   kind: Role
-  name: cluster-autoscaler
+  name: test-aws-cluster-autoscaler
 subjects:
   - kind: ServiceAccount
-    name: cluster-autoscaler
+    name: test-aws-cluster-autoscaler
     namespace: kube-system
-
 ---
+# Source: cluster-autoscaler/templates/deployment.yaml
 apiVersion: apps/v1
 kind: Deployment
 metadata:
-  name: cluster-autoscaler
-  namespace: kube-system
   labels:
-    app: cluster-autoscaler
+    app.kubernetes.io/instance: "test"
+    app.kubernetes.io/name: "aws-cluster-autoscaler"
+    app.kubernetes.io/managed-by: "Helm"
+    helm.sh/chart: "cluster-autoscaler-9.16.2"
+  name: test-aws-cluster-autoscaler
 spec:
   replicas: 1
   selector:
     matchLabels:
-      app: cluster-autoscaler
+      app.kubernetes.io/instance: "test"
+      app.kubernetes.io/name: "aws-cluster-autoscaler"
   template:
     metadata:
-      labels:
-        app: cluster-autoscaler
       annotations:
-        prometheus.io/scrape: 'true'
-        prometheus.io/port: '8085'
         cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
+      labels:
+        app.kubernetes.io/instance: "test"
+        app.kubernetes.io/name: "aws-cluster-autoscaler"
     spec:
-      priorityClassName: system-cluster-critical
-      securityContext:
-        runAsNonRoot: true
-        runAsUser: 65534
-        fsGroup: 65534
-      serviceAccountName: cluster-autoscaler
+      priorityClassName: "system-cluster-critical"
+      dnsPolicy: "ClusterFirst"
       containers:
-        - image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.0
-          name: cluster-autoscaler
+        - name: aws-cluster-autoscaler
+          image: "k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.1"
+          imagePullPolicy: "IfNotPresent"
+          command:
+            - ./cluster-autoscaler
+            - --cloud-provider=aws
+            - --namespace=kube-system
+            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/jhubproto2
+            - --balance-similar-node-groups=true
+            - --logtostderr=true
+            - --skip-nodes-with-system-pods=false
+            - --stderrthreshold=info
+            - --v=4
+          env:
+            - name: AWS_REGION
+              value: "MY REGION"
+          livenessProbe:
+            httpGet:
+              path: /health-check
+              port: 8085
+          ports:
+            - containerPort: 8085
           resources:
             limits:
               cpu: 100m
@@ -156,22 +284,51 @@ spec:
             requests:
               cpu: 100m
               memory: 600Mi
-          command:
-            - ./cluster-autoscaler
-            - --v=4
-            - --stderrthreshold=info
-            - --cloud-provider=aws
-            - --skip-nodes-with-local-storage=false
-            - --expander=least-waste
-            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/jhubproto2
-            - --balance-similar-node-groups
-            - --skip-nodes-with-system-pods=false
-          volumeMounts:
-            - name: ssl-certs
-              mountPath: /etc/ssl/certs/ca-certificates.crt #/etc/ssl/certs/ca-bundle.crt for Amazon Linux Worker Nodes
-              readOnly: true
-          imagePullPolicy: "Always"
-      volumes:
-        - name: ssl-certs
-          hostPath:
-            path: "/etc/ssl/certs/ca-bundle.crt"
+      serviceAccountName: test-aws-cluster-autoscaler
+      tolerations:
+        []
+
+NOTES:
+To verify that cluster-autoscaler has started, run:
+
+  kubectl --namespace=kube-system get pods -l "app.kubernetes.io/name=aws-cluster-autoscaler,app.kubernetes.io/instance=test"
+---
+# Source: cluster-autoscaler/templates/service.yaml
+apiVersion: v1
+kind: Service
+metadata:
+  labels:
+    app.kubernetes.io/instance: "test"
+    app.kubernetes.io/name: "aws-cluster-autoscaler"
+    app.kubernetes.io/managed-by: "Helm"
+    helm.sh/chart: "cluster-autoscaler-9.16.2"
+  name: test-aws-cluster-autoscaler
+spec:
+  ports:
+    - port: 8085
+      protocol: TCP
+      targetPort: 8085
+      name: http
+  selector:
+    app.kubernetes.io/instance: "test"
+    app.kubernetes.io/name: "aws-cluster-autoscaler"
+  type: "ClusterIP"
+---
+# Source: cluster-autoscaler/templates/pdb.yaml
+apiVersion: policy/v1beta1
+kind: PodDisruptionBudget
+metadata:
+  labels:
+    app.kubernetes.io/instance: "test"
+    app.kubernetes.io/name: "aws-cluster-autoscaler"
+    app.kubernetes.io/managed-by: "Helm"
+    helm.sh/chart: "cluster-autoscaler-9.16.2"
+  name: test-aws-cluster-autoscaler
+spec:
+  selector:
+    matchLabels:
+      app.kubernetes.io/instance: "test"
+      app.kubernetes.io/name: "aws-cluster-autoscaler"
+
+  maxUnavailable: 1
+---
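
To make the naming concern above concrete, a hedged way to inspect the IAM role's trust policy (the role name here is a placeholder for whatever was created from the AWS docs):

aws iam get-role --role-name AmazonEKSClusterAutoscalerRole \
  --query 'Role.AssumeRolePolicyDocument' --output json
# If the StringEquals condition pins "<OIDC_PROVIDER>:sub" to
# system:serviceaccount:kube-system:cluster-autoscaler, then the chart's
# test-aws-cluster-autoscaler service account cannot pass sts:AssumeRoleWithWebIdentity,
# which would explain the AccessDenied in the log above.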

maegul avatar Apr 06 '22 06:04 maegul

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jul 05 '22 07:07 k8s-triage-robot

I think I am having the exact same issue. It's a shame to see that you had to give up and not use the helm chart. If anyone has an idea how to fix this, that would be amazing. Meanwhile, I am going to try some troubleshooting of my own.

InbarRose avatar Jul 26 '22 07:07 InbarRose

Not sure if this would fix your issue, but I actually found a way to make it work: use eksctl create iamserviceaccount with SERVICE_ACCOUNT_NAME to create the service account from the policy (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#iam-policy), then pass:

    --set rbac.serviceAccount.create=false \
    --set rbac.serviceAccount.name=SERVICE_ACCOUNT_NAME

to the helm install command, instead of rbac.serviceAccount.annotations."eks\.amazonaws\.com/role-arn". This worked for me; the full commands are sketched below.
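
A hedged sketch of that route (the cluster name, AWS account ID and policy name are placeholders; SERVICE_ACCOUNT_NAME as above):

eksctl create iamserviceaccount \
  --cluster CLUSTER_NAME \
  --namespace kube-system \
  --name SERVICE_ACCOUNT_NAME \
  --attach-policy-arn arn:aws:iam::111122223333:policy/AmazonEKSClusterAutoscalerPolicy \
  --approve

helm install my-release autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=CLUSTER_NAME \
  --set cloudProvider=aws \
  --set awsRegion=REGION \
  --set rbac.serviceAccount.create=false \
  --set rbac.serviceAccount.name=SERVICE_ACCOUNT_NAME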

InbarRose avatar Jul 26 '22 08:07 InbarRose

Thanks @InbarRose . I’ll keep this in mind next time I loop back on this.

However, it's important to mention that, last I checked, the AWS docs recommend using their own complete config file rather than the helm chart (as mentioned in the original post; link to docs: https://docs.aws.amazon.com/eks/latest/userguide/autoscaling.html).

Relying on the AWS config is probably a more reliable approach until the cause of this issue is understood. Your suggestions might reveal that, but I don’t know.
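
For reference, a hedged sketch of that AWS-recommended route (the URL points at the example manifest in this repo that the AWS docs use; the cluster name placeholder inside it has to be edited before applying):

curl -o cluster-autoscaler-autodiscover.yaml \
  https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
# edit the k8s.io/cluster-autoscaler/<YOUR CLUSTER NAME> tag in --node-group-auto-discovery, then:
kubectl apply -f cluster-autoscaler-autodiscover.yaml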

maegul avatar Jul 27 '22 07:07 maegul

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Aug 26 '22 08:08 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Sep 25 '22 08:09 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Sep 25 '22 08:09 k8s-ci-robot