Autoscaler helm chart fails with WebIdentityErr: failed to retrieve credentials even though AWS full config works with the same parameters
Which component are you using?:
- Cluster autoscaler, helm chart
- helm chart (I suspect)
What version of the component are you using?:
- helm chart: 9.16.2
- cluster-autoscaler: v1.20.0 | v1.21.0 | v1.21.1
Component version: see above
What k8s version are you using (kubectl version)?: 1.20
What environment is this in?: AWS EKS
What did you expect to happen?: for the autoscaling pod to not crash (and then work 😄 )
What happened instead?:
- Within ~1 minute the autoscaler pod crashes and quickly enters a CrashLoopBackOff status.
- Inspecting the log of the autoscaler pod reveals the cause to (most likely) be that the service account role and policy have not been properly associated or used (see my final comments below, and the quick sanity check sketched right after this list):
E0406 02:32:59.407595 1 aws_manager.go:265] Failed to regenerate ASG cache: cannot autodiscover ASGs: WebIdentityErr: failed to retrieve credentials
caused by: AccessDenied: Not authorized to perform sts:AssumeRoleWithWebIdentity
- This appears immediately after the pod has fetched/requested all the EC2 instance types.
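For anyone hitting the same thing, a minimal sanity check I would run (a sketch; the service account name and pod labels below are taken from the chart's rendered output further down, release name test):

# does the service account the pod uses carry the role-arn annotation?
kubectl -n kube-system get sa test-aws-cluster-autoscaler -o yaml

# did the EKS pod identity webhook inject the web-identity env vars and token?
POD=$(kubectl -n kube-system get pods \
  -l "app.kubernetes.io/name=aws-cluster-autoscaler,app.kubernetes.io/instance=test" \
  -o jsonpath='{.items[0].metadata.name}')
kubectl -n kube-system exec "$POD" -- env | grep -E 'AWS_ROLE_ARN|AWS_WEB_IDENTITY_TOKEN_FILE'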
How to reproduce it (as minimally and precisely as possible):
- Create a simple default cluster on AWS EKS with eksctl.
- Add an OIDC provider as instructed here on AWS Docs.
- Create an IAM policy and role as instructed here on AWS Docs.
- Install this helm chart.
- Create a values config file like below and deploy a release (deploy commands are sketched just after the values).
autoDiscovery:
  clusterName: jhubproto2

cloudProvider: aws
awsRegion: REGION

image:
  tag: v1.21.1

# small bump in resources from the default, in line with the AWS recommendation (more memory: 600Mi)
# seemed to be necessary; with the default of 300Mi it seemed to crash
resources:
  limits:
    cpu: 100m
    memory: 600Mi
  requests:
    cpu: 100m
    memory: 600Mi

# give the autoscaler the required IAM role+policy to scale instances
rbac:
  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: ROLE-ARN-FROM-ABOVE-STEP

# to ensure that the autoscaler pod itself is not scaled away?
podAnnotations:
  # false needs to be a string (don't know why)
  cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

# add arguments to the autoscaler container itself
extraArgs:
  logtostderr: true
  stderrthreshold: info
  v: 4
  # ... evenly distribute pods across nodes and AZs
  balance-similar-node-groups: true
  # ... allows for scaling to zero
  skip-nodes-with-system-pods: false
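For reference, a sketch of how such a release can be deployed with this values file. The repo alias, release name test, and the filename values.yaml are my assumptions (the release name matches the rendered output further down), and the eksctl line is one way of doing the OIDC-provider step above:

# associate an OIDC provider with the cluster
eksctl utils associate-iam-oidc-provider --cluster jhubproto2 --approve

# add the chart repo and deploy the release with the values above
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update
helm install test autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --values values.yaml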
Anything else we need to know?:
- Yes ... most importantly ...
- The autoscaler works just fine, and the error in the logs reported above disappears, when the complete yaml config file recommended by AWS here is used instead of this helm chart, with everything else kept the same and the same parameters used.
- This complete config can be found here, in this repo under the autoscaler/cluster-autoscaler/cloudprovider/aws section.
- Below, I've provided a manual diff between the rendered output of the helm chart and this complete config (where I also had to reorder some of the sections so that they line up).
- Without understanding all of it, some of the things that stand out to me as possible issues:
  - The AWS full config uses name: cluster-autoscaler throughout, while the chart uses an arbitrary release-derived name many times instead. Given that the AWS role is conditional on a service account ID of ...cluster-autoscaler, this might be a problem (see the trust-policy check sketched just before the diff).
  - Similarly, the full config specifies the namespace more often than the chart.
  - The full config mounts some cert directories (e.g. /etc/ssl/certs/ca-certificates.crt), where the chart appears not to do the same thing.
- Without wanting to complain, there does seem to be some disconnect here between this chart and AWS that is rather confusing to the non-expert. It's a shame that AWS and this chart seem to be somewhat incompatible at the moment (going off the other issues I've seen reported regarding AWS), given how much nicer it is to use helm.
- At the moment, though, unless I've done something silly here, I'd say it's best to just do what AWS recommends rather than try to use the autoscaler helm chart on AWS.
- This diff was made from the AWS complete config file (cluster-autoscaler-autodiscover.yaml) to the rendered output of the helm chart (autoscaler_dry_run.yaml), i.e. the changes show what the helm chart does relative to the full AWS config file.
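On the first point above: if the IAM role was created following the AWS docs, its trust policy carries a StringEquals condition pinning the exact namespace:serviceaccount it will trust, which can be checked roughly like this (the role name below is a placeholder):

# show the trust policy behind the role-arn used in the annotation
aws iam get-role --role-name CLUSTER-AUTOSCALER-ROLE \
  --query 'Role.AssumeRolePolicyDocument'

# the "...:sub" condition typically expects
#   system:serviceaccount:kube-system:cluster-autoscaler
# whereas the chart's service account here is test-aws-cluster-autoscaler,
# so AssumeRoleWithWebIdentity is denied unless the condition (or the
# chart's service account name) is adjusted to match.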
diff --git a/cluster-autoscaler-autodiscover.yaml b/autoscaler_dry_run.yaml
index 61027ce..717f535 100644
--- a/cluster-autoscaler-autodiscover.yaml
+++ b/autoscaler_dry_run.yaml
@@ -1,154 +1,282 @@
+NAME: test
+LAST DEPLOYED: Wed Apr 6 15:32:45 2022
+NAMESPACE: kube-system
+STATUS: pending-install
+REVISION: 1
+TEST SUITE: None
+HOOKS:
+MANIFEST:
---
+# Source: cluster-autoscaler/templates/serviceaccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
labels:
- k8s-addon: cluster-autoscaler.addons.k8s.io
- k8s-app: cluster-autoscaler
- name: cluster-autoscaler
- namespace: kube-system
- annotations:
+ app.kubernetes.io/instance: "test"
+ app.kubernetes.io/name: "aws-cluster-autoscaler"
+ app.kubernetes.io/managed-by: "Helm"
+ helm.sh/chart: "cluster-autoscaler-9.16.2"
+ name: test-aws-cluster-autoscaler
+ annotations:
eks.amazonaws.com/role-arn: ROLE-ARN
+automountServiceAccountToken: true
---
+# Source: cluster-autoscaler/templates/clusterrole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
- name: cluster-autoscaler
labels:
- k8s-addon: cluster-autoscaler.addons.k8s.io
- k8s-app: cluster-autoscaler
+ app.kubernetes.io/instance: "test"
+ app.kubernetes.io/name: "aws-cluster-autoscaler"
+ app.kubernetes.io/managed-by: "Helm"
+ helm.sh/chart: "cluster-autoscaler-9.16.2"
+ name: test-aws-cluster-autoscaler
rules:
- - apiGroups: [""]
- resources: ["events", "endpoints"]
- verbs: ["create", "patch"]
- - apiGroups: [""]
- resources: ["pods/eviction"]
- verbs: ["create"]
- - apiGroups: [""]
- resources: ["pods/status"]
- verbs: ["update"]
- - apiGroups: [""]
- resources: ["endpoints"]
- resourceNames: ["cluster-autoscaler"]
- verbs: ["get", "update"]
- - apiGroups: [""]
- resources: ["nodes"]
- verbs: ["watch", "list", "get", "update"]
- - apiGroups: [""]
- resources:
- - "namespaces"
- - "pods"
- - "services"
- - "replicationcontrollers"
- - "persistentvolumeclaims"
- - "persistentvolumes"
- verbs: ["watch", "list", "get"]
- - apiGroups: ["extensions"]
- resources: ["replicasets", "daemonsets"]
- verbs: ["watch", "list", "get"]
- - apiGroups: ["policy"]
- resources: ["poddisruptionbudgets"]
- verbs: ["watch", "list"]
- - apiGroups: ["apps"]
- resources: ["statefulsets", "replicasets", "daemonsets"]
- verbs: ["watch", "list", "get"]
- - apiGroups: ["storage.k8s.io"]
- resources: ["storageclasses", "csinodes", "csidrivers", "csistoragecapacities"]
- verbs: ["watch", "list", "get"]
- - apiGroups: ["batch", "extensions"]
- resources: ["jobs"]
- verbs: ["get", "list", "watch", "patch"]
- - apiGroups: ["coordination.k8s.io"]
- resources: ["leases"]
- verbs: ["create"]
- - apiGroups: ["coordination.k8s.io"]
- resourceNames: ["cluster-autoscaler"]
- resources: ["leases"]
- verbs: ["get", "update"]
+ - apiGroups:
+ - ""
+ resources:
+ - events
+ - endpoints
+ verbs:
+ - create
+ - patch
+ - apiGroups:
+ - ""
+ resources:
+ - pods/eviction
+ verbs:
+ - create
+ - apiGroups:
+ - ""
+ resources:
+ - pods/status
+ verbs:
+ - update
+ - apiGroups:
+ - ""
+ resources:
+ - endpoints
+ resourceNames:
+ - cluster-autoscaler
+ verbs:
+ - get
+ - update
+ - apiGroups:
+ - ""
+ resources:
+ - nodes
+ verbs:
+ - watch
+ - list
+ - get
+ - update
+ - apiGroups:
+ - ""
+ resources:
+ - namespaces
+ - pods
+ - services
+ - replicationcontrollers
+ - persistentvolumeclaims
+ - persistentvolumes
+ verbs:
+ - watch
+ - list
+ - get
+ - apiGroups:
+ - batch
+ resources:
+ - jobs
+ - cronjobs
+ verbs:
+ - watch
+ - list
+ - get
+ - apiGroups:
+ - batch
+ - extensions
+ resources:
+ - jobs
+ verbs:
+ - get
+ - list
+ - patch
+ - watch
+ - apiGroups:
+ - extensions
+ resources:
+ - replicasets
+ - daemonsets
+ verbs:
+ - watch
+ - list
+ - get
+ - apiGroups:
+ - policy
+ resources:
+ - poddisruptionbudgets
+ verbs:
+ - watch
+ - list
+ - apiGroups:
+ - apps
+ resources:
+ - daemonsets
+ - replicasets
+ - statefulsets
+ verbs:
+ - watch
+ - list
+ - get
+ - apiGroups:
+ - storage.k8s.io
+ resources:
+ - storageclasses
+ - csinodes
+ - csidrivers
+ - csistoragecapacities
+ verbs:
+ - watch
+ - list
+ - get
+ - apiGroups:
+ - ""
+ resources:
+ - configmaps
+ verbs:
+ - list
+ - watch
+ - apiGroups:
+ - coordination.k8s.io
+ resources:
+ - leases
+ verbs:
+ - create
+ - apiGroups:
+ - coordination.k8s.io
+ resourceNames:
+ - cluster-autoscaler
+ resources:
+ - leases
+ verbs:
+ - get
+ - update
---
+# Source: cluster-autoscaler/templates/role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
- name: cluster-autoscaler
- namespace: kube-system
labels:
- k8s-addon: cluster-autoscaler.addons.k8s.io
- k8s-app: cluster-autoscaler
+ app.kubernetes.io/instance: "test"
+ app.kubernetes.io/name: "aws-cluster-autoscaler"
+ app.kubernetes.io/managed-by: "Helm"
+ helm.sh/chart: "cluster-autoscaler-9.16.2"
+ name: test-aws-cluster-autoscaler
rules:
- - apiGroups: [""]
- resources: ["configmaps"]
- verbs: ["create","list","watch"]
- - apiGroups: [""]
- resources: ["configmaps"]
- resourceNames: ["cluster-autoscaler-status", "cluster-autoscaler-priority-expander"]
- verbs: ["delete", "get", "update", "watch"]
-
+ - apiGroups:
+ - ""
+ resources:
+ - configmaps
+ verbs:
+ - create
+ - apiGroups:
+ - ""
+ resources:
+ - configmaps
+ resourceNames:
+ - cluster-autoscaler-status
+ verbs:
+ - delete
+ - get
+ - update
---
+# Source: cluster-autoscaler/templates/clusterrolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
- name: cluster-autoscaler
labels:
- k8s-addon: cluster-autoscaler.addons.k8s.io
- k8s-app: cluster-autoscaler
+ app.kubernetes.io/instance: "test"
+ app.kubernetes.io/name: "aws-cluster-autoscaler"
+ app.kubernetes.io/managed-by: "Helm"
+ helm.sh/chart: "cluster-autoscaler-9.16.2"
+ name: test-aws-cluster-autoscaler
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
- name: cluster-autoscaler
+ name: test-aws-cluster-autoscaler
subjects:
- kind: ServiceAccount
- name: cluster-autoscaler
+ name: test-aws-cluster-autoscaler
namespace: kube-system
-
---
+# Source: cluster-autoscaler/templates/rolebinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
- name: cluster-autoscaler
- namespace: kube-system
labels:
- k8s-addon: cluster-autoscaler.addons.k8s.io
- k8s-app: cluster-autoscaler
+ app.kubernetes.io/instance: "test"
+ app.kubernetes.io/name: "aws-cluster-autoscaler"
+ app.kubernetes.io/managed-by: "Helm"
+ helm.sh/chart: "cluster-autoscaler-9.16.2"
+ name: test-aws-cluster-autoscaler
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
- name: cluster-autoscaler
+ name: test-aws-cluster-autoscaler
subjects:
- kind: ServiceAccount
- name: cluster-autoscaler
+ name: test-aws-cluster-autoscaler
namespace: kube-system
-
---
+# Source: cluster-autoscaler/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
- name: cluster-autoscaler
- namespace: kube-system
labels:
- app: cluster-autoscaler
+ app.kubernetes.io/instance: "test"
+ app.kubernetes.io/name: "aws-cluster-autoscaler"
+ app.kubernetes.io/managed-by: "Helm"
+ helm.sh/chart: "cluster-autoscaler-9.16.2"
+ name: test-aws-cluster-autoscaler
spec:
replicas: 1
selector:
matchLabels:
- app: cluster-autoscaler
+ app.kubernetes.io/instance: "test"
+ app.kubernetes.io/name: "aws-cluster-autoscaler"
template:
metadata:
- labels:
- app: cluster-autoscaler
annotations:
- prometheus.io/scrape: 'true'
- prometheus.io/port: '8085'
cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
+ labels:
+ app.kubernetes.io/instance: "test"
+ app.kubernetes.io/name: "aws-cluster-autoscaler"
spec:
- priorityClassName: system-cluster-critical
- securityContext:
- runAsNonRoot: true
- runAsUser: 65534
- fsGroup: 65534
- serviceAccountName: cluster-autoscaler
+ priorityClassName: "system-cluster-critical"
+ dnsPolicy: "ClusterFirst"
containers:
- - image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.0
- name: cluster-autoscaler
+ - name: aws-cluster-autoscaler
+ image: "k8s.gcr.io/autoscaling/cluster-autoscaler:v1.21.1"
+ imagePullPolicy: "IfNotPresent"
+ command:
+ - ./cluster-autoscaler
+ - --cloud-provider=aws
+ - --namespace=kube-system
+ - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/jhubproto2
+ - --balance-similar-node-groups=true
+ - --logtostderr=true
+ - --skip-nodes-with-system-pods=false
+ - --stderrthreshold=info
+ - --v=4
+ env:
+ - name: AWS_REGION
+ value: "MY REGION"
+ livenessProbe:
+ httpGet:
+ path: /health-check
+ port: 8085
+ ports:
+ - containerPort: 8085
resources:
limits:
cpu: 100m
@@ -156,22 +284,51 @@ spec:
requests:
cpu: 100m
memory: 600Mi
- command:
- - ./cluster-autoscaler
- - --v=4
- - --stderrthreshold=info
- - --cloud-provider=aws
- - --skip-nodes-with-local-storage=false
- - --expander=least-waste
- - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/jhubproto2
- - --balance-similar-node-groups
- - --skip-nodes-with-system-pods=false
- volumeMounts:
- - name: ssl-certs
- mountPath: /etc/ssl/certs/ca-certificates.crt #/etc/ssl/certs/ca-bundle.crt for Amazon Linux Worker Nodes
- readOnly: true
- imagePullPolicy: "Always"
- volumes:
- - name: ssl-certs
- hostPath:
- path: "/etc/ssl/certs/ca-bundle.crt"
+ serviceAccountName: test-aws-cluster-autoscaler
+ tolerations:
+ []
+
+NOTES:
+To verify that cluster-autoscaler has started, run:
+
+ kubectl --namespace=kube-system get pods -l "app.kubernetes.io/name=aws-cluster-autoscaler,app.kubernetes.io/instance=test"
+---
+# Source: cluster-autoscaler/templates/service.yaml
+apiVersion: v1
+kind: Service
+metadata:
+ labels:
+ app.kubernetes.io/instance: "test"
+ app.kubernetes.io/name: "aws-cluster-autoscaler"
+ app.kubernetes.io/managed-by: "Helm"
+ helm.sh/chart: "cluster-autoscaler-9.16.2"
+ name: test-aws-cluster-autoscaler
+spec:
+ ports:
+ - port: 8085
+ protocol: TCP
+ targetPort: 8085
+ name: http
+ selector:
+ app.kubernetes.io/instance: "test"
+ app.kubernetes.io/name: "aws-cluster-autoscaler"
+ type: "ClusterIP"
+---
+# Source: cluster-autoscaler/templates/pdb.yaml
+apiVersion: policy/v1beta1
+kind: PodDisruptionBudget
+metadata:
+ labels:
+ app.kubernetes.io/instance: "test"
+ app.kubernetes.io/name: "aws-cluster-autoscaler"
+ app.kubernetes.io/managed-by: "Helm"
+ helm.sh/chart: "cluster-autoscaler-9.16.2"
+ name: test-aws-cluster-autoscaler
+spec:
+ selector:
+ matchLabels:
+ app.kubernetes.io/instance: "test"
+ app.kubernetes.io/name: "aws-cluster-autoscaler"
+
+ maxUnavailable: 1
+---
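For anyone wanting to reproduce a comparison like the one above: the NAME/STATUS header in the rendered output is what helm prints for a dry run, so something along these lines should work (file and release names as used in this diff):

# render the chart without installing anything
helm install test autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --values values.yaml \
  --dry-run > autoscaler_dry_run.yaml

# unified diff against the AWS-provided manifest
diff -u cluster-autoscaler-autodiscover.yaml autoscaler_dry_run.yaml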
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
I think I am having the exact same issue. Shame to see that you had to give up and not use the helm chart. If anyone has an idea how to fix this, that would be amazing. I am meanwhile going to try some troubleshooting of my own.
Not sure if this would fix your issue, but I actually found a way to make it work:
use eksctl create iamserviceaccount with SERVICE_ACCOUNT_NAME to create the service account from the policy (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#iam-policy), then pass
--set rbac.serviceAccount.create=false \
--set rbac.serviceAccount.name=SERVICE_ACCOUNT_NAME
to the helm install command, instead of rbac.serviceAccount.annotations."eks\.amazonaws\.com/role-arn".
This worked for me.
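A sketch of that workaround (cluster name, policy ARN, region, and release name are placeholders; the eksctl command creates the IAM role from the linked policy and a service account already annotated with it):

eksctl create iamserviceaccount \
  --cluster=MY_CLUSTER \
  --namespace=kube-system \
  --name=SERVICE_ACCOUNT_NAME \
  --attach-policy-arn=arn:aws:iam::ACCOUNT_ID:policy/MY_AUTOSCALER_POLICY \
  --approve

# point the chart at the pre-created service account instead of letting it create one
helm install test autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=MY_CLUSTER \
  --set awsRegion=MY_REGION \
  --set rbac.serviceAccount.create=false \
  --set rbac.serviceAccount.name=SERVICE_ACCOUNT_NAME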
Thanks @InbarRose. I'll keep this in mind next time I loop back on this.
However, it's important to mention that, last I checked, the AWS docs recommend using their own config file rather than the helm chart (as mentioned in the original post; link to docs: https://docs.aws.amazon.com/eks/latest/userguide/autoscaling.html).
Relying on the AWS config is probably a more reliable approach until the cause of this issue is understood. Your suggestion might reveal that, but I don't know.
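For completeness, the AWS-documented route boils down to applying their manifest directly, roughly like this (the raw URL points at the examples directory of this repo at the time of writing; the role-arn annotation on the ServiceAccount has to be added as shown in the diff above):

# fetch the full AWS-recommended manifest from this repo
curl -o cluster-autoscaler-autodiscover.yaml \
  https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml

# edit the file: set the cluster name in --node-group-auto-discovery and add the
# eks.amazonaws.com/role-arn annotation to the ServiceAccount, then apply it
kubectl apply -f cluster-autoscaler-autodiscover.yaml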
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to this: /close not-planned
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.