
Cluster Autoscaler pod fails with error "MissingRegion"

sivachandran-s opened this issue Oct 14 '24

Component version: cluster-autoscaler 1.30

What k8s version are you using (kubectl version)?:

$ kubectl version
Client Version: v1.30.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.30.4-eks-a737599

What environment is this in?: Test

What did you expect to happen?:

The cluster-autoscaler pod to start and build the AWS cloud provider without errors.

What happened instead?:

I1014 17:54:22.064047 1 main.go:644] Cluster Autoscaler 1.30.0
I1014 17:54:22.155804 1 leaderelection.go:250] attempting to acquire leader lease kube-system/cluster-autoscaler...
I1014 17:54:22.168685 1 leaderelection.go:260] successfully acquired lease kube-system/cluster-autoscaler
I1014 17:54:22.169026 1 event_sink_logging_wrapper.go:48] Event(v1.ObjectReference{Kind:"Lease", Namespace:"kube-system", Name:"cluster-autoscaler", UID:"b82b9120-4fc3-4bc2-8b92-21daf9dd151f", APIVersion:"coordination.k8s.io/v1", ResourceVersion:"19453", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' cluster-autoscaler-7c5484cd44-59xj8 became leader
I1014 17:54:22.251583 1 framework.go:373] "the scheduler starts to work with those plugins" Plugins={"PreEnqueue":{"Enabled":[{"Name":"SchedulingGates","Weight":0}],"Disabled":null},"QueueSort":{"Enabled":[{"Name":"PrioritySort","Weight":0}],"Disabled":null},"PreFilter":{"Enabled":[{"Name":"NodeAffinity","Weight":0},{"Name":"NodePorts","Weight":0},{"Name":"NodeResourcesFit","Weight":0},{"Name":"VolumeRestrictions","Weight":0},{"Name":"EBSLimits","Weight":0},{"Name":"GCEPDLimits","Weight":0},{"Name":"NodeVolumeLimits","Weight":0},{"Name":"AzureDiskLimits","Weight":0},{"Name":"VolumeBinding","Weight":0},{"Name":"VolumeZone","Weight":0},{"Name":"PodTopologySpread","Weight":0},{"Name":"InterPodAffinity","Weight":0}],"Disabled":null},"Filter":{"Enabled":[{"Name":"NodeUnschedulable","Weight":0},{"Name":"NodeName","Weight":0},{"Name":"TaintToleration","Weight":0},{"Name":"NodeAffinity","Weight":0},{"Name":"NodePorts","Weight":0},{"Name":"NodeResourcesFit","Weight":0},{"Name":"VolumeRestrictions","Weight":0},{"Name":"EBSLimits","Weight":0},{"Name":"GCEPDLimits","Weight":0},{"Name":"NodeVolumeLimits","Weight":0},{"Name":"AzureDiskLimits","Weight":0},{"Name":"VolumeBinding","Weight":0},{"Name":"VolumeZone","Weight":0},{"Name":"PodTopologySpread","Weight":0},{"Name":"InterPodAffinity","Weight":0}],"Disabled":null},"PostFilter":{"Enabled":[{"Name":"DefaultPreemption","Weight":0}],"Disabled":null},"PreScore":{"Enabled":[{"Name":"TaintToleration","Weight":0},{"Name":"NodeAffinity","Weight":0},{"Name":"NodeResourcesFit","Weight":0},{"Name":"VolumeBinding","Weight":0},{"Name":"PodTopologySpread","Weight":0},{"Name":"InterPodAffinity","Weight":0},{"Name":"NodeResourcesBalancedAllocation","Weight":0}],"Disabled":null},"Score":{"Enabled":[{"Name":"TaintToleration","Weight":3},{"Name":"NodeAffinity","Weight":2},{"Name":"NodeResourcesFit","Weight":1},{"Name":"VolumeBinding","Weight":1},{"Name":"PodTopologySpread","Weight":2},{"Name":"InterPodAffinity","Weight":2},{"Name":"NodeResourcesBalancedAllocation","Weight":1},{"Name":"ImageLocality","Weight":1}],"Disabled":null},"Reserve":{"Enabled":[{"Name":"VolumeBinding","Weight":0}],"Disabled":null},"Permit":{"Enabled":null,"Disabled":null},"PreBind":{"Enabled":[{"Name":"VolumeBinding","Weight":0}],"Disabled":null},"Bind":{"Enabled":[{"Name":"DefaultBinder","Weight":0}],"Disabled":null},"PostBind":{"Enabled":null,"Disabled":null},"MultiPoint":{"Enabled":null,"Disabled":null}}
I1014 17:54:22.265983 1 cloud_provider_builder.go:30] Building aws cloud provider.
E1014 17:54:25.405156 1 aws_cloud_provider.go:433] Failed to generate AWS EC2 Instance Types: MissingRegion: could not find region configuration, falling back to static list with last update time: 2024-04-08

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?: The cluster-autoscaler pod is crashing in our setup with the MissingRegion error. How do we solve it?

Also, when I deploy EKS 1.29 with cluster-autoscaler 1.29 I don't see any issue, and even when I perform an EKS upgrade from 1.29 to 1.30 I don't see this issue either. Only on a fresh install of EKS 1.30 with cluster-autoscaler 1.30 do I get the reported issue.


sivachandran-s avatar Oct 14 '24 17:10 sivachandran-s

/area cluster-autoscaler

adrianmoisey avatar Oct 14 '24 19:10 adrianmoisey

I started experiencing this issue as well. I added this environment variable block:

env {
  name  = "AWS_REGION"
  value = "eu-west-1"
}

but now I am getting this error:

I1016 10:13:31.629902 1 auto_scaling_groups.go:360] Regenerating instance to ASG map for ASG names: []
I1016 10:13:31.629918 1 auto_scaling_groups.go:367] Regenerating instance to ASG map for ASG tags: map[k8s.io/cluster-autoscaler/enabled: k8s.io/cluster-autoscaler/fcmb-stg-tco0001-cluster:]
I1016 10:13:34.932762 1 trace.go:219] Trace[774965466]: "Reflector ListAndWatch" name:k8s.io/autoscaler/cluster-autoscaler/utils/kubernetes/listers.go:212 (16-Oct-2024 10:13:24.533) (total time: 10398ms):
Trace[774965466]: ---"Objects listed" error: 10398ms (10:13:34.932)
Trace[774965466]: ---"Resource version extracted" 0ms (10:13:34.932)
Trace[774965466]: ---"Objects extracted" 0ms (10:13:34.932)
Trace[774965466]: ---"SyncWith done" 0ms (10:13:34.932)
Trace[774965466]: ---"Resource version updated" 0ms (10:13:34.932)
Trace[774965466]: [10.398963846s] [10.398963846s] END
I1016 10:13:35.129666 1 trace.go:219] Trace[1852186258]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:150 (16-Oct-2024 10:13:24.530) (total time: 10598ms):
Trace[1852186258]: ---"Objects listed" error: 10507ms (10:13:35.038)
Trace[1852186258]: ---"Resource version extracted" 0ms (10:13:35.038)
Trace[1852186258]: ---"Objects extracted" 90ms (10:13:35.128)
Trace[1852186258]: ---"SyncWith done" 0ms (10:13:35.129)
Trace[1852186258]: ---"Resource version updated" 0ms (10:13:35.129)
Trace[1852186258]: [10.598647288s] [10.598647288s] END
E1016 10:13:36.828775 1 aws_manager.go:125] Failed to regenerate ASG cache: NoCredentialProviders: no valid providers in chain. Deprecated. For verbose messaging see aws.Config.CredentialsChainVerboseErrors
F1016 10:13:36.828823 1 aws_cloud_provider.go:419] Failed to create AWS Manager: NoCredentialProviders: no valid providers in chain. Deprecated. For verbose messaging see aws.Config.CredentialsChainVerboseErrors
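
Setting AWS_REGION only fixes the region lookup; the NoCredentialProviders failure above means the pod still cannot obtain AWS credentials at all, typically because either the node's IMDSv2 hop limit of 1 blocks containers (see the IMDSv2 note further down in this thread) or no IRSA/Pod Identity role is attached to the service account. A minimal diagnostic sketch, assuming a shell with curl available inside a pod on the affected node:

```sh
# Request an IMDSv2 token; with a node hop limit of 1 this call hangs or fails
# when run from inside a container.
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")

# List the instance-profile credentials the node exposes; an empty or failed
# response here matches the NoCredentialProviders error in the log above.
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
```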

layor2257 avatar Oct 16 '24 10:10 layor2257

One workaround which I followed is updating the EKS AMI type to "AL2_x86_64" instead of using the default type.
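
For anyone reproducing that workaround, a sketch of creating a managed node group pinned to the AL2 AMI family with the AWS CLI; every name and ARN below is a placeholder, and as later comments note this stops being an option once you need a Kubernetes version that only ships AL2023 AMIs:

```sh
# Hypothetical values; substitute your own cluster, role and subnets.
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name al2-workers \
  --ami-type AL2_x86_64 \
  --node-role arn:aws:iam::<account-id>:role/<node-instance-role> \
  --subnets subnet-aaa111 subnet-bbb222
```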

sivachandran-s avatar Oct 16 '24 13:10 sivachandran-s

I also got this error (the same logs, also with the AWS_REGION env). The error does not appear on upgraded clusters (I tried upgrading from 1.29 in series), only on new clusters.

EvgeniiIakubov avatar Nov 25 '24 18:11 EvgeniiIakubov

@sivachandran-s the workaround you followed, updating the EKS AMI type to "AL2_x86_64" instead of using the default type, is working fine; it actually fixed that issue. But what is the permanent fix for this? If you or anybody else knows, please let us know here.

venu-ibex-9 avatar Feb 05 '25 09:02 venu-ibex-9

Unfortunately, Kubernetes v1.33 will soon be required in AWS, which will force us to use the AL2023-x86_64 images, which are already applied by default to new clusters. When this happens, the autoscaler will be broken without an available workaround.

I just ran into this myself, has a more permanent fix been found?

cwardcode avatar Apr 25 '25 20:04 cwardcode

Any update on this? With only 4 months left, people should already have started migrating to AL2023 due to the EKS AMI end-of-support.

revawiki avatar May 16 '25 10:05 revawiki

I've found a workaround.

  • Install the Amazon EKS Pod Identity Agent to the cluster
  • Assign the required IAM policy to a new IAM role, where you specify the cluster-autoscaler Service Account in the trust-relationship.
  • Annotate the cluster-autoscaler Service Account with the new role ARN.

It's a few extra steps, but at least it's working with AL2023 images. I followed these docs: CA_with_AWS_IAM_OIDC and IAM roles for service accounts
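
A minimal sketch of the Service Account step, assuming the standard cluster-autoscaler deployment in kube-system and a placeholder role ARN; the pod-identity association is shown as the alternative to the annotation:

```sh
# IRSA-style: annotate the Service Account with the role ARN, then restart the deployment.
kubectl -n kube-system annotate serviceaccount cluster-autoscaler \
  eks.amazonaws.com/role-arn=arn:aws:iam::<account-id>:role/<cluster-autoscaler-role> --overwrite
kubectl -n kube-system rollout restart deployment/cluster-autoscaler

# EKS Pod Identity alternative: associate the role with the Service Account instead.
aws eks create-pod-identity-association \
  --cluster-name <cluster-name> \
  --namespace kube-system \
  --service-account cluster-autoscaler \
  --role-arn arn:aws:iam::<account-id>:role/<cluster-autoscaler-role>
```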

Extra info about the issue: https://github.com/awslabs/amazon-eks-ami/issues/1696

For IMDSv2, the default hop count for managed node groups is set to 1. This means that containers won't have access to the node's credentials using IMDS. If you require container access to the node's credentials, you can still do so by manually overriding the HttpPutResponseHopLimit in a custom EC2 launch template, increasing it to 2, or by using EKS Pod Identity.
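
A sketch of raising the hop limit on an already-running node with the AWS CLI (the instance ID is a placeholder); for a managed node group the durable fix is the launch-template change shown in a later comment:

```sh
# One-off fix for an existing node; it is lost when the node is replaced,
# so bake the same setting into the node group's launch template as well.
aws ec2 modify-instance-metadata-options \
  --instance-id i-0123456789abcdef0 \
  --http-tokens required \
  --http-put-response-hop-limit 2
```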

volford-bence avatar May 19 '25 11:05 volford-bence

@adrianmoisey (since you tagged this originally) - Are there any updates here that do not require creating new IAM policies or roles? Time's quickly running out, so having an official response would be greatly appreciated

cwardcode avatar Jul 10 '25 16:07 cwardcode

> @adrianmoisey (since you tagged this originally) - Are there any updates here that do not require creating new IAM policies or roles? Time's quickly running out, so having an official response would be greatly appreciated

I don't work on the cluster-autoscaler, so I can't help.

adrianmoisey avatar Jul 10 '25 16:07 adrianmoisey

Another workaround is to enable IMDSv1 via the launch template of the node group. I also increased the hop limit to 3, but I'm not sure if that's needed.
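
A sketch of that launch-template change, assuming the node group already uses a custom launch template (IDs are placeholders); HttpTokens "optional" re-enables IMDSv1, and the hop limit applies either way:

```sh
# Add a new launch template version with relaxed metadata options, then roll the
# node group onto it (e.g. via aws eks update-nodegroup-version) so new nodes pick it up.
aws ec2 create-launch-template-version \
  --launch-template-id lt-0123456789abcdef0 \
  --source-version '$Latest' \
  --launch-template-data '{"MetadataOptions":{"HttpTokens":"optional","HttpPutResponseHopLimit":3}}'
```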

joelngwt avatar Aug 11 '25 04:08 joelngwt

This is an AWS provider issue, so the right people to tag are @gjtempleton and @drmorr0, who are currently in the OWNERS file.

Ref: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/OWNERS

jackfrancis avatar Aug 11 '25 16:08 jackfrancis

Any updates on this one? EKS 1.33 doesn't work, even when updating to registry.k8s.io/autoscaling/cluster-autoscaler:v1.33.0

Failed to regenerate ASG cache: MissingRegion: could not find region configuration
Failed to create AWS Manager: MissingRegion: could not find region configuration

keraudins avatar Sep 04 '25 13:09 keraudins

I am also getting this issue using cluster-autoscaler:v1.33.0 with a node pool using the Amazon Linux 2023.7.20250609 image. Any updates?

cwh-hcl avatar Sep 29 '25 15:09 cwh-hcl

We are forced to upgrade our EKS cluster to a newer version to avoid the extended support pricing, and we no longer have the option to use the older AL2_x86_64 AMI. We only have the AL2023-x86_64 option, but this breaks our cluster-autoscaler. It has been a year since this ticket was opened and there is still no fixed version.

I'm experiencing the same issue with cluster-autoscaler 1.34.1. Does anyone have a workaround to get the cluster-autoscaler running?

Thanks

tritu-cisco avatar Oct 17 '25 06:10 tritu-cisco

> We are forced to upgrade our EKS cluster to a newer version to avoid the extended support pricing, and we no longer have the option to use the older AL2_x86_64 AMI. We only have the AL2023-x86_64 option, but this breaks our cluster-autoscaler. It has been a year since this ticket was opened and there is still no fixed version.
>
> I'm experiencing the same issue with cluster-autoscaler 1.34.1. Does anyone have a workaround to get the cluster-autoscaler running?
>
> Thanks

I was able to get cluster autoscaling working again on the newer AL2023 nodes by enabling IRSA on our EKS cluster: creating an IAM OIDC Identity Provider for the cluster, then creating an IAM policy for the cluster autoscaler and an IAM role that uses this new policy with the correct trust policy. Once the role was created, I annotated the "cluster-autoscaler" service account to use the new role, restarted the k8s deployment, and voilà! (A condensed eksctl sketch follows the links below.)

Hopefully this helps.

Helpful documentation of this process:

  • https://docs.aws.amazon.com/eks/latest/best-practices/cas.html
  • https://builder.aws.com/content/2a9qUKMTGUM6DkFdi0dNwtQnAke/cluster-autoscaler-configure-on-aws-eks-124
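
A condensed sketch of that setup with eksctl, assuming placeholder cluster, account, and policy names; eksctl creates the role with the right OIDC trust policy and annotates the Service Account in one step:

```sh
# Create/associate the IAM OIDC provider for the cluster (no-op if it already exists).
eksctl utils associate-iam-oidc-provider --cluster <cluster-name> --approve

# Create the IAM role and annotate the kube-system/cluster-autoscaler
# Service Account with its ARN.
eksctl create iamserviceaccount \
  --cluster <cluster-name> \
  --namespace kube-system \
  --name cluster-autoscaler \
  --attach-policy-arn arn:aws:iam::<account-id>:policy/<cluster-autoscaler-policy> \
  --override-existing-serviceaccounts \
  --approve

# Restart so the pod picks up the projected web-identity token.
kubectl -n kube-system rollout restart deployment/cluster-autoscaler
```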

cwh-hcl avatar Oct 20 '25 13:10 cwh-hcl

Thank you @cwh-hcl, feel free to re-open (anyone on this thread) if there are more open issues here.

jackfrancis avatar Oct 20 '25 22:10 jackfrancis

Thanks for the tips and links @cwh-hcl

We got the cluster up and running with the Amazon Linux 2023 (x86_64) Standard AMI now.

I just followed the link https://builder.aws.com/content/2a9qUKMTGUM6DkFdi0dNwtQnAke/cluster-autoscaler-configure-on-aws-eks-124, created a new role named EKS_Autoscaler, and used it on the cluster-autoscaler-autodiscover deployment.

tritu$ diff cluster-autoscaler-autodiscover-CALICO-PRD-EKS.yaml cluster-autoscaler-autodiscover-CALICO-PRD-EKS-new.yaml
7a8,9
>   annotations:
>     eks.amazonaws.com/role-arn: arn:aws:iam::<my_id>:role/EKS_Autoscaler
tritu$

Something must have changed in the Amazon Linux 2023 (x86_64) Standard AMI such that it now needs a ServiceAccount role.

The same deployment works fine for the cluster-autoscaler when deployed on the Bottlerocket (BOTTLEROCKET_x86_64) AMI, but we can't use BOTTLEROCKET_x86_64 because of this bug: https://github.com/bottlerocket-os/bottlerocket/issues/4022.

Happy that we got it working again on the Amazon Linux 2023 (x86_64) AMI now. Thanks

tritu-cisco avatar Oct 21 '25 04:10 tritu-cisco