
Cluster-autoscaler panics on leader election

dfroberg opened this issue 1 year ago • 9 comments

I have the helm chart installed via ArgoCD in EKS v1.26, with an IRSA role created in Terraform; it panics on leader election, regardless of whether the replica count is one or more. The image tag is set to "v${local.cluster_version}.0".
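
A rough way to confirm what that templated tag actually resolves to on the running workload (namespace and naming here are assumptions; adjust for your release):

$ # list every container image in kube-system and pick out the autoscaler
$ kubectl -n kube-system get pods \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].image}{"\n"}{end}' \
    | grep cluster-autoscaler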

Which component are you using?:

cluster-autoscaler

What version of the component are you using?:

Helm chart 9.29.0

Component version:

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Client Version: v1.27.1
Kustomize Version: v5.0.1
Server Version: v1.26.4-eks-0a21954

What environment is this in?:

EKS

What did you expect to happen?:

I expected the autoscaler not to crash.

What happened instead?:

It crashed with the error noted below.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

I0531 13:35:25.717322       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0531 13:35:25.717342       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 7.751µs
I0531 13:35:55.717894       1 static_autoscaler.go:276] Starting main loop
W0531 13:35:55.718957       1 clusterstate.go:428] AcceptableRanges have not been populated yet. Skip checking
I0531 13:35:55.719061       1 filter_out_schedulable.go:63] Filtering out schedulables
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x3cfca79]

goroutine 95 [running]:
k8s.io/autoscaler/cluster-autoscaler/simulator/scheduling.(*HintingSimulator).findNode(0xc000e38df8, 0x5a2fdc0?, {0x5a2fdc0, 0xc000012470}, 0xc000c99900, 0x0?)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/simulator/scheduling/hinting_simulator.go:114 +0x159
k8s.io/autoscaler/cluster-autoscaler/simulator/scheduling.(*HintingSimulator).TrySchedulePods(0x40a5c00?, {0x5a2fdc0, 0xc000012470}, {0xc003da83f8, 0x1, 0x4f2fca0?}, 0xc0008960d0?, 0x0)
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/simulator/scheduling/hinting_simulator.go:70 +0x2cf
k8s.io/autoscaler/cluster-autoscaler/core/podlistprocessor.(*filterOutSchedulablePodListProcessor).filterOutSchedulableByPacking(0xc000012510, {0xc003da83f8, 0x1, 0x1}, {0x5a2fdc0, 0xc000012470})
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/podlistprocessor/filter_out_schedulable.go:101 +0x111
k8s.io/autoscaler/cluster-autoscaler/core/podlistprocessor.(*filterOutSchedulablePodListProcessor).Process(0x0?, 0xc001dc7400, {0xc003da83f8?, 0x1, 0x1})
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/podlistprocessor/filter_out_schedulable.go:66 +0xd5
k8s.io/autoscaler/cluster-autoscaler/core/podlistprocessor.(*defaultPodListProcessor).Process(0xc000e562f0, 0xc0027849c0?, {0xc003da83f8?, 0x4?, 0x4?})
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/podlistprocessor/pod_list_processor.go:45 +0x65
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).RunOnce(0xc001485d60, {0x4?, 0xc000136710?, 0x87e1e20?})
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:477 +0x1ca3
main.run(0xc000c8a500?, {0x5a277b8, 0xc000c204b0})
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:442 +0x2cd
main.main.func2({0xe?, 0x0?})
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:529 +0x25
created by k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
        /gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:211 +0x11b

dfroberg avatar May 31 '23 13:05 dfroberg

Same here. This issue was addressed in 1.26.1:

https://github.com/kubernetes/autoscaler/issues/5389

nidalhaddad avatar May 31 '23 14:05 nidalhaddad

I'm seeing this exact same issue in EKS v1.26.4-eks-0a21954, and I'm running v1.26.2

[Edit]: Never mind, no I'm not. helm list -A reports an app version of 1.26.2, but inspecting the pod and its logs shows it is still running 1.26.0. Not sure why there's a discrepancy, but that's a different issue. Helm chart 9.28.0, for what it's worth.
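
In case it helps anyone else, the discrepancy is easy to spot by comparing what helm reports against what is actually deployed (release and namespace names will vary per install):

$ helm list -A                                              # chart and app version helm believes it installed
$ kubectl get deploy -A -o wide | grep cluster-autoscaler   # the IMAGES column shows the tag that is really running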

ZacharyTG avatar Jun 09 '23 01:06 ZacharyTG

Seeing the same issue on v1.26.4-eks-0a21954 running v1.27.1

Kampe avatar Jun 13 '23 05:06 Kampe

We tested a basic scale-up/down scenario for 1.27.1 on GCE; this shouldn't happen there. I suspect it's the same issue that @ZacharyTG mentions - helm says that it installed CA 1.27.1, but it actually installed 1.26.0 - which definitely has the issue.
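
One way to narrow this down is to compare the chart's declared appVersion and its default image.tag against what the cluster is actually running; a sketch (repo alias and chart version are examples):

$ helm repo add autoscaler https://kubernetes.github.io/autoscaler
$ helm show chart autoscaler/cluster-autoscaler --version 9.29.0 | grep appVersion
$ helm show values autoscaler/cluster-autoscaler --version 9.29.0 | grep -A 3 'image:'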

I'm not familiar with the helm charts at all, @gjtempleton do you have an idea why this could happen?

towca avatar Jun 13 '23 10:06 towca

Have you got an example of the (suitably redacted) values you're providing to the chart, as well as the output from the helm installation?
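
For reference, something along these lines is what a typical EKS/IRSA install of the chart looks like; the value names come from the chart's defaults, everything else is a placeholder:

$ helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler \
    --namespace kube-system \
    --set cloudProvider=aws \
    --set awsRegion=eu-west-1 \
    --set autoDiscovery.clusterName=my-cluster \
    --set image.tag=v1.26.2 \
    --set "rbac.serviceAccount.annotations.eks\.amazonaws\.com/role-arn=arn:aws:iam::111122223333:role/cluster-autoscaler"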

gjtempleton avatar Jun 13 '23 10:06 gjtempleton

Am I the only one still stuck in CrashLoopBackOff here?

darthale avatar Aug 07 '23 14:08 darthale

I am on cluster-autoscaler helm chart version 9.29.4, which deploys autoscaler version 1.27.2, and I am still seeing the issue while trying to scale up nodes in AWS EKS. Interestingly, when I checked the image of the pod it was using registry.k8s.io/autoscaling/cluster-autoscaler:v1.26.0, so this could be the root cause. Checking further; will update if I'm able to fix this.
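
If the root cause really is the image tag falling back to an older default, explicitly pinning it should work around the crash; a sketch of what I'm trying (release name, namespace and tag are assumptions for my setup):

$ helm upgrade cluster-autoscaler autoscaler/cluster-autoscaler \
    --namespace kube-system \
    --reuse-values \
    --set image.tag=v1.27.2
$ kubectl -n kube-system get deploy -o wide | grep cluster-autoscaler   # confirm the new tag actually rolled out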

adiospeds avatar Nov 01 '23 17:11 adiospeds

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 31 '24 15:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Mar 01 '24 16:03 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Mar 31 '24 17:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 31 '24 17:03 k8s-ci-robot