Cluster-autoscaler panics on leader election
I have the helm chart installed via ArgoCD on EKS v1.26, with an IRSA role created in Terraform. It panics on leader election regardless of whether the replica count is one or more than one.
The image tag is set to "v${local.cluster_version}.0".
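For context, interpolating the tag from the cluster version this way resolves to v1.26.0, the release that (per the linked issue further down) still contains the nil-pointer bug fixed in 1.26.1. A minimal sketch of pinning a patched tag explicitly, assuming a plain helm install of the upstream chart (release name, namespace, and tag are illustrative, not the reporter's actual configuration):

```sh
# Sketch only: pin image.tag to a patched release instead of deriving it
# from the cluster version. Names and tag are illustrative.
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm upgrade --install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set cloudProvider=aws \
  --set image.tag=v1.26.2
```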
Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
Helm chart 9.29.0
Component version:
What k8s version are you using (kubectl version)?:
$ kubectl version
Client Version: v1.27.1
Kustomize Version: v5.0.1
Server Version: v1.26.4-eks-0a21954
What environment is this in?:
EKS
What did you expect to happen?:
I expected the autoscaler not to crash.
What happened instead?:
It crashed with the error noted below.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
I0531 13:35:25.717322 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I0531 13:35:25.717342 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 7.751µs
I0531 13:35:55.717894 1 static_autoscaler.go:276] Starting main loop
W0531 13:35:55.718957 1 clusterstate.go:428] AcceptableRanges have not been populated yet. Skip checking
I0531 13:35:55.719061 1 filter_out_schedulable.go:63] Filtering out schedulables
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x3cfca79]
goroutine 95 [running]:
k8s.io/autoscaler/cluster-autoscaler/simulator/scheduling.(*HintingSimulator).findNode(0xc000e38df8, 0x5a2fdc0?, {0x5a2fdc0, 0xc000012470}, 0xc000c99900, 0x0?)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/simulator/scheduling/hinting_simulator.go:114 +0x159
k8s.io/autoscaler/cluster-autoscaler/simulator/scheduling.(*HintingSimulator).TrySchedulePods(0x40a5c00?, {0x5a2fdc0, 0xc000012470}, {0xc003da83f8, 0x1, 0x4f2fca0?}, 0xc0008960d0?, 0x0)
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/simulator/scheduling/hinting_simulator.go:70 +0x2cf
k8s.io/autoscaler/cluster-autoscaler/core/podlistprocessor.(*filterOutSchedulablePodListProcessor).filterOutSchedulableByPacking(0xc000012510, {0xc003da83f8, 0x1, 0x1}, {0x5a2fdc0, 0xc000012470})
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/podlistprocessor/filter_out_schedulable.go:101 +0x111
k8s.io/autoscaler/cluster-autoscaler/core/podlistprocessor.(*filterOutSchedulablePodListProcessor).Process(0x0?, 0xc001dc7400, {0xc003da83f8?, 0x1, 0x1})
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/podlistprocessor/filter_out_schedulable.go:66 +0xd5
k8s.io/autoscaler/cluster-autoscaler/core/podlistprocessor.(*defaultPodListProcessor).Process(0xc000e562f0, 0xc0027849c0?, {0xc003da83f8?, 0x4?, 0x4?})
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/podlistprocessor/pod_list_processor.go:45 +0x65
k8s.io/autoscaler/cluster-autoscaler/core.(*StaticAutoscaler).RunOnce(0xc001485d60, {0x4?, 0xc000136710?, 0x87e1e20?})
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/core/static_autoscaler.go:477 +0x1ca3
main.run(0xc000c8a500?, {0x5a277b8, 0xc000c204b0})
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:442 +0x2cd
main.main.func2({0xe?, 0x0?})
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/main.go:529 +0x25
created by k8s.io/client-go/tools/leaderelection.(*LeaderElector).Run
/gopath/src/k8s.io/autoscaler/cluster-autoscaler/vendor/k8s.io/client-go/tools/leaderelection/leaderelection.go:211 +0x11b
Same here. This issue was addressed in 1.26.1:
https://github.com/kubernetes/autoscaler/issues/5389
I'm seeing this exact same issue on EKS v1.26.4-eks-0a21954, and I'm running v1.26.2.
[Edit]: Never mind, no I'm not. helm list -A reports an app version of 1.26.2, but on inspecting the pod and its logs it is in fact still running 1.26.0. Not sure why there's a discrepancy, but that's a different issue. Helm chart 9.28.0, for what it's worth.
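One way to check the discrepancy described here is to compare what helm reports with the image the pods are actually running. A sketch, assuming the release is named cluster-autoscaler and lives in kube-system (the label selector matches the chart's default labels and may need adjusting for your install):

```sh
# What helm believes is installed (chart version and app version)
helm list -A | grep cluster-autoscaler

# What the pods are actually running; app.kubernetes.io/instance equals the
# release name, assumed to be "cluster-autoscaler" here
kubectl -n kube-system get pods \
  -l app.kubernetes.io/instance=cluster-autoscaler \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}'
```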
Seeing the same issue on v1.26.4-eks-0a21954, running v1.27.1.
We tested a basic scale-up/down scenario for 1.27.1 on GCE, so this shouldn't happen there. I suspect it's the same issue that @ZacharyTG mentions: helm says that it installed CA 1.27.1, but it actually installed 1.26.0, which definitely has the issue.
I'm not familiar with the helm charts at all. @gjtempleton, do you have an idea why this could happen?
Have you got some example (suitably redacted) values you're providing to the chart, as well as the output from the helm installation?
Am I the only one still stuck in CrashLoopBackOff here?
I am on cluster-autoscaler helm chart version 9.29.4, which deploys autoscaler version 1.27.2, and am still seeing the issue while trying to scale up nodes in AWS EKS. Interestingly, when I checked the image of the pod it was using registry.k8s.io/autoscaling/cluster-autoscaler:v1.26.0, so this could be the root cause. Checking further; I will update if I'm able to fix this.
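If the chart's default does resolve to the broken v1.26.0 image, explicitly overriding image.tag forces a known-good release regardless of the chart's appVersion. A sketch under the same naming assumptions as above (release cluster-autoscaler in kube-system; the deployment name follows the chart's <release>-aws-cluster-autoscaler pattern and may differ in your install):

```sh
# Keep the existing values and only override the image tag (tag is illustrative)
helm upgrade cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --reuse-values \
  --set image.tag=v1.27.2

# Confirm the rollout picked up the new image
kubectl -n kube-system rollout status \
  deployment/cluster-autoscaler-aws-cluster-autoscaler
```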
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Reopen this issue with /reopen
- Mark this issue as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
In response to the triage message above, which ended with /close not-planned.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.