Not autoscaled node groups are treated as deleted
Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
Component version: running from current HEAD
What k8s version are you using (kubectl version)?: 1.24
What environment is this in?: GKE
What did you expect to happen?: CA should treat upcoming nodes in non-autoscaled node groups as upcoming, not deleted, which would prevent unnecessary scale up.
What happened instead?: CA triggers a scale up even though there are nodes in non-autoscaled node groups that could run the pods.
How to reproduce it (as minimally and precisely as possible): In GKE, create a cluster whose default node pool is not autoscaled and enable NAP (node auto-provisioning). Observe NAP creating a new node pool. This does not reproduce every time: if the scheduler manages to schedule all the pods before CA kicks in, no scale up is triggered.
Anything else we need to know?: The way deleted nodes are detected changed in https://github.com/kubernetes/autoscaler/pull/4896, we should probably roll it back and figure this problem out before reapplying the change.
/cc @fookenc @MaciekPytel
Chatted with @MaciekPytel about this. It should be sufficient to detect nodes in not autoscaled node groups (and stop marking them as deleted) the same way scale down does it, by checking whether NodeGroupForNode(node) is nil: https://github.com/kubernetes/autoscaler/blob/5745044ddf4378c1aae16c47fb173b1abf709c8e/cluster-autoscaler/core/scaledown/legacy/legacy.go#L313-L322
Perhaps worth wrapping this check into a function and using it in both places.
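To make the suggestion concrete, here is a minimal sketch of what such a shared helper could look like. It is not the actual autoscaler code: Node, NodeGroup, NodeGroupLookup, fakeProvider and hasNodeGroup are simplified stand-ins invented for illustration, and only the NodeGroupForNode name comes from the discussion above. The helper also guards against a typed nil pointer hidden inside the interface value, which a plain == nil comparison would miss.

```go
package main

import (
	"fmt"
	"reflect"
)

// Node is a simplified stand-in for the Kubernetes Node object (assumption).
type Node struct {
	Name string
}

// NodeGroup is a simplified stand-in for the autoscaler's node group type.
type NodeGroup interface {
	Id() string
}

// NodeGroupLookup covers only the lookup discussed in this issue.
type NodeGroupLookup interface {
	NodeGroupForNode(node *Node) (NodeGroup, error)
}

// hasNodeGroup is a hypothetical helper: it reports whether the provider
// maps the node to a node group that the autoscaler manages.
func hasNodeGroup(p NodeGroupLookup, node *Node) (bool, error) {
	ng, err := p.NodeGroupForNode(node)
	if err != nil {
		return false, err
	}
	if ng == nil {
		return false, nil
	}
	// An interface holding a nil pointer is not equal to nil, so check
	// the underlying pointer as well.
	if v := reflect.ValueOf(ng); v.Kind() == reflect.Ptr && v.IsNil() {
		return false, nil
	}
	return true, nil
}

// fakeGroup and fakeProvider exist only to exercise the helper.
type fakeGroup struct{ id string }

func (g *fakeGroup) Id() string { return g.id }

type fakeProvider struct {
	groups map[string]*fakeGroup // node name -> autoscaled node group
}

func (p *fakeProvider) NodeGroupForNode(node *Node) (NodeGroup, error) {
	// For unknown nodes this returns a typed nil pointer wrapped in the
	// interface, which is exactly the case the reflect check handles.
	return p.groups[node.Name], nil
}

func main() {
	p := &fakeProvider{groups: map[string]*fakeGroup{
		"pool-a-node": {id: "pool-a"},
	}}
	for _, name := range []string{"pool-a-node", "default-pool-node"} {
		ok, err := hasNodeGroup(p, &Node{Name: name})
		fmt.Println(name, ok, err)
	}
}
```

Both the scale-down path and the deleted-node detection could call one helper like this, so the two code paths cannot drift apart in how they interpret a nil node group.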
Hi @x13n & @MaciekPytel,
From my local testing, I'm not sure that NodeGroupForNode will solve the issue. I found that after a node is deleted from the cloud provider, NodeGroupForNode also returns nil for it. Unfortunately, this makes deleted nodes and not autoscaled nodes indistinguishable.
Is there another way to determine not autoscaled nodes?
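The ambiguity can be reproduced with a toy model. The sketch below is purely illustrative and is not the approach taken by any PR: fakeCloud, instanceExists and classify are hypothetical names, and the second signal (asking the provider whether the instance still exists) is only an assumption about one possible way to tell the two cases apart.

```go
package main

import "fmt"

// NodeGroup is a simplified stand-in for the autoscaler's node group type.
type NodeGroup interface {
	Id() string
}

type group struct{ id string }

func (g *group) Id() string { return g.id }

// fakeCloud is a toy provider (assumption, not a real cloud provider API).
type fakeCloud struct {
	instances  map[string]bool   // node name -> instance still exists
	autoscaled map[string]string // node name -> autoscaled node group id
}

// NodeGroupForNode returns nil both for deleted instances and for
// instances whose node group is not autoscaled.
func (c *fakeCloud) NodeGroupForNode(nodeName string) NodeGroup {
	if !c.instances[nodeName] {
		return nil // instance is gone, provider no longer knows it
	}
	id, ok := c.autoscaled[nodeName]
	if !ok {
		return nil // instance exists, but its group is not autoscaled
	}
	return &group{id: id}
}

// instanceExists is a hypothetical second signal used to disambiguate.
func (c *fakeCloud) instanceExists(nodeName string) bool {
	return c.instances[nodeName]
}

type nodeState string

const (
	stateAutoscaled    nodeState = "autoscaled"
	stateNotAutoscaled nodeState = "not-autoscaled"
	stateDeleted       nodeState = "deleted"
)

// classify only calls a node deleted when the instance is really gone.
func classify(c *fakeCloud, nodeName string) nodeState {
	if c.NodeGroupForNode(nodeName) != nil {
		return stateAutoscaled
	}
	if c.instanceExists(nodeName) {
		return stateNotAutoscaled
	}
	return stateDeleted
}

func main() {
	c := &fakeCloud{
		instances:  map[string]bool{"nap-node": true, "default-pool-node": true},
		autoscaled: map[string]string{"nap-node": "nap-pool"},
	}
	for _, name := range []string{"nap-node", "default-pool-node", "removed-node"} {
		fmt.Println(name, c.NodeGroupForNode(name) == nil, classify(c, name))
	}
}
```

In this model the nil check alone gives the same answer for default-pool-node and removed-node, which is the problem described above; some additional provider-side signal is needed to separate them.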
I've submitted a new PR #5054 to reintroduce the code changes that were reverted and to address the issue detailed here. The changes also add new scenarios to the test case for this functionality, including not autoscaled nodes (nodes without a node group), to ensure that they are no longer incorrectly flagged as deleted.
Please review the new changes and let me know if there are any areas of concern or improvement.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.
This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/lifecycle stale
This was fixed.
/close
@x13n: Closing this issue.