Autoscaler not scaling anything if one NodeGroup is down
Which component are you using?: cluster-autoscaler
What version of the component are you using?: v1.27.1
What k8s version are you using (kubectl version)?: v1.23.12
What environment is this in?: OpenStack
What did you expect to happen?:
Cluster scaling is not halted when one NodeGroup contains only NotReady nodes; NodeGroups can scale independently of each other.
What happened instead?:
If all nodes of a NodeGroup are in NotReady state, it prevents the entire cluster from scaling after the cluster-autoscaler is restarted.
NOTE: max-total-unready-percentage and ok-total-unready-count are not violated.
How to reproduce it (as minimally and precisely as possible):
- Create a new NodeGroup (let's call it NG1)
- Kill all Nodes in default worker NodeGroup
- Restart cluster-autoscaler
- Apply some deployment that requires a scale up
- NG1 fails to scale up because there is no nodeInfo for the default worker NodeGroup; the following error is logged:
'Error: static_autoscaler.go:445] Failed to scale up: Could not compute total resources: No node info for: default-worker-08d2a1ab'
Anything else we need to know?:
What I found out is that the cluster-autoscaler needs nodeInfos from all NodeGroups to be able to scale anything. After the cluster-autoscaler is restarted, MixedTemplateNodeInfoProvider.nodeInfoCache is reset, so it lacks the info for the NodeGroup that is entirely down. That leads to this function returning an error when it encounters a missing nodeInfo:
func (m *Manager) coresMemoryTotal(ctx *context.AutoscalingContext, nodeInfos map[string]*schedulerframework.NodeInfo, nodesFromNotAutoscaledGroups []*corev1.Node) (int64, int64, errors.AutoscalerError) {
    var coresTotal int64
    var memoryTotal int64
    for _, nodeGroup := range ctx.CloudProvider.NodeGroups() {
        currentSize, err := nodeGroup.TargetSize()
        if err != nil {
            return 0, 0, errors.ToAutoscalerError(errors.CloudProviderError, err).AddPrefix("failed to get node group size of %v: ", nodeGroup.Id())
        }
        nodeInfo, found := nodeInfos[nodeGroup.Id()]
        if !found {
            // @@@@@
            // after cluster-autoscaler is restarted nodeInfos does not have an entry for the NodeGroup that is entirely NotReady
            // @@@@@
            return 0, 0, errors.NewAutoscalerError(errors.CloudProviderError, "No node info for: %s", nodeGroup.Id())
        }
        if currentSize > 0 {
            nodeCPU, nodeMemory := utils.GetNodeCoresAndMemory(nodeInfo.Node())
            coresTotal = coresTotal + int64(currentSize)*nodeCPU
            memoryTotal = memoryTotal + int64(currentSize)*nodeMemory
        }
    }
    for _, node := range nodesFromNotAutoscaledGroups {
        cores, memory := utils.GetNodeCoresAndMemory(node)
        coresTotal += cores
        memoryTotal += memory
    }
    return coresTotal, memoryTotal, nil
}
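To make the coupling easier to see, here is a minimal, self-contained sketch of the same fail-fast pattern. The types and names (nodeInfo, totalClusterCoresMemory, the group names) are hypothetical simplifications, not the real autoscaler API; it only illustrates how a single node group missing from the nodeInfos map blocks the totals, and therefore scale-up, for every group:

package main

import "fmt"

// Hypothetical simplified model of a node group's template, not the real autoscaler types.
type nodeInfo struct {
    cores, memory int64
}

// totalClusterCoresMemory mirrors the structure of the real function: it iterates over
// all node groups and fails fast on the first group without a cached nodeInfo.
func totalClusterCoresMemory(groups []string, sizes map[string]int64, infos map[string]nodeInfo) (int64, int64, error) {
    var coresTotal, memoryTotal int64
    for _, g := range groups {
        info, found := infos[g]
        if !found {
            return 0, 0, fmt.Errorf("No node info for: %s", g)
        }
        if size := sizes[g]; size > 0 {
            coresTotal += size * info.cores
            memoryTotal += size * info.memory
        }
    }
    return coresTotal, memoryTotal, nil
}

func main() {
    groups := []string{"default-worker", "NG1"}
    sizes := map[string]int64{"default-worker": 3, "NG1": 2}
    // "default-worker" is entirely NotReady, so after a restart its entry is
    // missing from the rebuilt cache; only the healthy NG1 has node info.
    infos := map[string]nodeInfo{"NG1": {cores: 4, memory: 16 << 30}}
    if _, _, err := totalClusterCoresMemory(groups, sizes, infos); err != nil {
        fmt.Println("scale-up blocked:", err) // the healthy NG1 is blocked as well
    }
}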
If this coresMemoryTotal function returns an error, the call stack hits the following if statement in ResourcesLeft, whose comment states that there is no reason to proceed with scale-up:
func (m *Manager) ResourcesLeft(ctx *context.AutoscalingContext, nodeInfos map[string]*schedulerframework.NodeInfo, nodes []*corev1.Node) (Limits, errors.AutoscalerError) {
    nodesFromNotAutoscaledGroups, err := utils.FilterOutNodesFromNotAutoscaledGroups(nodes, ctx.CloudProvider)
    if err != nil {
        return nil, err.AddPrefix("failed to filter out nodes which are from not autoscaled groups: ")
    }
    totalCores, totalMem, errCoresMem := m.coresMemoryTotal(ctx, nodeInfos, nodesFromNotAutoscaledGroups)
    resourceLimiter, errgo := ctx.CloudProvider.GetResourceLimiter()
    if errgo != nil {
        return nil, errors.ToAutoscalerError(errors.CloudProviderError, errgo)
    }
    var totalResources map[string]int64
    var totalResourcesErr error
    if cloudprovider.ContainsCustomResources(resourceLimiter.GetResources()) {
        totalResources, totalResourcesErr = m.customResourcesTotal(ctx, nodeInfos, nodesFromNotAutoscaledGroups)
    }
    resultScaleUpLimits := make(Limits)
    for _, resource := range resourceLimiter.GetResources() {
        max := resourceLimiter.GetMax(resource)
        // we put only actual limits into final map. No entry means no limit.
        if max > 0 {
            if (resource == cloudprovider.ResourceNameCores || resource == cloudprovider.ResourceNameMemory) && errCoresMem != nil {
                // core resource info missing - no reason to proceed with scale up
                return Limits{}, errCoresMem
            }
            ...
As a result, autoscaling of the entire cluster is halted.
It seems like the coresMemoryTotal function couples NodeGroups together in the context of autoscaling: if one is completely down, the others can't scale up. Now imagine a NodeGroup in a completely different availability zone than the other NodeGroups; it goes down entirely, and suddenly autoscaling of all NodeGroups in all availability zones is blocked because of this one NodeGroup. Is this intended behaviour?
I noticed that if I change the function that calculates the total resources from all nodegroups to skip missing nodegroups instead of returning an error, then autoscaling works again:
func (m *Manager) coresMemoryTotal(ctx *context.AutoscalingContext, nodeInfos map[string]*schedulerframework.NodeInfo, nodesFromNotAutoscaledGroups []*corev1.Node) (int64, int64, errors.AutoscalerError) {
    var coresTotal int64
    var memoryTotal int64
    for _, nodeGroup := range ctx.CloudProvider.NodeGroups() {
        currentSize, err := nodeGroup.TargetSize()
        if err != nil {
            return 0, 0, errors.ToAutoscalerError(errors.CloudProviderError, err).AddPrefix("failed to get node group size of %v: ", nodeGroup.Id())
        }
        nodeInfo, found := nodeInfos[nodeGroup.Id()]
        if !found {
            continue // <- continue instead of returning an error if NG info is missing
            // return 0, 0, errors.NewAutoscalerError(errors.CloudProviderError, "No node info for: %s", nodeGroup.Id())
        }
        ...
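For comparison, here is a sketch of that skip behaviour in the same stripped-down model as before (again with hypothetical simplified types and names, not the actual patch). The missing group simply contributes nothing to the totals, while a warning keeps the gap visible in the logs:

package main

import "log"

// Hypothetical simplified model of a node group's template, not the real autoscaler types.
type nodeInfo struct {
    cores, memory int64
}

// totalClusterCoresMemorySkipMissing skips node groups without cached node info
// instead of failing the whole calculation, and logs a warning so the gap stays visible.
func totalClusterCoresMemorySkipMissing(groups []string, sizes map[string]int64, infos map[string]nodeInfo) (int64, int64) {
    var coresTotal, memoryTotal int64
    for _, g := range groups {
        info, found := infos[g]
        if !found {
            log.Printf("warning: no node info for %s, excluding it from resource totals", g)
            continue
        }
        if size := sizes[g]; size > 0 {
            coresTotal += size * info.cores
            memoryTotal += size * info.memory
        }
    }
    return coresTotal, memoryTotal
}

func main() {
    groups := []string{"default-worker", "NG1"}
    sizes := map[string]int64{"default-worker": 3, "NG1": 2}
    // The entirely NotReady group has no cached node info after the restart.
    infos := map[string]nodeInfo{"NG1": {cores: 4, memory: 16 << 30}}
    cores, mem := totalClusterCoresMemorySkipMissing(groups, sizes, infos)
    // NG1 is now considered for scale-up; the down group just adds nothing
    // to the resource-limit totals.
    log.Printf("totals without the missing group: %d cores, %d bytes of memory", cores, mem)
}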
I have no idea what side effects such a change might cause, and I don't know what the purpose is of calculating total resources from all NodeGroups and erroring out when one of them is missing node infos.
I'd be very grateful for your comment on this issue 🙂
Hey,
any updates on this issue? One of our customers is facing the same problem, which leaves the cluster unable to resize.
A solution to this problem would be greatly appreciated! Thanks!
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale