
Autoscaler not scaling anything if one NodeGroup is down


Which component are you using?: cluster-autoscaler

What version of the component are you using?: v1.27.1

What k8s version are you using (kubectl version)?: v1.23.12

What environment is this in?: OpenStack

What did you expect to happen?:

Cluster scaling is not halted when one NodeGroup contains only NotReady nodes; NodeGroups can scale independently of each other.

What happened instead?:

If there is a NodeGroup in which all nodes are in the NotReady state, it prevents the entire cluster from scaling after cluster-autoscaler is restarted.

NOTE: max-total-unready-percentage and ok-total-unready-count are not violated.

How to reproduce it (as minimally and precisely as possible):

  1. Create a new NodeGroup (let's call it NG1)
  2. Kill all Nodes in default worker NodeGroup
  3. Restart cluster-autoscaler
  4. Apply some deployment that requires a scale up
  5. NG1 fails to scale because there is no node info for the default worker NodeGroup; the following error is logged: 'Error: static_autoscaler.go:445] Failed to scale up: Could not compute total resources: No node info for: default-worker-08d2a1ab'

Anything else we need to know?:

What I found is that the cluster autoscaler needs nodeInfos from all NodeGroups to be able to scale anything. After the cluster autoscaler is restarted, MixedTemplateNodeInfoProvider.nodeInfoCache is reset, so it lacks the info for the NodeGroup that is entirely down. That leads to this function returning an error when it encounters a missing nodeInfo:

func (m *Manager) coresMemoryTotal(ctx *context.AutoscalingContext, nodeInfos map[string]*schedulerframework.NodeInfo, nodesFromNotAutoscaledGroups []*corev1.Node) (int64, int64, errors.AutoscalerError) {
	var coresTotal int64
	var memoryTotal int64
	for _, nodeGroup := range ctx.CloudProvider.NodeGroups() {
		currentSize, err := nodeGroup.TargetSize()
		if err != nil {
			return 0, 0, errors.ToAutoscalerError(errors.CloudProviderError, err).AddPrefix("failed to get node group size of %v: ", nodeGroup.Id())
		}

		nodeInfo, found := nodeInfos[nodeGroup.Id()]
		if !found {
			// After cluster-autoscaler is restarted, nodeInfos does not have an
			// entry for the NodeGroup that is entirely NotReady.
			return 0, 0, errors.NewAutoscalerError(errors.CloudProviderError, "No node info for: %s", nodeGroup.Id())
		}

		if currentSize > 0 {
			nodeCPU, nodeMemory := utils.GetNodeCoresAndMemory(nodeInfo.Node())
			coresTotal = coresTotal + int64(currentSize)*nodeCPU
			memoryTotal = memoryTotal + int64(currentSize)*nodeMemory
		}
	}

	for _, node := range nodesFromNotAutoscaledGroups {
		cores, memory := utils.GetNodeCoresAndMemory(node)
		coresTotal += cores
		memoryTotal += memory
	}

	return coresTotal, memoryTotal, nil
}
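
For context, here is my (simplified, untested) understanding of why the map can end up without an entry for that NodeGroup after a restart. All type and function names below are made up for illustration; this is not the real MixedTemplateNodeInfoProvider code:

package main

import "fmt"

// Illustrative stand-ins, not the real cluster-autoscaler types.
type node struct{ name string }
type nodeInfo struct{ source string }

type nodeGroup struct {
	id           string
	readyNode    *node     // nil if every node in the group is NotReady
	templateInfo *nodeInfo // nil if the cloud provider has no node template
}

// buildNodeInfos mimics (very roughly) the lookup order I believe
// MixedTemplateNodeInfoProvider uses: a ready node first, then the in-memory
// cache, then a template from the cloud provider. After a restart the cache is
// empty, and if the provider also has no template, a fully NotReady group ends
// up with no entry in the result map at all.
func buildNodeInfos(groups []nodeGroup, cache map[string]*nodeInfo) map[string]*nodeInfo {
	result := map[string]*nodeInfo{}
	for _, ng := range groups {
		switch {
		case ng.readyNode != nil:
			result[ng.id] = &nodeInfo{source: "ready node " + ng.readyNode.name}
		case cache[ng.id] != nil:
			result[ng.id] = cache[ng.id]
		case ng.templateInfo != nil:
			result[ng.id] = ng.templateInfo
		default:
			// No ready node, cache reset by the restart, no template:
			// the group is silently left out of the map.
		}
	}
	return result
}

func main() {
	groups := []nodeGroup{
		{id: "default-worker-08d2a1ab"},                   // all nodes NotReady
		{id: "NG1", readyNode: &node{name: "ng1-node-0"}}, // healthy group
	}
	infos := buildNodeInfos(groups, map[string]*nodeInfo{}) // cache is empty after restart
	fmt.Println(len(infos), infos["default-worker-08d2a1ab"] == nil) // prints: 1 true
}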

and if coresMemoryTotal returns an error, the call stack hits this if statement, whose comment states that there is no reason to proceed with the scale-up:

func (m *Manager) ResourcesLeft(ctx *context.AutoscalingContext, nodeInfos map[string]*schedulerframework.NodeInfo, nodes []*corev1.Node) (Limits, errors.AutoscalerError) {
	nodesFromNotAutoscaledGroups, err := utils.FilterOutNodesFromNotAutoscaledGroups(nodes, ctx.CloudProvider)
	if err != nil {
		return nil, err.AddPrefix("failed to filter out nodes which are from not autoscaled groups: ")
	}

	totalCores, totalMem, errCoresMem := m.coresMemoryTotal(ctx, nodeInfos, nodesFromNotAutoscaledGroups)

	resourceLimiter, errgo := ctx.CloudProvider.GetResourceLimiter()
	if errgo != nil {
		return nil, errors.ToAutoscalerError(errors.CloudProviderError, errgo)
	}

	var totalResources map[string]int64
	var totalResourcesErr error
	if cloudprovider.ContainsCustomResources(resourceLimiter.GetResources()) {
		totalResources, totalResourcesErr = m.customResourcesTotal(ctx, nodeInfos, nodesFromNotAutoscaledGroups)
	}

	resultScaleUpLimits := make(Limits)
	for _, resource := range resourceLimiter.GetResources() {
		max := resourceLimiter.GetMax(resource)
		// we put only actual limits into final map. No entry means no limit.
		if max > 0 {
			if (resource == cloudprovider.ResourceNameCores || resource == cloudprovider.ResourceNameMemory) && errCoresMem != nil {
				// core resource info missing - no reason to proceed with scale up
				return Limits{}, errCoresMem
			}
...

and as a result, autoscaling of the entire cluster is halted.

It seems like the coresMemoryTotal function couples NodeGroups together in the context of autoscaling, because if one is completely down, the others can't scale up. Now imagine a NodeGroup in a completely different availability zone than the other NodeGroups: it goes down entirely, and suddenly autoscaling of all NodeGroups in all availability zones is blocked because of this one NodeGroup. Is this intended behaviour?

I noticed that if I change the function that calculates the total resources from all NodeGroups to skip missing NodeGroups instead of returning an error, then autoscaling works again:

func (m *Manager) coresMemoryTotal(ctx *context.AutoscalingContext, nodeInfos map[string]*schedulerframework.NodeInfo, nodesFromNotAutoscaledGroups []*corev1.Node) (int64, int64, errors.AutoscalerError) {
	var coresTotal int64
	var memoryTotal int64
	for _, nodeGroup := range ctx.CloudProvider.NodeGroups() {
		currentSize, err := nodeGroup.TargetSize()
		if err != nil {
			return 0, 0, errors.ToAutoscalerError(errors.CloudProviderError, err).AddPrefix("failed to get node group size of %v: ", nodeGroup.Id())
		}

		nodeInfo, found := nodeInfos[nodeGroup.Id()]
		if !found {
			continue // <- continue instead of returning an error if NG info is missing
			//return 0, 0, errors.NewAutoscalerError(errors.CloudProviderError, "No node info for: %s", nodeGroup.Id())
		}
...

I have no idea what side effects that might cause, and I don't know what the purpose is of calculating total resources from all NodeGroups and erroring out if one of them is missing node infos.
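
If skipping really is acceptable, a slightly safer variant of the same workaround might at least log the group it skips so the omission is visible in the autoscaler logs. This is just a sketch of the same fragment as above, untested; klog is what the surrounding cluster-autoscaler code already uses for logging:

		nodeInfo, found := nodeInfos[nodeGroup.Id()]
		if !found {
			// Skip the group instead of failing the whole calculation,
			// but make the omission visible in the autoscaler logs.
			klog.Warningf("No node info for %s, skipping it when computing cores/memory totals", nodeGroup.Id())
			continue
		}
...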

I'd be very grateful for your comment on this issue 🙂

H3Cki avatar Aug 07 '23 11:08 H3Cki

Hey,

any updates on this issue? A customer of ours is facing the same issue, which leaves the cluster unable to resize.

modzilla99 avatar Oct 26 '23 06:10 modzilla99

A solution to this problem would be greatly appreciated! Thanks!

modzilla99 avatar Nov 07 '23 08:11 modzilla99

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Mar 07 '24 14:03 k8s-triage-robot

/remove-lifecycle stale

pawcykca avatar Mar 07 '24 14:03 pawcykca

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jun 19 '24 16:06 k8s-triage-robot

/remove-lifecycle stale

pawcykca avatar Jun 19 '24 16:06 pawcykca