
Cluster autoscaling w/ podAffinity rules

jbreed opened this issue 3 years ago • 4 comments

Which component are you using?:

helm chart - cluster autoscaler

What version of the component are you using?:

Helm chart cluster-autoscaler-9.21.0 (app version 1.23.0)

Component version:

What k8s version are you using (kubectl version)?:

1.22

What environment is this in?:

EKS

What did you expect to happen?:

I deploy a pod that carries a label, and several other pods must schedule onto the same node as it to leverage the Multus bridge adapter. This initial pod has a node affinity that pins it to a specific node group, and the other pods have podAffinity rules matching its label. I expected the original pod to be evicted and re-scheduled onto a new node when the current host can no longer satisfy the combined resource requests of it and its linked pods.
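For illustration, a minimal sketch of that initial pod under the setup above (all names, labels, and the node-group value are hypothetical; eks.amazonaws.com/nodegroup is the label EKS managed node groups apply to their nodes):

```yaml
# Hypothetical primary pod: pinned to a node group via nodeAffinity
# and labeled so dependent pods can co-locate with it.
apiVersion: v1
kind: Pod
metadata:
  name: multus-primary
  labels:
    app: multus-primary
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: eks.amazonaws.com/nodegroup  # EKS managed-node-group label
                operator: In
                values:
                  - multus-group                  # hypothetical node group name
  containers:
    - name: app
      image: example.com/app:latest               # placeholder image
      resources:
        requests:
          cpu: "1"
          memory: 1Gi
```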

What happened instead?:

When resources were exhausted on the host running the original pod, the pods with podAffinity rules targeting it got stuck in a Pending state: scheduling failed with insufficient CPU on that host, while the other nodes reported that the affinity rules did not permit placement there.

How to reproduce it (as minimally and precisely as possible):

Deploy a group of five different pods with labels set and with resource requests sized so that meeting them requires adding nodes. For each of those pods, create a deployment that requests resources and whose podAffinity matches that pod's label. If the original pods all land on the same host, they are never evicted onto new nodes to satisfy the requirements of the affinity-bound pods, even though there would be plenty of resources if the node count grew and the groups were redistributed between nodes.
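A hedged sketch of one such dependent deployment (names, labels, and sizes are hypothetical); the podAffinity term is what ties it to the primary pod's node:

```yaml
# Hypothetical dependent Deployment: must land on the same node as
# the primary pod (topologyKey: kubernetes.io/hostname).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dependent-a
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dependent-a
  template:
    metadata:
      labels:
        app: dependent-a
    spec:
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: multus-primary
              topologyKey: kubernetes.io/hostname
      containers:
        - name: app
          image: example.com/dependent:latest  # placeholder image
          resources:
            requests:
              cpu: "2"  # sized so the full set outgrows one node
```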

Anything else we need to know?:

If this isn't an existing feature, could we turn it into a feature request? The only option I currently see is writing my own pod scheduler to decide where pods should go and manually moving them when the autoscaler adds new nodes to the cluster.

jbreed avatar Oct 07 '22 19:10 jbreed

I added the cluster-autoscaler.kubernetes.io/safe-to-evict: "true" annotation, but this doesn't evict the pod whose label the others use for inter-pod affinity, so they can't all be placed together on a new node with enough resources available.
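For reference, the annotation in question (the key is the real cluster-autoscaler annotation; the rest of the pod is omitted):

```yaml
metadata:
  annotations:
    # Marks the pod as safe for cluster-autoscaler to evict during
    # scale-down; it does not trigger proactive rescheduling.
    cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
```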

This aspect is needed for auto-scaling with inter-pod Affinity rules applied.


Given there is a single pod deployed, if the Helm chart is using the default pod disruption setting (1?), will this prevent the autoscaler from evicting the single pod so it can fit onto a new node?
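For context, a PodDisruptionBudget like the hypothetical one below, with minAvailable: 1 over a single-replica workload, would block any eviction of that pod, including one attempted by the autoscaler during scale-down:

```yaml
# Hypothetical PDB: with one replica and minAvailable: 1, any
# eviction request for the matched pod will be refused.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: multus-primary-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: multus-primary
```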

jbreed avatar Oct 07 '22 19:10 jbreed

I understand the problem here, but I am not sure that cluster autoscaler is the right place to solve it. The only time cluster autoscaler will evict pods is during scale-down, when it detects that the pods running on a node can be moved somewhere else and the node can be terminated. I think it would be a significant change in responsibilities (and a lot of work!) if cluster autoscaler were to start watching the state of the cluster and shuffling things around to satisfy unmet scheduling constraints.

My recommendation here is to write an operator for your workload. The operator can be responsible for watching the resources on the node your initial pod is scheduled to, and if that node doesn't have sufficient resources to fit your remaining pods, the operator can evict the pod (maybe with a taint applied so it doesn't get immediately rescheduled onto the same node?) and let the scheduler try again.
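As a sketch of the eviction step such an operator might take (pod name and namespace are hypothetical), it could POST an Eviction to the pod's eviction subresource, which still honors any PodDisruptionBudget:

```yaml
# Hypothetical Eviction request body; POST it to
# /api/v1/namespaces/default/pods/multus-primary/eviction.
apiVersion: policy/v1
kind: Eviction
metadata:
  name: multus-primary
  namespace: default
```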

Your other option might be to look at a custom scheduler for your cluster, such as Volcano, which may do a better job of scheduling your workloads than the default Kubernetes scheduler.
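For context, workloads opt into an alternative scheduler through spec.schedulerName; for Volcano that looks roughly like this (pod name and image are placeholders):

```yaml
# Pods request a non-default scheduler by name; "volcano" is the
# scheduler name Volcano registers by default.
apiVersion: v1
kind: Pod
metadata:
  name: multus-primary
spec:
  schedulerName: volcano
  containers:
    - name: app
      image: example.com/app:latest  # placeholder image
```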

drmorr0 avatar Oct 08 '22 18:10 drmorr0

@drmorr0

Thanks. Yeah, I would prefer not to have to write my own scheduler; however, it seems that may be the only option at this time.

For affinity rules, it appears the requiredDuringSchedulingRequiredDuringExecution variant is not currently implemented. Once it is, it may handle the use case I am describing.

Should this be a feature of the scheduler itself? It should know when pods are unschedulable due to affinity requirements and re-schedule existing pods to meet them. If the resources are not available, the autoscaler should then be able to bring more nodes online for the scheduler to leverage.

jbreed avatar Oct 10 '22 14:10 jbreed

@drmorr0

I am going to test using pod anti-affinity rules on the main pods. If those schedule onto their own nodes, the pods with affinity rules tied to these top-level pods can be scheduled together with them. As long as the autoscaler will add nodes based on anti-affinity rules, I think this will work, although it is not the most efficient approach.

UPDATE: I tested using pod anti-affinity rules to force the primary pods onto separate nodes. This handled this specific use case by scheduling them on separate nodes altogether. As mentioned, it is not ideal for resource utilization, but it works for what I am doing; I'll just make that node-group component use fewer resources than I was using previously.
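A hedged sketch of that workaround (label hypothetical), as a fragment of each primary pod's spec: every pod carrying the label repels the others, so each lands on its own node and leaves headroom for its dependents:

```yaml
# Hypothetical anti-affinity on the primary pods: no two pods
# carrying role=primary may share a node, forcing one node each.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            role: primary
        topologyKey: kubernetes.io/hostname
```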

jbreed avatar Oct 10 '22 17:10 jbreed

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Jan 08 '23 20:01 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot avatar Feb 07 '23 21:02 k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-triage-robot avatar Mar 09 '23 21:03 k8s-triage-robot

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to the triage bot's /close not-planned command above:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Mar 09 '23 21:03 k8s-ci-robot