
Integration with the Cluster Autoscaler

Open jeffhollan opened this issue 4 years ago • 24 comments

Today KEDA can indirectly influence cluster autoscaling by causing the HPA to schedule too many pods, so the cluster autoscaler kicks in and adds nodes. A common ask is to have KEDA poke the cluster autoscaler even earlier, the same way it pokes the HPA before CPU and memory limits are hit. I'm not sure what integrations make this possible today, but it would be a nice feature to be able to set some cluster threshold, or to have some other way to describe driving cluster scale in addition to HPA scale.

jeffhollan avatar Feb 20 '20 16:02 jeffhollan

Certainly an interesting scenario if you ask me!

Would you use the same component or split them? Maybe it's good to have separation between app & cluster autoscaling so people can pick which component they are interested in.

tomkerkhove avatar Feb 20 '20 16:02 tomkerkhove

@melmaliacone I know you were interested in looking into some more "Kubernetes deep" features - this may be a good one. Also @jaypipes mentioned at the SIG-Runtime meeting he'd be interested to help collaborate as well 👍

jeffhollan avatar Feb 20 '20 17:02 jeffhollan

Awesome! I'd propose drafting a design spec for this one on how it would work and what it would look like - or is that overkill? I actually liked that approach with the introduction of the auth spec.

We could move those to design-proposals/ or so.

tomkerkhove avatar Feb 20 '20 17:02 tomkerkhove

Hi @jeffhollan, team,

In AKS, to indirectly trigger the Cluster Autoscaler from an HPA custom metric, we are using low-priority pods as a buffer to overprovision the nodes. Depending on the number of pods you want to buffer, we can configure how fast to scale.

More details: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-configure-overprovisioning-with-cluster-autoscaler
Helm: https://hub.helm.sh/charts/stable/cluster-overprovisioner

Hieu
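For readers unfamiliar with the pattern, a minimal sketch of the overprovisioning approach from the linked FAQ might look like the following; the namespace, replica count, and resource requests are illustrative only:

```yaml
# Illustrative buffer: a negative-priority class plus a "pause" Deployment
# that reserves capacity. When real workloads arrive, these pods are evicted
# and the Cluster Autoscaler adds nodes to reschedule the buffer.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1
globalDefault: false
description: "Priority class for placeholder buffer pods."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning-buffer
  namespace: overprovisioning   # example namespace
spec:
  replicas: 3                   # buffer size: tune to the headroom you want
  selector:
    matchLabels:
      app: overprovisioning-buffer
  template:
    metadata:
      labels:
        app: overprovisioning-buffer
    spec:
      priorityClassName: overprovisioning
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
```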

hieumoscow avatar Mar 23 '20 10:03 hieumoscow

That is correct, we don't have to do anything on our end since it's based on cluster resources, which will be implicitly impacted by the HPAs.

tomkerkhove avatar Mar 23 '20 11:03 tomkerkhove

Hi @tomkerkhove, What I meant is we had to do this manually in AKS for a few customers who want the cluster autoscaler to kick in earlier rather than waiting for CPU & memory limits to be hit. It is not built in by default. Thus, within KEDA, we could implement a solution where KEDA adjusts the buffer value for low-priority pods to control how fast the Cluster Autoscaler provisions new nodes to cope with the events.

hieumoscow avatar Mar 24 '20 11:03 hieumoscow

I'm not sure that's actually our responsibility, as we solely provide application autoscaling and the rest is part of cluster autoscaling.

We could provide another component, but I'm not really sure what it would do then? Tell the CA to scale out by changing the buffer?

tomkerkhove avatar Mar 24 '20 12:03 tomkerkhove

I think this will depend on how sensitive the application is to a scaling delay. For KEDA or any other event-based framework, the scaling bottleneck is most likely to be the CA.

Either, as you said, have another component that tells the CA how fast to scale via the buffer. Alternatively, I see a potential fit with an incubator project, the Cluster Proportional Autoscaler (CPA). It is based on the same overprovisioner concept, but the buffer size can be defined to be proportional to the cluster size (e.g. 10% of cluster cores). Thus, the more events KEDA detects and tells the HPA to scale out for, the bigger the buffer CPA maintains, which in turn drives how fast the CA scales out. I could look into doing a POC here.
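As a rough illustration of that idea, the CPA is typically driven by a small ConfigMap; a hedged sketch (the numbers are purely illustrative, and it assumes a CPA deployment pointed at the buffer Deployment above via its --target flag) could look like:

```yaml
# Hypothetical CPA config: size the low-priority buffer proportionally to
# cluster cores, so the CA headroom grows as the cluster grows.
apiVersion: v1
kind: ConfigMap
metadata:
  name: overprovisioning-autoscaler
  namespace: overprovisioning   # example namespace
data:
  linear: |-
    {
      "coresPerReplica": 10,
      "min": 1,
      "max": 50
    }
```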

hieumoscow avatar Mar 25 '20 15:03 hieumoscow

I think what you are looking for is Virtual Nodes to overflow to until the CA catches up, but we will evaluate.

What do you think @jeffhollan @zroubalik?

tomkerkhove avatar Mar 25 '20 17:03 tomkerkhove

Several people have expressed interest in the potential for some use of KEDA to solve cluster autoscaling challenges. So that we don't try to solve all possible scenarios, I started a document to gather use cases to focus this discussion around.

craiglpeters avatar Jun 01 '20 17:06 craiglpeters

Thanks - I've added a section with alternatives such as Virtual Nodes on AKS which solves exactly this scenario.

Personally I'm not sure yet where KEDA can help and if we should do it, or if we should bring this to CA team - But if we can help, why not!

tomkerkhove avatar Jun 02 '20 14:06 tomkerkhove

Just to be clear - It's not that I don't think it's a good idea but merely making sure we are fixing gaps and not reinventing the wheel!

tomkerkhove avatar Jun 02 '20 15:06 tomkerkhove

I really like the idea of event-driven cluster autoscaling, but I'm not sure exactly how this might work. In fact, I'm not even sure I understand the e2e journey of architecting a pod-scaled workflow with Keda.

Let's take, for example, an SQS Queue with some length. Keda is configured to scale on a threshold of 5 messages.
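For concreteness, a hedged sketch of that configuration might look like the following (the queue URL, deployment name, and region are placeholders):

```yaml
# Illustrative ScaledObject for the scenario above.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: sqs-consumer-scaler
spec:
  scaleTargetRef:
    name: sqs-consumer            # the Deployment that drains the queue
  minReplicaCount: 0
  maxReplicaCount: 20
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/my-queue
        queueLength: "5"          # target of ~5 messages per replica
        awsRegion: "us-east-1"
```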

  • How many nodes should the autoscaler scale up?
  • Is 1 node -> 5 messages the threshold? What if there are 1,000 messages?
  • Can messages be of a different weight?
  • Who is responsible for dequeuing messages out of the queue?

I think I'm probably missing something fundamental about Keda. My understanding is that for pods, Keda can scale up a new pod based off of a queue threshold, and then the pod is responsible for draining the queue. For nodes, this doesn't make as much sense to me as there's no agent to pop a message off of the queue. It also isn't clear to me what mechanism stops the cluster from infinitely scaling if the queue doesn't drain.

ellistarn avatar Jul 17 '20 00:07 ellistarn

I think I'm probably missing something fundamental about Keda. My understanding is that for pods, Keda can scale up a new pod based off of a queue threshold, and then the pod is responsible for draining the queue.

Your understanding is correct.

We are looking at how we can help the CA scale because of spikes and the like that we are seeing, but whether it would make sense is still under investigation. Personally, I'm not convinced yet that this is something we can add enough value to.

tomkerkhove avatar Jul 17 '20 07:07 tomkerkhove

Some more reasons to have integrations with the Cluster Autoscaler: https://stackoverflow.com/questions/63495899/using-multiple-autoscaling-mechanisms-to-autoscale-a-k8s-cluster

raravena80 avatar Aug 19 '20 23:08 raravena80

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Nov 27 '21 08:11 stale[bot]

The Cluster Autoscaler does not provide any integration, and we want to solely focus on application autoscaling.

However, sometimes there are cases where we know that a lot of work will arrive pretty soon and we will need more capacity. Examples are:

  • Predictive app scaling through platforms such as PredictKube and other potential prediction platforms in the future
  • Scaling sources that have high volumes of work
  • Custom metrics from end-users that can provide indications here

So I've been thinking about how we can help without building our own autoscaler, and I landed on this rough idea: leverage the power of our /scale subresource for CRDs to spin up pod buffers that will trigger the cluster autoscaler.

Imagine that the KEDA community or KEDA provides an external scaler/add-on that allows you to:

  1. Create an instance of a new CRD (~ScaledNodePool) that points to a pool of nodes in your cluster
  2. A ScaledObject can be created with an instance of the above CRD as the scale target
  3. Any triggers that are a good fit can be used as triggers for node scaling
  4. KEDA asks the CRD to scale
  5. Add-on will spin up the pod buffer in its namespace and wait for new nodes to be added

This should clearly not be part of KEDA core but could be a way to:

  1. Provide proactive cluster autoscaling
  2. Simplify the whole pod buffer workaround so that people should not have to worry about it

The tricky thing, though, might be the decision making on when to remove the pod buffer and all that, but that's my rough idea.
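To make the idea a bit more concrete, a rough sketch of steps 2-4 could look like the following, assuming the add-on registers a ScaledNodePool CRD that exposes a /scale subresource (the API group, kind, and trigger below are purely hypothetical):

```yaml
# Rough illustration of the proposal above; ScaledNodePool does not exist today.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: proactive-node-scaler
spec:
  scaleTargetRef:
    apiVersion: addons.keda.sh/v1alpha1   # hypothetical add-on API group
    kind: ScaledNodePool                  # hypothetical CRD exposing /scale
    name: default-node-pool
  triggers:
    - type: aws-sqs-queue                 # any trigger signalling incoming work
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/my-queue
        queueLength: "100"
        awsRegion: "us-east-1"
```

Scaling the ScaledNodePool would cause the add-on to grow a pod buffer, which in turn triggers node provisioning - the same overprovisioning mechanism described earlier in this thread, just driven proactively by KEDA triggers.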

tomkerkhove avatar Feb 09 '22 13:02 tomkerkhove

This is based on a scenario that @denniszielke had for a KEDA end-user.

tomkerkhove avatar Feb 09 '22 14:02 tomkerkhove

Yeah, I have a similar idea, where the new CRD is responsible for the autoscaling.

+1 and also who will implement this? :)

zroubalik avatar Feb 17 '22 15:02 zroubalik

re: https://github.com/kedacore/keda/issues/637#issuecomment-1033790013

FWIW, Karpenter v0.1.1 implemented exactly this: https://github.com/aws/karpenter/tree/v0.1.1/pkg/apis/autoscaling/v1alpha1.

We found that the sticking point with users was finding useful signals to scale the node groups.

The tricky thing though, might be the decision making on when to remove the pod buffer et all but that's my rough idea.

Unless your use case has a simple mapping of pods to nodes, or you have very low diversity of nodes in a cluster, the signal is going to be so complicated that you're back to using pending pods as a signal.

My current hypothesis for this problem is that this could be best solved via a predictive pod scaler. If you could preview the pod signal by ~3 minutes or so, that would entirely eliminate the slow node provisioning problem. Further, this will help with the latency introduced at the pod level, like pulling big images, hydrating caches, etc.

ellistarn avatar Feb 17 '22 17:02 ellistarn

@ellistarn Do you support /scale subresource so we could scale Karpenter pro-actively?

tomkerkhove avatar Feb 18 '22 06:02 tomkerkhove

@ellistarn Do you support /scale subresource so we could scale Karpenter pro-actively?

We moved away from this approach, since we decided that the best signal for our customers was pending pods. As mentioned, we're hoping for predictive pod scalers to solve the overprovisioning problem.

ellistarn avatar Feb 18 '22 06:02 ellistarn

I was not aware of that move, sorry. Allow me to ask - How does it differentiate from Cluster Autoscaler then?

tomkerkhove avatar Feb 18 '22 06:02 tomkerkhove

It doesn't rely on node groups and collapses the combinatorial node-group expansion problem of the CA. There are a ton of other minor points that make it generally more usable, faster to spin up nodes, and able to work at higher scale. Worth checking out the docs or blogs if you're interested.

Fwiw, we have a bunch of folks using Karpenter and KEDA together.

ellistarn avatar Feb 18 '22 17:02 ellistarn