
Change the source of truth for required capacity in cluster autoscaler


@mattmb following up on our chat yesterday:

  • The cluster autoscaler currently looks at resource usage per slave; it doesn't make any decisions based on 'what' the tasks are.
  • During a crossover bounce of a 'hungry' app (ym), Marathon launches as many instances as possible, filling the slack capacity, and then launches the rest as resources are freed by drained/killed tasks. If the cluster autoscaler runs just after Marathon has filled all the space it can, it will conclude that we need to grow by as many slaves as it takes to bring utilization back down to 80%, even though we don't need to grow at all: the crossover bounce will kill off the old tasks and bring us back to normal (illustrated below).
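A rough illustration of the overshoot (the numbers, slave sizes, and function are made up for this example; this is not the actual autoscaler code):

```python
# Illustrative only: how a utilization-target autoscaler overshoots during a
# crossover bounce. The 80% target comes from the discussion above; the pool
# size and per-slave CPU count are hypothetical.
import math

TARGET_UTILIZATION = 0.8

def desired_slaves(current_slaves, used_cpus, cpus_per_slave):
    """Size the pool so that used_cpus / capacity == TARGET_UTILIZATION."""
    return max(
        current_slaves,
        math.ceil(used_cpus / (cpus_per_slave * TARGET_UTILIZATION)),
    )

# Steady state: 100 slaves * 10 CPUs, 800 CPUs used -> exactly at 80%.
print(desired_slaves(100, 800, 10))   # 100

# Mid-bounce: the new app version has filled the slack, so 1000 CPUs look
# "used" even though half of the old tasks are about to be drained.
print(desired_slaves(100, 1000, 10))  # 125 -- a scale-up we don't need
```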

As a solution, rather than using the Mesos slaves as the source of truth for demand, should we be calculating it based on the resources specified per service-instance (and ignoring multiple versions so we don't count extra capacity whilst we're bouncing)? We'll probably have to get all tasks from Marathon, dedupe by service/instance and calculate the required resources for each. Likewise for Chronos, though there we'll have to filter down to the tasks that are actually running at the time.
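A minimal sketch of that calculation, assuming PaaSTA-style Marathon app ids of the form `/service.instance.<gitsha>.<configsha>` and using the plain Marathon `/v2/apps` endpoint; the Marathon URL and the "take the larger version" dedupe rule are assumptions, not an agreed design:

```python
# Sketch: compute demand from Marathon app definitions instead of slave usage.
import requests

MARATHON_URL = "http://marathon.example.com:8080"  # placeholder

def required_resources():
    apps = requests.get(f"{MARATHON_URL}/v2/apps").json()["apps"]
    demand = {}  # (service, instance) -> (cpus, mem)
    for app in apps:
        service, instance, *_ = app["id"].lstrip("/").split(".")
        cpus = app["cpus"] * app["instances"]
        mem = app["mem"] * app["instances"]
        # Dedupe across versions: during a crossover bounce both the old and
        # new app exist, but we only want to count that capacity once, so keep
        # the larger of the two rather than summing them.
        prev = demand.get((service, instance), (0, 0))
        demand[(service, instance)] = (max(prev[0], cpus), max(prev[1], mem))
    total_cpus = sum(c for c, _ in demand.values())
    total_mem = sum(m for _, m in demand.values())
    return total_cpus, total_mem
```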

This does introduce extra complexity: we'll have to do the work to group apps according to constraints so that we can calculate the right demand per pool. We also lose insight into ghost tasks and remote-run tasks.
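The per-pool grouping could look something like this, assuming the pool is carried in each app's Marathon constraints as `["pool", "LIKE", "<pool name>"]` (again, just a sketch, not a worked-out design):

```python
# Sketch: bucket per-app demand by pool so each pool can be sized separately.
from collections import defaultdict

def demand_by_pool(apps):
    """apps: list of Marathon app dicts, as returned by /v2/apps."""
    pools = defaultdict(lambda: {"cpus": 0.0, "mem": 0.0})
    for app in apps:
        pool = "default"
        for constraint in app.get("constraints", []):
            if constraint[0] == "pool" and len(constraint) >= 3:
                pool = constraint[2]
        pools[pool]["cpus"] += app["cpus"] * app["instances"]
        pools[pool]["mem"] += app["mem"] * app["instances"]
    return dict(pools)
```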

Rob-Johnson avatar Mar 16 '17 16:03 Rob-Johnson

Thanks for writing this up @Rob-Johnson 😄

mattmb avatar Mar 16 '17 17:03 mattmb

Is it really a bug that the cluster autoscaler is trying to scale up the cluster when it is at 100%, even if "we know" that it is "just a bounce"?

Isn't this "just" a function of time and rate of change? If the autoscaler worked "instantly" then this would not be a problem. The root cause of this phenomenon is that the autoscaler's responsiveness is lower than that of Marathon + Docker: we don't want spikes in the services to make the cluster overshoot.

To me this is "just" PID tuning. The autoscaler should scale up proportionally to the rate of change, taking into account how long it takes (delay) for the new capacity to come up.
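A toy version of that kind of controller, purely for illustration; the gains, the tick-based delay compensation, and the function name are invented here:

```python
# Toy proportional/derivative controller sketch for the "scale with the rate
# of change, accounting for provisioning delay" idea. All constants are made up.
def scale_delta(utilization, prev_utilization, target=0.8,
                kp=50.0, kd=200.0, provisioning_delay_ticks=3):
    """Return how many slaves to add (or remove) this tick.

    kp reacts to the current error against the target; kd reacts to the rate
    of change, projected forward by how many ticks new capacity takes to come
    up, so a fast-moving spike triggers an earlier (but bounded) response.
    """
    error = utilization - target
    rate = utilization - prev_utilization
    # Project where utilization will be by the time new slaves are usable.
    projected_error = error + rate * provisioning_delay_ticks
    return round(kp * projected_error + kd * rate)
```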

solarkennedy avatar Mar 16 '17 22:03 solarkennedy