autoscaling

Remove scheduler plugin "buffer" resources

Open sharnoff opened this issue 5 months ago • 4 comments

Problem description / Motivation

To prevent accidental overcommitting, on startup the scheduler plugin keeps a measure of "uncertainty" for each VM's usage (the "buffer" resources). This uncertainty is resolved only when the autoscaler-agent makes a request to the scheduler plugin informing it of the agent's intentions for that VM.

This has two issues:

  1. Forced to choose between "unavailable" and "inaccurate", we have chosen "unavailable". In practice, unavailability is much worse (higher risk) than inaccuracy.
  2. There's a short (~5s) period of unavailability immediately after startup. When we add replicas and leader election for the scheduler, this unavailability could be frequent enough to cause liveness issues for the scheduler (i.e. it may be unable to schedule for extended periods of time).

Feature idea(s) / DoD

Uncertainty in the scheduler plugin's scheduling state should not cause unavailability.

Implementation ideas

Instead of keeping "buffer" resources, we should remove them entirely and be willing to make inaccurate scheduling decisions. The worst-case scenario is that we accidentally overcommit by a little bit; in practice, real resource usage in our clusters is much lower than reserved resources, so we have wiggle room.
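
To make the trade-off concrete, here is a rough sketch of the difference between counting buffer towards a node's committed resources and ignoring it. This is not the plugin's actual code: `vmState`, `nodeState`, `committed`, and `canSchedule` are hypothetical names, and the numbers are made up for illustration.

```go
package main

import "fmt"

// vmState is a hypothetical, simplified view of what the plugin tracks per VM.
type vmState struct {
	reserved uint // resources the plugin believes the VM is using
	buffer   uint // extra "uncertainty" held until the autoscaler-agent checks in
}

// nodeState is a hypothetical per-node summary.
type nodeState struct {
	total uint
	vms   []vmState
}

// committed sums the reserved resources on the node, optionally
// including each VM's buffer.
func (n nodeState) committed(includeBuffer bool) uint {
	var sum uint
	for _, vm := range n.vms {
		sum += vm.reserved
		if includeBuffer {
			sum += vm.buffer
		}
	}
	return sum
}

// canSchedule reports whether a new request fits on the node.
func (n nodeState) canSchedule(request uint, includeBuffer bool) bool {
	return n.committed(includeBuffer)+request <= n.total
}

func main() {
	// Just after startup: no agent has checked in yet, so every VM still has buffer.
	node := nodeState{
		total: 16,
		vms:   []vmState{{reserved: 4, buffer: 4}, {reserved: 2, buffer: 6}},
	}
	fmt.Println("with buffer:", node.canSchedule(2, true))
	fmt.Println("without buffer:", node.canSchedule(2, false))
}
```

In this toy model the buffered check rejects the request until agents resolve the uncertainty, while the unbuffered check accepts it and risks a small overcommit if the VMs are in fact using more than `reserved`.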

sharnoff, Mar 01 '24 07:03