autoscaling
Remove scheduler plugin "buffer" resources
Problem description / Motivation
To prevent accidental overcommitting, the scheduler plugin starts up with a measure of "uncertainty" for each VM's usage. That uncertainty is resolved only when the autoscaler-agent makes a request to the scheduler plugin to inform it of its intentions.
This has two issues:
- Forced to choose between "unavailable" and "inaccurate", we have chosen "unavailable". In practice, unavailability is much worse (and higher risk) than inaccuracy.
- There's a short (~5s) period of unavailability immediately after startup. Once we add replicas and leader election for the scheduler, this unavailability could be frequent enough to cause liveness issues (i.e. the scheduler may be unable to schedule for extended periods of time).
Feature idea(s) / DoD
Scheduler plugin scheduling uncertainty should not cause unavailability
Implementation ideas
Instead of keeping "buffer" resources, we should remove them entirely and be willing to make inaccurate scheduling decisions. The worst-case scenario is that we accidentally overcommit by a little bit. In practice, real resource usage in our clusters is much lower than reserved resources, so we have wiggle room.
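As a rough illustration of the trade-off, here is a minimal sketch. It assumes the buffer is counted against node capacity when deciding whether something fits. The `Resources` and `Node` types and both methods are hypothetical and are not taken from the plugin's actual code.

```go
package main

// Resources is a hypothetical scalar resource amount (e.g. CPU in milli-units).
type Resources uint64

// Node is a hypothetical, simplified view of the scheduler's per-node state.
type Node struct {
	Total    Resources // allocatable capacity of the node
	Reserved Resources // resources known to be reserved by VMs
	Buffer   Resources // extra "uncertainty" held until the autoscaler-agent checks in
}

// canScheduleWithBuffer models the assumed current behavior: unresolved
// buffer is counted against capacity, so requests may be refused until
// every autoscaler-agent has reported its intentions.
func (n Node) canScheduleWithBuffer(req Resources) bool {
	return n.Reserved+n.Buffer+req <= n.Total
}

// canScheduleWithoutBuffer models the proposal: only known reservations
// count, accepting that the decision may be slightly inaccurate.
func (n Node) canScheduleWithoutBuffer(req Resources) bool {
	return n.Reserved+req <= n.Total
}
```

For example, with `Node{Total: 16000, Reserved: 8000, Buffer: 6000}` and a request of `4000`, the buffered check refuses the request, while the unbuffered check accepts it. If the buffer later turns out to have been real usage, the node ends up overcommitted by 2000, which is the "little bit" of inaccuracy the proposal is willing to accept.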