
scheduler plugin should de-prioritize newer nodes

Open · sharnoff opened this issue 5 months ago

Problem description / Motivation

Currently the load on the scheduler is somewhat unusual: computes have (usually) short but uneven lifetimes, and varying external load produces regular usage spikes.

This load sometimes interacts with our node scoring algorithm to result in chaotic (in the mathematical sense) and cyclical fluctuations in reserved resources on the nodes. This has a single primary effect:

  • We fail to produce low-usage nodes even when the cluster has enough spare capacity to get rid of a node (meaning we remain overprovisioned)

In particular, this happens most visibly when a node is added due to external demand — sometimes it is removed after demand returns to normal, but sometimes another node's usage goes down instead (but not far enough to be removed).

Here's a recent example:

(Figure: graph of reserved CPU on nodes in us-west-2, showing a node added with spiky usage between 10-40%; an hour later its usage swaps with one of the many nodes at 80% reserved, and that other node's usage slowly decreases afterwards towards 20-25%.)

Discussion here: https://neondb.slack.com/archives/C03TN5G758R/p1709660933447909

Feature idea(s) / DoD

To mitigate the issues above, the scheduler plugin should de-prioritize newer nodes, providing both a consistent ordering (preventing usage from "swapping" between nodes) and explicitly prioritizing the removal of nodes that were added to satisfy immediate demand (which will have fewer long-running computes).

Implementation ideas

From the slack thread linked above:

I'm imagining that the new node scoring algorithm should be the following (note scores are always 0 to 100).

  • If a node's usage is >85%
    • Score is 33 * (1 - (usage fraction - 0.85)) — i.e. higher usage is worse
  • Else, if it's one of the youngest ceil(20% of N) nodes:
    • Score is 33 + min(33, rank within youngest nodes) — i.e. younger is worse, where the youngest node has rank 1, the second-youngest rank 2, etc.
  • Otherwise
    • Score is 66 + (usage fraction * 33) — i.e. higher usage is better

(specific numbers to replace 85 and 20 TBD)
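
As a concrete illustration, here's a minimal Go sketch of the scoring above (function and parameter names like `scoreNode`, `usageFraction`, and `youngRank` are hypothetical, not taken from the plugin; the 0.85 and 0.20 constants are the TBD placeholders):

```go
package main

import (
	"fmt"
	"math"
)

// scoreNode sketches the proposed algorithm. Scores are always in [0, 100],
// and the scheduler prefers placing computes on higher-scoring nodes.
//
//   - usageFraction: reserved resources on the node, as a fraction of capacity
//   - youngRank: 1 for the youngest node, 2 for the second-youngest, etc.
//   - totalNodes: N, the number of nodes being scored
func scoreNode(usageFraction float64, youngRank int, totalNodes int) int64 {
	youngCutoff := int(math.Ceil(0.20 * float64(totalNodes))) // "20%" TBD

	switch {
	case usageFraction > 0.85: // "85%" TBD
		// Higher usage is worse: maps (0.85, 1.0] down from ~33 to ~28.
		return int64(33 * (1 - (usageFraction - 0.85)))
	case youngRank <= youngCutoff:
		// Younger (smaller rank) is worse: youngest scores 34, next 35, ...
		// (min here is the Go 1.21+ builtin)
		return 33 + int64(min(33, youngRank))
	default:
		// Higher usage is better: maps [0, 0.85] to [66, ~94].
		return 66 + int64(usageFraction*33)
	}
}

func main() {
	fmt.Println(scoreNode(0.30, 1, 10)) // young, lightly-used node: 34
	fmt.Println(scoreNode(0.80, 5, 10)) // older, well-packed node:  92
	fmt.Println(scoreNode(0.90, 5, 10)) // overloaded node:          31
}
```

With this shape, a node freshly added for a demand spike scores below every established (non-overloaded) node, so the scheduler fills older nodes first and the new node stays easy to drain once demand subsides.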

sharnoff · Mar 06 '24