Run jobs in waves instead of starting on all nodes at once

Open pmengelbert opened this issue 2 years ago • 0 comments

Describe the solution you'd like:

Consider the following situation:

  • A cluster is running 600 nodes, and each node runs about 30 pods
  • The operators scale the cluster based on the number of pods per node, because this particular CNI reserves IP addresses aggressively
  • If Eraser is scheduled on a node and has a high enough priority class, it will evict another pod
  • If Eraser runs on all nodes at the same time (600 in this case), it causes a large disruption: every node evicts a pod at once

A potential solution is to disturb only part of the system at a time by running the job in phases. For example, we could run the job on ceil(nodes / 5.0) nodes at a time, and stop once all nodes have completed a run.
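To make the arithmetic concrete, here is a minimal sketch of the batching step only, written in Python for illustration. It is not Eraser code: the function name `plan_waves`, the `num_waves` parameter, and the `node-N` names are all hypothetical, and the sketch does not touch the Kubernetes API or wait for jobs to finish.

```python
import math
from typing import Iterator, List, Sequence


def plan_waves(nodes: Sequence[str], num_waves: int = 5) -> Iterator[List[str]]:
    """Yield successive batches of at most ceil(len(nodes) / num_waves) nodes.

    Hypothetical helper: the divisor of 5 comes from the proposal above,
    everything else is an assumption made for illustration.
    """
    if num_waves < 1:
        raise ValueError("num_waves must be at least 1")
    if not nodes:
        return
    wave_size = math.ceil(len(nodes) / num_waves)
    for start in range(0, len(nodes), wave_size):
        yield list(nodes[start:start + wave_size])


if __name__ == "__main__":
    nodes = [f"node-{i}" for i in range(600)]
    for index, wave in enumerate(plan_waves(nodes), start=1):
        # A real controller would run the job on this wave and wait for
        # every node in it to complete before starting the next wave.
        print(f"wave {index}: {len(wave)} nodes")
```

With 600 nodes this yields 5 waves of 120 nodes each, so at most 120 nodes evict a pod at the same time. When the node count does not divide evenly, the last wave is smaller, and fewer than `num_waves` waves may be needed (7 nodes give waves of 2, 2, 2, and 1).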

Anything else you would like to add: This would definitely complicate the logic.

Environment:

  • Eraser version: v1.0.0-beta.3
  • Kubernetes version: (use kubectl version):

pmengelbert · Jan 12 '23 18:01