flux-sched Support optimizations of Rabbit allocations based on HPE co-design heuristics


Open jameshcorbett opened this issue 3 years ago • 0 comments


Rabbit Load Balancing

Ensure that allocations are spread as broadly across the rabbits as possible, so that no rabbit becomes a "hotspot"
Ensure that MDTs (Lustre metadata targets) from different jobs are placed on different rabbits as much as possible, since an MDT is very CPU-intensive
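
To make the two goals above concrete, here is a minimal sketch of a greedy least-loaded placement. It is not what flux-sched or the prototype PR does. The function name, the `(job_id, size_gib, is_mdt)` request tuples, and the GiB units are all assumptions made for illustration.

```python
from collections import defaultdict

def pick_rabbits(rabbits, requests):
    """Greedy least-loaded placement (illustrative only).

    rabbits:  dict mapping rabbit name -> capacity in GiB (assumed units).
    requests: list of (job_id, size_gib, is_mdt) tuples (hypothetical shape).
    Returns a dict mapping job_id -> chosen rabbit name.
    """
    used = defaultdict(int)       # GiB already allocated on each rabbit
    mdt_count = defaultdict(int)  # number of MDTs hosted by each rabbit
    placement = {}
    for job_id, size, is_mdt in requests:
        candidates = [r for r, cap in rabbits.items() if cap - used[r] >= size]
        if not candidates:
            raise RuntimeError(f"no rabbit can hold {size} GiB for {job_id}")
        if is_mdt:
            # MDTs are CPU-heavy: avoid rabbits that already host one,
            # then fall back to the least-filled rabbit.
            key = lambda r: (mdt_count[r], used[r] / rabbits[r], r)
        else:
            # Ordinary storage: least-filled rabbit first.
            key = lambda r: (used[r] / rabbits[r], mdt_count[r], r)
        best = min(candidates, key=key)
        used[best] += size
        if is_mdt:
            mdt_count[best] += 1
        placement[job_id] = best
    return placement
```

Sorting on fill fraction rather than absolute usage keeps the sketch sensible when rabbits have different capacities. The rabbit name is only a deterministic tie-breaker.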

Dragonfly Topology

Start by allowing a job to be scheduled entirely within a single pod, as an alternative to free-form scheduling across the entire cluster
Move towards minimizing the distance between compute node pods and rabbit pods
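
The two stages above could look roughly like this. It is a sketch under stated assumptions: the `node_pod` and `rabbit_pod` mappings and the caller-supplied `pod_distance` function are hypothetical and do not come from flux-sched.

```python
def single_pod_candidates(job_nodes, node_pod, rabbit_pod):
    """Stage 1: rabbits in the one pod that holds all of the job's nodes.

    Returns an empty list if the job's nodes span more than one pod.
    """
    pods = {node_pod[n] for n in job_nodes}
    if len(pods) != 1:
        return []
    (pod,) = pods
    return sorted(r for r, p in rabbit_pod.items() if p == pod)

def rank_by_distance(job_nodes, node_pod, rabbit_pod, pod_distance):
    """Stage 2: order every rabbit by total pod distance to the job's nodes.

    pod_distance(a, b) is an assumed callable returning a hop-like cost
    between two pods; the real metric would come from the fabric.
    """
    def cost(rabbit):
        return sum(pod_distance(node_pod[n], rabbit_pod[rabbit]) for n in job_nodes)
    return sorted(rabbit_pod, key=lambda r: (cost(r), r))
```

For example, with `pod_distance = lambda a, b: 0 if a == b else 1`, rabbits that share a pod with most of the job's nodes sort first.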

Rabbit Spread Policy

Allow a single job to request that its storage be spread over multiple rabbits (for bandwidth) without having to allocate all of the storage on those rabbits
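
As a rough illustration of the spread policy, the sketch below splits one capacity request evenly over a minimum number of rabbits, so each rabbit contributes only a slice of its storage. The function name, its parameters, and the whole-GiB units are assumptions for this example, not an existing interface.

```python
def spread_allocation(total_gib, min_rabbits, free_gib):
    """Split total_gib (an integer) across exactly min_rabbits rabbits.

    free_gib: dict mapping rabbit name -> free GiB (hypothetical input).
    Returns a dict mapping rabbit name -> GiB taken from that rabbit.
    """
    if min_rabbits < 1:
        raise ValueError("min_rabbits must be at least 1")
    if len(free_gib) < min_rabbits:
        raise RuntimeError("not enough rabbits to satisfy the spread")
    # Prefer the rabbits with the most free space; name breaks ties.
    chosen = sorted(free_gib, key=lambda r: (-free_gib[r], r))[:min_rabbits]
    base, extra = divmod(total_gib, min_rabbits)
    shares = {}
    for i, rabbit in enumerate(chosen):
        share = base + (1 if i < extra else 0)  # hand out the remainder 1 GiB at a time
        if share > free_gib[rabbit]:
            raise RuntimeError(f"{rabbit} cannot hold its {share} GiB share")
        shares[rabbit] = share
    return shares
```

A job asking for 100 GiB over 4 rabbits would take 25 GiB from each, leaving the rest of each rabbit free for other jobs.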

Dragonfly topology optimization needs to be configurable per job. If the possible values are "within a pod", "same rack as nodes", and "no constraints", we will want to be able to choose among them for each job. (This could be an actual part of the jobspec resource section.)

The same goes for packing rabbits into the smallest possible number of pods and for minimizing the distance between nodes and rabbits. Some users will want that and others won't.
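
One way to picture the per-job knob is sketched below. The three values are the ones named above; the `Locality` enum, the `Location` type, and `allowed_rabbits` are hypothetical and say nothing about how jobspec would actually encode this.

```python
from dataclasses import dataclass
from enum import Enum

class Locality(Enum):
    """Per-job topology constraint (values taken from the discussion above)."""
    WITHIN_POD = "within a pod"
    SAME_RACK = "same rack as nodes"
    NONE = "no constraints"

@dataclass(frozen=True)
class Location:
    pod: int   # hypothetical pod index
    rack: int  # hypothetical rack index within the pod

def allowed_rabbits(locality, job_nodes, node_loc, rabbit_loc):
    """Filter rabbits according to the job's chosen locality."""
    if locality is Locality.NONE:
        return sorted(rabbit_loc)
    if locality is Locality.WITHIN_POD:
        pods = {node_loc[n].pod for n in job_nodes}
        return sorted(r for r, loc in rabbit_loc.items() if loc.pod in pods)
    # SAME_RACK: a rack is identified by its (pod, rack) pair.
    racks = {(node_loc[n].pod, node_loc[n].rack) for n in job_nodes}
    return sorted(r for r, loc in rabbit_loc.items() if (loc.pod, loc.rack) in racks)
```

Keeping the constraint as a filter means it can be combined with any ranking step, so users who don't care about locality simply pass `Locality.NONE`.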

Prototype Rabbit Load Balancing PR: https://github.com/flux-framework/flux-sched/pull/812

Aug 23 '22 20:08 jameshcorbett
