
Customizable Autoscaler

Open hzy46 opened this issue 4 years ago • 1 comments

Motivation

When PAI is deployed in the cloud, admins may want to stop idle nodes to save money. When a new job is submitted, the stopped nodes can be started again so that the job fits.

This feature is usually called an "autoscaler" and was previously implemented in #4735. However, #4735 only works on AKS. We can design an extensible autoscaler framework that works in different cloud environments, e.g. Azure Virtual Machine Scale Sets or other cloud providers.

hzy46 • Apr 02 '21 09:04

This proposal differs from other low-level autoscaling services in a few ways:

  • More customizable. Users can easily customize how OpenPAI, as an AI workload platform, decides when to scale and which worker nodes to scale. Admins can write custom code for trigger conditions based on waiting jobs, virtual cluster utilization, and other high-level, end-to-end metrics.
  • A small piece of code that can easily support multiple types of cloud infrastructure, including hybrid clouds.

mydmdm • Apr 02 '21 09:04
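
The two ideas in this thread (a provider-agnostic framework and admin-defined trigger conditions) could be sketched roughly as below. This is only an illustration under stated assumptions, not OpenPAI's actual design or the code from #4735: `ClusterState`, `CloudProvider`, `InMemoryProvider`, and `Autoscaler` are hypothetical names, and a real provider would call the cloud vendor's API instead of recording node names in a list.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ClusterState:
    """Hypothetical snapshot of the high-level metrics a trigger may inspect."""
    waiting_jobs: int
    idle_nodes: List[str] = field(default_factory=list)
    stopped_nodes: List[str] = field(default_factory=list)


class CloudProvider(ABC):
    """Hypothetical extension point: one subclass per cloud environment."""

    @abstractmethod
    def start_nodes(self, nodes: List[str]) -> None:
        ...

    @abstractmethod
    def stop_nodes(self, nodes: List[str]) -> None:
        ...


class InMemoryProvider(CloudProvider):
    """Stand-in provider that only records requests (no real cloud calls)."""

    def __init__(self) -> None:
        self.started: List[str] = []
        self.stopped: List[str] = []

    def start_nodes(self, nodes: List[str]) -> None:
        self.started.extend(nodes)

    def stop_nodes(self, nodes: List[str]) -> None:
        self.stopped.extend(nodes)


# A trigger is any admin-supplied predicate over the cluster state.
Trigger = Callable[[ClusterState], bool]


class Autoscaler:
    """Combines a provider with custom scale-up and scale-down triggers."""

    def __init__(self, provider: CloudProvider,
                 scale_up_when: Trigger, scale_down_when: Trigger) -> None:
        self.provider = provider
        self.scale_up_when = scale_up_when
        self.scale_down_when = scale_down_when

    def step(self, state: ClusterState) -> str:
        """Run one decision round and return the action that was taken."""
        if self.scale_up_when(state) and state.stopped_nodes:
            self.provider.start_nodes(state.stopped_nodes)
            return "scale_up"
        if self.scale_down_when(state) and state.idle_nodes:
            self.provider.stop_nodes(state.idle_nodes)
            return "scale_down"
        return "noop"
```

In this sketch, supporting another cloud means adding one `CloudProvider` subclass, while the scaling policy stays in the two predicates, for example `Autoscaler(provider, lambda s: s.waiting_jobs > 0, lambda s: s.waiting_jobs == 0)`.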