Dynamic Job Parallelism and Resource Scaling Based on Backlog Metrics
What would you like to be added:
We would like to propose a new feature in Kueue that enables dynamic scaling of job parallelism and resource allocation (CPU, RAM, and pods) based on job backlog metrics and predefined formulas.
Idea: This feature would introduce a custom resource definition (CRD) that lets users define scaling formulas and thresholds, which dynamically adjust the maximum parallelism and resource limits, similar to KEDA or HPA. A generic approach could be to expose the /scale subresource, providing a uniform scaling interface.
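To make the idea concrete, below is a minimal sketch of what the Go API types for such a CRD could look like. All names here (BacklogScalingPolicy, its fields, and the rule shape) are illustrative assumptions for discussion, not an existing or proposed Kueue API.

```go
// Hypothetical API types for a backlog-driven scaling policy (illustrative only).
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// BacklogScalingPolicy binds a backlog metric and scaling rules to a ClusterQueue.
type BacklogScalingPolicy struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec BacklogScalingPolicySpec `json:"spec"`
}

// BacklogScalingPolicySpec selects the queue to watch and how to scale its quota.
type BacklogScalingPolicySpec struct {
	// TargetClusterQueue is the ClusterQueue whose flavor quotas are adjusted.
	TargetClusterQueue string `json:"targetClusterQueue"`

	// Metric selects the backlog signal, e.g. the number of pending workloads.
	Metric string `json:"metric"`

	// Rules map backlog thresholds to parallelism and resource ceilings;
	// the highest matching threshold wins.
	Rules []ScalingRule `json:"rules"`
}

// ScalingRule raises the ceilings once the backlog crosses a threshold.
type ScalingRule struct {
	BacklogThreshold int32             `json:"backlogThreshold"`
	MaxParallelism   int32             `json:"maxParallelism"`
	ResourceLimits   map[string]string `json:"resourceLimits,omitempty"` // e.g. cpu: "200", memory: "800Gi"
}
```

A controller watching these objects could then reconcile the target ClusterQueue (or a generic /scale subresource) whenever the backlog metric crosses a rule boundary.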
Why is this needed:
We currently process around 4.5 million jobs per day, so managing resource usage and costs is critical. We need a mechanism that can dynamically limit or expand the maximum parallelism of jobs based on real-time backlog conditions, so that jobs are processed efficiently without overcommitting resources or incurring unnecessary cost.
By introducing a formula-based approach to flavor resources, we can achieve a more granular and responsive system. For example, the system could increase the maximum CPU or RAM allocation as the admission backlog grows, minimizing delays during high-load periods while conserving resources during low-demand times. This functionality is crucial for maintaining both performance and cost-effectiveness in large-scale Kubernetes environments; a small sketch of such a formula follows below.
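As a rough illustration of the formula-based behavior described above, the snippet below evaluates a simple linear rule: the CPU quota grows with the number of pending workloads up to a ceiling. The function name, parameters, and constants are assumptions made up for this example, not part of Kueue.

```go
// Hypothetical evaluation of a linear backlog formula (illustrative only).
package main

import "fmt"

// scaledCPUQuota returns a CPU quota (in whole cores) that grows with the
// admission backlog: a base quota plus extra cores per pending workload,
// clamped at a configured ceiling.
func scaledCPUQuota(baseCores, coresPerPending, maxCores, pendingWorkloads int64) int64 {
	quota := baseCores + coresPerPending*pendingWorkloads
	if quota > maxCores {
		quota = maxCores
	}
	return quota
}

func main() {
	// With 120 pending workloads, the quota grows from 100 to 220 cores,
	// still below the 400-core ceiling.
	fmt.Println(scaledCPUQuota(100, 1, 400, 120))
}
```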
This enhancement requires the following artifacts:
- [ ] Design doc
- [ ] API change
- [ ] Docs update
The artifacts should be linked in subsequent comments.