Native support for preemption retries

Open jparraga-stackav opened this issue 8 months ago • 0 comments

Is your feature request related to a problem? Please describe.

Armada supports urgency-based and fair-share-based preemption, which is useful for prioritizing important workloads and enforcing a fair share of resources. One pain point with enabling preemption is that preempted workloads are not requeued or resubmitted. Currently, it is the user's responsibility to recognize that their workloads were preempted and to resubmit them.

Describe the solution you'd like

Ideally, Armada would have native support for retrying preempted workloads. Retries could be configured as global defaults or opted into at the individual job/gang level with annotations. When a job is preempted, it would return to a queued state until it can be scheduled again. Each preemption would be recorded as a preempted job run.
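
To make the proposed behavior concrete, here is a minimal sketch of the retry decision, not Armada's implementation. The annotation key `example.com/preemption-max-retries`, the `Job` type, the state names, and the `on_preempted` function are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict

# Hypothetical annotation key, invented for this sketch; not an Armada API.
RETRY_ANNOTATION = "example.com/preemption-max-retries"


class JobState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    PREEMPTED = "preempted"  # terminal: retries disabled or exhausted


@dataclass
class Job:
    job_id: str
    annotations: Dict[str, str] = field(default_factory=dict)
    state: JobState = JobState.QUEUED
    preempted_runs: int = 0  # count of runs that ended in preemption


def max_retries(job: Job, global_default: int) -> int:
    """A per-job annotation overrides the global default when present."""
    raw = job.annotations.get(RETRY_ANNOTATION)
    return int(raw) if raw is not None else global_default


def on_preempted(job: Job, global_default: int = 0) -> JobState:
    """Record the preempted run, then requeue if the retry budget allows."""
    job.preempted_runs += 1
    if job.preempted_runs <= max_retries(job, global_default):
        job.state = JobState.QUEUED
    else:
        job.state = JobState.PREEMPTED
    return job.state
```

With a global default of 0, jobs keep today's behavior unless they opt in through the annotation, which matches the "global default or per-job opt-in" idea above.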

Describe alternatives you've considered

We have prototyped an external entity that watches for preempted workloads and automatically resubmits them. This workflow feels clunky and computationally wasteful, because the entity must constantly either poll for preempted workloads or watch job sets. The APIs for these operations are split between Armada Lookout and Armada Server. This gets particularly complicated when a single jobset contains one or more jobs/gangs that should be retried.

Additionally, any system that submits to Armada has to be aware that its workloads may be resubmitted, which creates additional work for integrations built on top of Armada.
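
For reference, the external watcher described above has roughly the following shape. This is a simplified sketch rather than the actual prototype: `list_preempted` and `resubmit` are hypothetical callables standing in for whatever Lookout and Server calls a real integration would make, and no real Armada client API is shown.

```python
import time
from typing import Callable, Iterable, List, Set


def resubmit_preempted(
    list_preempted: Callable[[], Iterable[str]],
    resubmit: Callable[[str], str],
    handled: Set[str],
) -> List[str]:
    """Run one polling pass and return the IDs of newly submitted jobs.

    list_preempted and resubmit are hypothetical stand-ins for real API calls.
    """
    new_ids = []
    for job_id in list_preempted():
        if job_id in handled:
            continue  # already resubmitted on an earlier pass
        new_ids.append(resubmit(job_id))
        handled.add(job_id)
    return new_ids


def run_watcher(
    list_preempted: Callable[[], Iterable[str]],
    resubmit: Callable[[str], str],
    passes: int,
    interval_seconds: float = 30.0,
) -> List[str]:
    """Poll a fixed number of times, sleeping between passes."""
    handled: Set[str] = set()
    submitted: List[str] = []
    for i in range(passes):
        submitted.extend(resubmit_preempted(list_preempted, resubmit, handled))
        if i < passes - 1:
            time.sleep(interval_seconds)
    return submitted
```

Even in this reduced form, the watcher has to keep its own record of handled jobs, and each resubmission produces a new job ID that downstream integrations must track, which is the overhead native retries would remove.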

Additional context

We have prototyped support for this and I will include a pull request shortly.

jparraga-stackav, Apr 18 '25 23:04