
Add support for native preemption retries

Open jparraga-stackav opened this issue 8 months ago • 4 comments

What type of PR is this?

New feature

What this PR does / why we need it:

This pull request adds support for native Armada preemption retry handling. Retry handling can be configured at the platform level as a default in the scheduling config, as well as per job with two annotations:

  1. armadaproject.io/preemptionRetryCountMax (defaults to 0 for now, i.e. disabled if not configured)
  2. armadaproject.io/preemptionRetryEnabled
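
A minimal sketch of how these two annotations could be resolved against the platform default (the helper and parameter names are illustrative, not the PR's actual API):

```go
package example

import "strconv"

const (
	annotationPreemptionRetryEnabled  = "armadaproject.io/preemptionRetryEnabled"
	annotationPreemptionRetryCountMax = "armadaproject.io/preemptionRetryCountMax"
)

// resolveRetryCountMax returns the maximum number of preemption retries for a
// job, falling back to the platform default from the scheduling config when
// the annotations are absent; a result of 0 means retries are disabled.
func resolveRetryCountMax(annotations map[string]string, platformDefault uint) uint {
	if enabled, ok := annotations[annotationPreemptionRetryEnabled]; ok && enabled != "true" {
		// Retries explicitly disabled for this job.
		return 0
	}
	if v, ok := annotations[annotationPreemptionRetryCountMax]; ok {
		if n, err := strconv.ParseUint(v, 10, 32); err == nil {
			return uint(n)
		}
	}
	return platformDefault
}
```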

The scheduling algorithm has been modified so that preempted jobs are not failed if they are eligible for a retry. If a job is eligible to be retried, it is instead marked to be requeued.
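
For illustration, a hedged sketch of the requeue-instead-of-fail decision described above, using hypothetical job fields rather than the PR's real scheduler types:

```go
package example

// job carries only the fields needed for this sketch.
type job struct {
	preemptionRetryCountMax uint // resolved from annotations / scheduling config
	preemptionRetryCount    uint // retries already consumed by this job
}

// onPreempted decides what happens to a preempted job: while retries remain,
// it is marked for requeueing instead of being failed.
func onPreempted(j *job) (requeue bool) {
	if j.preemptionRetryCount < j.preemptionRetryCountMax {
		j.preemptionRetryCount++
		return true // requeue the job
	}
	return false // retries exhausted (or disabled): fail as before
}
```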

Unit tests are included. We've also tested this end to end in our development environment with jobs and gangs, covering both successful retries and exhausted retries.

[Screenshot: 2025-04-15, 7:09 PM]

Which issue(s) this PR fixes:

Fixes: https://github.com/armadaproject/armada/issues/4340

Special notes for your reviewer:

jparraga-stackav avatar Apr 19 '25 00:04 jparraga-stackav

Hi,

thanks for this, it looks great. One thing I think we should consider/discuss is whether we could add the preemption retry fields as first-class proto fields rather than relying on annotations. The reasoning here is that annotations are very easy to add at first, as they require no interface changes, but they can soon become quite hard to work with since everything is just a map[string]string.

Personally I'd be in favour of adding these fields to schedulerobjects.JobSchedulingInfo right now with the view of adding them to api.Job once the feature is stable.
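
For discussion, a rough sketch of what such first-class fields might look like on the generated Go struct; the field names are hypothetical and not part of Armada today:

```go
package example

// JobSchedulingInfo sketch: only the hypothetical new fields are shown;
// the real message has many more fields.
type JobSchedulingInfo struct {
	// PreemptionRetryEnabled turns native preemption retries on for the job.
	PreemptionRetryEnabled bool
	// PreemptionRetryCountMax caps how many preemption retries the job gets;
	// 0 disables retries.
	PreemptionRetryCountMax uint32
}
```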

One comment I

d80tb7 avatar Apr 29 '25 06:04 d80tb7

Personally I'd be in favour of adding these fields to schedulerobjects.JobSchedulingInfo right now with the view of adding them to api.Job once the feature is stable.

I looked into this, but it seems a bit difficult to keep it as an annotation rather than a first-class citizen.

I'm not able to create the scheduling info from an armadaevents.SubmitJob since the annotations aren't there. I think there would also need to be some more thinking about how the global preemption retry config integrates into this. We might need to move that into the scheduler ingester to handle it more elegantly, unless we want to reference the global preemption retry config all the time.

jparraga-stackav avatar Apr 30 '25 04:04 jparraga-stackav

@jparraga-stackav Thanks for the work on this. I was wondering if we could also handle imminent node shutdown scenarios. In addition to the current retry logic for pod evictions, it might be useful to check for pods terminated with the reason:

Pod was terminated in response to imminent node shutdown.

We could incorporate a simple check on pod.Status.Message to capture these cases. This would help cover both graceful shutdowns and node preemptions for various reasons.
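
For example, a hedged sketch of that check using the standard Kubernetes client types (the function name is illustrative):

```go
package example

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// isImminentNodeShutdown reports whether the pod's status message indicates it
// was terminated because its node is shutting down.
func isImminentNodeShutdown(pod *corev1.Pod) bool {
	return strings.Contains(pod.Status.Message, "imminent node shutdown")
}
```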

Cinojose avatar May 09 '25 03:05 Cinojose

@jparraga-stackav Thanks for the work on this. I was wondering if we could also handle imminent node shutdown scenarios. In addition to the current retry logic for pod evictions, it might be useful to check for pods terminated with the reason:

Pod was terminated in response to imminent node shutdown.

We could incorporate a simple check on pod.Status.Message to capture these cases. This would help cover both graceful shutdowns and node preemptions for various reasons.

I will likely look into this as a follow-up improvement.

jparraga-stackav avatar May 22 '25 18:05 jparraga-stackav