Add a timeout flag to agents
We are seeing jobs occasionally hang, and as a result our stack does not get scaled down overnight.
We can add timeout_in_minutes to individual steps, but that is opt-in, which is not ideal. What we really want is to impose a hard global limit of the form "no job can run for more than XX minutes". Enforcing this at the agent level probably makes more sense, since we could then use different queues for different hard limits if we wanted to.
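To make the request a bit more concrete, here is a rough sketch of the behaviour I have in mind, written in Python for illustration only. It is not how the agent works today and not a proposal for the actual implementation. The names `run_job_with_hard_limit` and `max_minutes` are made up. It also only kills the direct child process, so a real version would need to deal with the whole process tree.

```python
import subprocess
import sys


def run_job_with_hard_limit(command, max_minutes):
    """Run `command` and kill it if it runs longer than `max_minutes`.

    Returns (exit_code, timed_out). Hypothetical helper: the name and the
    return shape are assumptions made for this sketch.
    """
    try:
        completed = subprocess.run(command, timeout=max_minutes * 60)
        return completed.returncode, False
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before raising,
        # so nothing is left to clean up for a single process.
        return None, True


if __name__ == "__main__":
    # A stand-in for a hung job: sleeps far longer than the limit allows.
    hung_job = [sys.executable, "-c", "import time; time.sleep(5)"]
    exit_code, timed_out = run_job_with_hard_limit(hung_job, max_minutes=0.01)
    print("timed out:", timed_out, "exit code:", exit_code)
```

The point is that the limit lives with whatever launches the job rather than in the pipeline definition, so every job picked up by that agent (or queue) gets the same ceiling without anyone having to opt in.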
I see there was a PR trying to implement this (#788) but unfortunately it was never finished.
I'll check in on where we got to with #788!
Just adding my 2 cents.
I'd like to configure a global timeout (e.g. 12h). The intention is to automatically kill long-running jobs that are unresponsive or dead and will never complete. This is something I'd like to configure just once per agent pool, not per pipeline/stage.
For context, I sometimes discover jobs that have been running for many days or weeks. It would be nice if these were killed automatically after a sensible timeout. (I recently killed 20+ jobs that had been running for more than 2 weeks!)