Proposal for --default-timeout-in-minutes flag

Open pda opened this issue 7 years ago • 17 comments

Introduces a --default-timeout-in-minutes flag (or BUILDKITE_DEFAULT_TIMEOUT_IN_MINUTES environment variable) to buildkite-agent start, which is passed to the Register Agent API via an as-yet unsupported default_timeout_in_minutes attribute.
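As a rough sketch of how the flag and environment variable could be wired into buildkite-agent start, assuming a urfave/cli v1 style flag definition; everything beyond the flag and variable names proposed here is illustrative:

```go
package main

import (
	"fmt"
	"os"

	"github.com/urfave/cli"
)

func main() {
	app := cli.NewApp()
	app.Name = "buildkite-agent"
	app.Commands = []cli.Command{{
		Name: "start",
		Flags: []cli.Flag{
			cli.IntFlag{
				Name:   "default-timeout-in-minutes",
				Value:  0, // 0 = no agent-level default
				Usage:  "Timeout applied to jobs that don't set timeout_in_minutes themselves",
				EnvVar: "BUILDKITE_DEFAULT_TIMEOUT_IN_MINUTES",
			},
		},
		Action: func(c *cli.Context) error {
			// In the real agent this value would be included in the Register
			// Agent API payload as the proposed default_timeout_in_minutes.
			fmt.Println("default timeout (minutes):", c.Int("default-timeout-in-minutes"))
			return nil
		},
	}}
	_ = app.Run(os.Args)
}
```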

For this PR to be useful, the Buildkite backend would need to support this new agent attribute: when a job is assigned to an agent and that job does not have a timeout_in_minutes attribute but the agent does have a default_timeout_in_minutes, the latter value would be used as the timeout.
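The fallback logic the backend would need is small; a hypothetical sketch of the resolution order (the Go types and field names below are purely illustrative, only the two attribute names come from this proposal):

```go
package main

import "fmt"

// Illustrative types only, not Buildkite's actual backend code.
type Job struct {
	TimeoutInMinutes int // 0 when the step did not set timeout_in_minutes
}

type Agent struct {
	DefaultTimeoutInMinutes int // 0 when --default-timeout-in-minutes was not given
}

// effectiveTimeout returns the timeout to enforce for a job assigned to an
// agent, or 0 meaning "no timeout".
func effectiveTimeout(job Job, agent Agent) int {
	if job.TimeoutInMinutes > 0 {
		return job.TimeoutInMinutes // a step-level timeout always wins
	}
	return agent.DefaultTimeoutInMinutes // otherwise fall back to the agent default
}

func main() {
	fmt.Println(effectiveTimeout(Job{}, Agent{DefaultTimeoutInMinutes: 30}))                     // 30
	fmt.Println(effectiveTimeout(Job{TimeoutInMinutes: 10}, Agent{DefaultTimeoutInMinutes: 30})) // 10
}
```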

This would allow a timeout to be specified at the agent level, where an agent may be handling jobs from a diverse range of codebases and pipelines that cannot be expected to always specify a timeout_in_minutes on every command step. This option will make agents more resilient to frozen/stalled jobs, increasing agent availability, decreasing wait times, and allowing https://github.com/buildkite/elastic-ci-stack-for-aws to scale down after a job freezes.

Some more context / discussion in https://github.com/buildkite/feedback/issues/170#issuecomment-400535789

  • [x] Flag in buildkite-agent
  • [ ] Support in Buildkite agent API / backend.

pda avatar Jun 27 '18 11:06 pda

Thanks @pda, this is interesting.

One thing I'd considered before is allowing pipeline mutations as part of pipeline upload, e.g. buildkite-agent pipeline upload --default-timeout-in-minutes 10, possibly also --enforce-timeout-in-minutes 10.

Beyond the fact that it doesn't require any changes on the server side, I like that this enforces the constraint at a pipeline level vs. an agent level, but that might not suit what you are trying to do.
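Very roughly, that upload-time mutation could look something like this (the map-based step representation is a simplification for illustration, not the agent's real pipeline types):

```go
package main

import "fmt"

// applyTimeouts sketches --default-timeout-in-minutes (and an optional
// --enforce-timeout-in-minutes cap) applied during pipeline upload.
func applyTimeouts(steps []map[string]interface{}, defaultMin, enforceMin int) {
	for _, step := range steps {
		timeout, ok := step["timeout_in_minutes"].(int)
		if !ok || timeout <= 0 {
			// Step did not set a timeout: fill in the default.
			timeout = defaultMin
		}
		if enforceMin > 0 && (timeout == 0 || timeout > enforceMin) {
			// Step asked for more (or nothing): clamp to the enforced maximum.
			timeout = enforceMin
		}
		if timeout > 0 {
			step["timeout_in_minutes"] = timeout
		}
	}
}

func main() {
	steps := []map[string]interface{}{
		{"command": "make test"},
		{"command": "make soak", "timeout_in_minutes": 120},
	}
	applyTimeouts(steps, 10, 60)
	fmt.Println(steps) // first step gets 10, second is clamped to 60
}
```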

Thoughts?

lox avatar Jun 28 '18 13:06 lox

This change probably plays into a more general redesign of how timeouts should be handled.

Timeouts are managed by Buildkite at the moment. We have a cron job that essentially looks over all the "in-flight" jobs every minute and checks to see if they need to be timed out. When one does, we send the job/agent a "cancel" signal, at which point the agent will kill the current job.
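To give a rough sense of the shape of that sweep (this is just an illustration, not our actual backend code, and all the names are made up):

```go
package main

import (
	"fmt"
	"time"
)

// InFlightJob is an illustrative stand-in for a running job record.
type InFlightJob struct {
	ID               string
	StartedAt        time.Time
	TimeoutInMinutes int // 0 means no timeout configured for the job
}

// sweepTimedOutJobs checks each in-flight job against its timeout and asks
// for a cancel when the deadline has passed; the agent then kills the job.
func sweepTimedOutJobs(jobs []InFlightJob, cancel func(jobID string)) {
	now := time.Now()
	for _, job := range jobs {
		if job.TimeoutInMinutes <= 0 {
			continue // nothing to enforce
		}
		deadline := job.StartedAt.Add(time.Duration(job.TimeoutInMinutes) * time.Minute)
		if now.After(deadline) {
			cancel(job.ID)
		}
	}
}

func main() {
	jobs := []InFlightJob{
		{ID: "job-1", StartedAt: time.Now().Add(-2 * time.Hour), TimeoutInMinutes: 60},
		{ID: "job-2", StartedAt: time.Now(), TimeoutInMinutes: 60},
	}
	// In the real system a sweep like this runs roughly once a minute.
	sweepTimedOutJobs(jobs, func(jobID string) { fmt.Println("cancel", jobID) })
	// Prints: cancel job-1
}
```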

I kinda like the idea of different reasons for a timeout:

  • Agent timeout: Set at the agent level and enforced for every job the agent runs
  • Job timeout: When a timeout is set as a job property
  • Pipeline/organization timeout: Default set at that level

Perhaps instead of the timeout data flowing through BK, the agent level timeout should be handled directly in the agent? The agent spins up a timer thingy and kills the job when it's gone on too long. That timeout would be the "last defence" if something has just gone on too long.

We could perhaps then send an "agent_timed_out" type boolean along with the Finish API call to let BK know what happened.
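As a rough sketch of what that could look like in the agent (the finish helper and its fields are made up for illustration, not the agent's real job runner):

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// finish stands in for the Finish API call, carrying the proposed
// agent_timed_out boolean so BK knows why the job ended.
func finish(err error, agentTimedOut bool) {
	fmt.Printf("finish: err=%v agent_timed_out=%v\n", err, agentTimedOut)
}

// runWithAgentTimeout runs a job command with a "last defence" timer: if the
// command outlives the timeout, the agent kills it and reports the timeout.
func runWithAgentTimeout(cmd *exec.Cmd, timeout time.Duration) {
	if err := cmd.Start(); err != nil {
		finish(err, false)
		return
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		finish(err, false) // the job finished (or failed) on its own
	case <-time.After(timeout):
		_ = cmd.Process.Kill() // the job went on too long: kill it
		<-done
		finish(fmt.Errorf("killed by agent after %s", timeout), true)
	}
}

func main() {
	// A job that would run for 10 minutes, capped at a 2-second "timeout"
	// just so the example finishes quickly.
	runWithAgentTimeout(exec.Command("sleep", "600"), 2*time.Second)
}
```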

Just an idea! Maybe just setting the default like you suggested @pda is the better option. Would love to chat more about it!

keithpitt avatar Jul 03 '18 02:07 keithpitt

One thing I'd considered before is allowing pipeline mutations as part of pipeline

@lox Yeah I wondered about that, but it solves it on the codebase/pipeline dimension. Our use-case is definitely on the agent dimension.

Pipeline/organization timeout: Default set at that level

An organization-level timeout could do the job, but it doesn't feel like the right place to define timeouts in a complex organization with different generations of agents/pipelines and different CI workloads.

Agent timeout: Set at the agent level and enforced for every job the agent runs

Perhaps instead of the timeout data flowing through BK, the agent level timeout should be handled directly in the agent? The agent spins up a timer thingy and kills the job when it's gone on too long. That timeout would be the "last defence" if something has just gone on too long. We could perhaps then send an "agent_timed_out" type boolean along with the Finish API call to let BK know what happened.

@keithpitt I quite like this approach, and considered implementing it in this PR, but wasn't sure how you'd want to model the agent telling the server that it had timed out a job. I also wonder what happens if the agent fails to send the Finish API call with agent_timed_out — but I guess that's a scenario that already exists, where Buildkite thinks a job is running but the agent has died.

I'd be happy to update this PR to implement the agent timeout client-side and send the agent_timed_out boolean you proposed as part of the Finish API call.

pda avatar Jul 12 '18 02:07 pda

Doh, I totally missed your response here @pda!

I think it's totally reasonable that something can happen within the agent that would cause the job to not finish cleanly, and it's probably a good idea that the agent lets us know about it.

Here's what I think we can do:

  • Agent has the "last defence" timer
  • It kills the job after the timer
  • When it calls Finish, it sends us a "reason" for how the job came to be finished.

Here are some reasons I can think of:

  • JOB_FINISHED
  • AGENT_TIMEOUT
  • MACHINE_SHUTDOWN

I kinda think we'd send the reason all the time.

I'm not sure if I like the word reason, but it's essentially an extra attribute in the Finish call that lets us know how the job came to be finished. Was it clean? Was it forced? etc.
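Something like this, roughly (the Go type and JSON field names are just a sketch of the idea; only the three reason values above are the actual suggestion):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// FinishReason sketches the proposed "reason" attribute on the Finish call.
type FinishReason string

const (
	ReasonJobFinished     FinishReason = "JOB_FINISHED"
	ReasonAgentTimeout    FinishReason = "AGENT_TIMEOUT"
	ReasonMachineShutdown FinishReason = "MACHINE_SHUTDOWN"
)

// FinishRequest is an illustrative payload shape, not the real API.
type FinishRequest struct {
	ExitStatus int          `json:"exit_status"`
	Reason     FinishReason `json:"reason"` // sent on every Finish call
}

func main() {
	body, _ := json.Marshal(FinishRequest{ExitStatus: -1, Reason: ReasonAgentTimeout})
	fmt.Println(string(body)) // {"exit_status":-1,"reason":"AGENT_TIMEOUT"}
}
```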

@pda happy to accept a PR for this! You can send us the finish attribute now (BK will just drop it on the floor), but once it's merged I can start tracking it and showing it in the UI.

keithpitt avatar Jul 19 '18 01:07 keithpitt

I'm keen to help get this built @pda, any interest still?

lox avatar Dec 16 '18 02:12 lox

I think there's still interest. I haven't had any time to think about it lately, though.

pda avatar Feb 07 '19 23:02 pda

The use case for us here is a safeguard against hijacking of public pipelines. Setting an enforced timeout on public pipelines can limit the impact of things like "bitcoin takeovers".

tduffield avatar Jan 23 '20 21:01 tduffield

Any progress on this one? We are dying for this feature.

zhaomengru2015 avatar Dec 28 '20 06:12 zhaomengru2015

We urgently need this feature

Xyz426 avatar Dec 28 '20 11:12 Xyz426

It's already been two years; any update on this?

tata9001 avatar Dec 29 '20 05:12 tata9001

Any update on this? We're also very keen to use this feature. Long-lived agents are causing a bit of grief.

liam-bailey avatar Jan 13 '21 07:01 liam-bailey

/sub

dhalperi avatar Feb 11 '21 00:02 dhalperi

Sorry to add yet another '+1' type of comment, but seeing how this is two years old, I felt perhaps it would be necessary. I'd love to have this feature as well.

goodspark avatar Feb 22 '21 20:02 goodspark

Keen to make use of this. We'd like a way to incentivise setting a timeout on individual steps; as it stands, there is no incentive to do that by default. Our plan is to set a low threshold by default at the agent level, and steps that require longer can explicitly ask for it at the individual step level.

tigris avatar Sep 29 '21 09:09 tigris

Any interest in picking this up again @pda? Have just run into needing default timeouts.

lox avatar Feb 15 '22 05:02 lox

Would love to see this one picked up as well because we are currently having to use lambdas and monitoring to achieve this.

emmy-byrne-segment avatar Mar 31 '22 18:03 emmy-byrne-segment

Just found several builds that were running for 18 hours, and one which had apparently been running for 7 weeks! Would love a way to set a blanket limit quickly.

bethesque avatar Aug 30 '22 01:08 bethesque