Add timeout configuration on atom

Open wjdhollow opened this issue 8 years ago • 3 comments

A build could potentially take a long time, but ClusterRunner is made for breaking up a large task into small chunks. That being said, the expectation is that a subjob should finish within a reasonable amount of time. If it does not, nodes will not be deallocated in a timely manner, if at all, which can severely impact the cluster. For example, there have been occurrences where a single subjob has been stuck indefinitely.

I think that subjob durations should be restricted to a finite time limit.

wjdhollow avatar Aug 26 '16 17:08 wjdhollow

Doesn't seem like a bad idea to add an "atom_timeout" field or something of the sort to clusterrunner.yaml.
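
To make the idea concrete, here is a rough sketch of what enforcing such a timeout around a single atom's shell command could look like. This is not ClusterRunner's actual executor code: `run_atom` and its `atom_timeout` argument are hypothetical names, and the process-group handling assumes a POSIX host.

```python
import os
import signal
import subprocess


def run_atom(command, atom_timeout):
    """Run one atom's shell command (hypothetical helper, not ClusterRunner's API).

    Returns a tuple (exit_code, timed_out).
    """
    # start_new_session=True puts the shell and anything it spawns into a
    # fresh process group, so the whole tree can be killed at once (POSIX only).
    proc = subprocess.Popen(command, shell=True, start_new_session=True)
    try:
        return proc.wait(timeout=atom_timeout), False
    except subprocess.TimeoutExpired:
        # The child leads its own session, so its pid is also its group id.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()  # reap the killed process
        return proc.returncode, True
```

Something like `run_atom('sleep 30', atom_timeout=0.5)` would come back with `timed_out` set to `True` instead of holding the node for the full 30 seconds. Killing the process group rather than just the shell matters because an atom often spawns child processes that would otherwise keep running.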

tjlee0909 avatar Aug 26 '16 17:08 tjlee0909

Fixing build cancellation would solve part of this problem (the client can cancel a job when it has taken longer than it wants to wait). Currently cancellation will not interrupt in-progress atoms.

I agree with TJ that if we added this, it should be an atom_timeout rather than a subjob_timeout. Subjobs are an intermediate internal batching that users don't have control over. Users have control over their atoms.

If we were to add a default atom timeout, it should be very large and configurable in the clusterrunner.conf.
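
A minimal sketch of how that lookup might work, assuming the two config files have already been parsed into plain dicts. The key names `atom_timeout` and `default_atom_timeout`, the helper name, and the 24-hour fallback are all made up for illustration; none of them exist in ClusterRunner today.

```python
# Hypothetical fallback used when neither config sets a value: "very large".
FALLBACK_ATOM_TIMEOUT_SECONDS = 24 * 60 * 60


def resolve_atom_timeout(job_config, app_config):
    """Pick the timeout (in seconds) to apply to each atom of a job.

    job_config: dict parsed from the project's clusterrunner.yaml job section.
    app_config: dict parsed from clusterrunner.conf.
    Both key names below are assumptions, not existing settings.
    """
    timeout = job_config.get('atom_timeout')
    if timeout is None:
        # No per-job value, so fall back to the cluster-wide default.
        timeout = app_config.get('default_atom_timeout',
                                 FALLBACK_ATOM_TIMEOUT_SECONDS)
    timeout = float(timeout)
    if timeout <= 0:
        raise ValueError('atom timeout must be a positive number of seconds')
    return timeout
```

The per-job value wins when present, so users who know their atoms are short can tighten the limit without touching the cluster-wide setting.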

josephharrington avatar Aug 26 '16 19:08 josephharrington

Atom timeout makes sense.

I filed this issue because it looked like a bunch of subjobs were stuck on the dashboard. After talking with Joey, it sounds like some may be false positives caused by #287.

wjdhollow avatar Aug 26 '16 19:08 wjdhollow