Add user-settable server-side `Task` restart policy, per-`AlchemicalNetwork`

Open dotsdl opened this issue 1 year ago • 4 comments

Because `Task`s are executed by compute services running on disparate resources, random errors will likely affect some fraction of them, through a small number of distinct failure modes. Currently, users must examine error tracebacks themselves, then set the `Task`s they wish to rerun from `error` back to `waiting` status. This gets tedious and requires many users to babysit their `Task`s, even when many of these would complete successfully on rerun.

Instead, we would like users to be able to set a `TaskRestartPolicy` on an `AlchemicalNetwork`. The policy would encode a list of entries, each giving:

  • regex pattern of the traceback output to match
  • max number of retries to perform for matching errors
  • other options, such as how strongly to avoid a compute service with the same identifying information as one that previously failed on the Task.
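
To make the proposal concrete, here is a minimal sketch of how such a policy might be represented and evaluated. Only the name `TaskRestartPolicy` comes from this issue; `RestartPattern`, `should_restart`, the field names, and the example regexes are hypothetical and are not part of any existing alchemiscale API. The "other options" bullet (compute service avoidance) is left out for brevity.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RestartPattern:
    """Hypothetical policy entry: a traceback regex and its retry budget."""

    pattern: str      # regex searched for in the traceback text
    max_retries: int  # retries allowed for errors matching this pattern


@dataclass
class TaskRestartPolicy:
    """Hypothetical policy: an ordered list of entries for one network."""

    patterns: list[RestartPattern] = field(default_factory=list)

    def should_restart(self, traceback_text: str, retries_so_far: int) -> bool:
        """Return True if the first matching entry still has retries left."""
        for entry in self.patterns:
            if re.search(entry.pattern, traceback_text):
                return retries_so_far < entry.max_retries
        # Errors that match no entry are left for the user to inspect.
        return False


# Example policy; the patterns are illustrative only.
policy = TaskRestartPolicy(
    [
        RestartPattern(r"CUDA_ERROR_\w+", max_retries=3),
        RestartPattern(r"ConnectionResetError", max_retries=5),
    ]
)

# A server-side process could apply this check to each errored Task and
# move it back to waiting status when the result is True.
retry = policy.should_restart("RuntimeError: CUDA_ERROR_LAUNCH_FAILED", 1)
```

In this sketch the first matching entry decides the outcome, so more specific patterns would need to be listed before more general ones; a real design might resolve overlapping patterns differently.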

Related to #258. Likely requires #109 to be implemented in some form to periodically apply server-side restarts given the policies set.

dotsdl avatar Jun 05 '24 04:06 dotsdl

Thanks to @JenkeScheen for raising this issue in today's user group meeting!

dotsdl avatar Jun 05 '24 04:06 dotsdl

@ianmkenney would you be willing to begin work on this as a head start on the next major milestone? This is of high interest to users, so prioritizing it makes sense for us.

dotsdl avatar Jun 13 '24 15:06 dotsdl

@ianmkenney can you link your design doc here?

dotsdl avatar Jul 12 '24 04:07 dotsdl

Here is the link to the design doc.

ianmkenney avatar Jul 12 '24 16:07 ianmkenney