alchemiscale Add user-settable server-side `Task` restart policy, per-`AlchemicalNetwork`

Because many Tasks are executed on compute services running on disparate resources, random errors will likely affect some fraction of them, with a small, enumerable set of failure modes. Currently, users must examine error tracebacks themselves, then set the Tasks they wish to rerun from error back to waiting status. This gets tedious and requires many users to babysit their Tasks, even though many of them would complete successfully on rerun.

Instead of this, we would like to empower users with the ability to set a TaskRestartPolicy on an AlchemicalNetwork, which would encode a list giving:

- a regex pattern to match against the traceback output
- the max number of retries to perform for matching errors
- other options, such as how strongly to avoid a compute service with the same identifying information as one that previously failed on the Task
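
To make the idea concrete, here is a minimal sketch of what such a policy could look like. None of this is existing alchemiscale API: `RestartPattern`, the fields on `TaskRestartPolicy`, and `should_restart` are hypothetical names chosen for illustration. The sketch assumes the first matching pattern decides the outcome, and it leaves out the compute-service avoidance option.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class RestartPattern:
    """Hypothetical policy entry: a traceback regex plus a retry cap."""

    pattern: str      # regex searched for in the traceback text
    max_retries: int  # maximum restarts allowed for errors matching `pattern`


@dataclass
class TaskRestartPolicy:
    """Hypothetical per-AlchemicalNetwork policy; not actual alchemiscale API."""

    patterns: List[RestartPattern] = field(default_factory=list)

    def should_restart(self, traceback: str, restarts_so_far: int) -> bool:
        """Return True if an errored Task should go back to waiting.

        Assumption: the first pattern that matches the traceback decides;
        a traceback matching no pattern is never restarted automatically.
        """
        for entry in self.patterns:
            if re.search(entry.pattern, traceback):
                return restarts_so_far < entry.max_retries
        return False


# Example: retry out-of-memory errors up to 3 times, leave everything else alone.
policy = TaskRestartPolicy([RestartPattern(r"CUDA.*out of memory", 3)])
oom_traceback = "RuntimeError: CUDA error: out of memory"
restart_first_failure = policy.should_restart(oom_traceback, 0)
restart_after_three = policy.should_restart(oom_traceback, 3)
restart_unmatched = policy.should_restart("ValueError: bad input", 0)
```

A server-side process could evaluate something like `should_restart` for each errored Task and only leave the Task in error once the retry cap is reached or no pattern matches.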

Related to #258. Likely requires #109 to be implemented in some form to periodically apply server-side restarts given the policies set.

Jun 05 '24 04:06 dotsdl

Thanks to @JenkeScheen for raising this issue in today's user group meeting!

Jun 05 '24 04:06 dotsdl

@ianmkenney would you be willing to begin work on this as a head start on the next major milestone? This is of high interest to users, so prioritizing it makes sense for us.

Jun 13 '24 15:06 dotsdl

@ianmkenney can you link your design doc here?

Jul 12 '24 04:07 dotsdl

Here is the link to the design doc.

Jul 12 '24 16:07 ianmkenney