
Scheduler plugins need to be able to distinguish recoverable from irrecoverable errors

Open sphuber opened this issue 5 years ago • 7 comments

The exponential back-off retry mechanism of the engine for calculation jobs has greatly improved the robustness with respect to temporary, recoverable problems such as a loss of internet connection or clusters being down. However, since it is applied indiscriminately to all errors during transport tasks, tasks are sometimes restarted even though they can never succeed. Take, for example, the case where wrong scheduler parameters are passed as inputs: no matter how often the task is restarted, it will always fail, so retrying is futile.

To solve this, the interface between the scheduler plugins and the engine needs to be improved, such that a distinction can be made between recoverable and irrecoverable errors. The default assumption of the engine will remain that an error is recoverable with a retry, but with this new option a scheduler plugin can instruct the engine not to bother retrying when it encounters certain errors. Since these errors are scheduler and transport specific, their respective plugins should be responsible for this.
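A minimal sketch of what such a distinction could look like; the exception name and the engine-side handling shown in the comments are hypothetical, not existing aiida-core API:

from aiida.schedulers.scheduler import SchedulerError

class IrrecoverableSchedulerError(SchedulerError):
    """Raised by a scheduler plugin when retrying the task can never succeed."""

# The engine's exponential back-off wrapper could then skip the retry loop
# for this class of errors, for example:
#
#     try:
#         do_submit(node, transport)
#     except IrrecoverableSchedulerError:
#         raise  # fail the calculation immediately, do not retry
#     except SchedulerError:
#         ...    # assumed recoverable: apply exponential back-off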

sphuber avatar Jun 03 '19 09:06 sphuber

Just to give an example of one specific case:

File "/home/efl/work/devel/aiida_core/aiida/schedulers/plugins/slurm.py", line 431, in _parse_submit_output
    "stdout={}\nstderr={}".format(retval, stdout, stderr))
aiida.schedulers.scheduler.SchedulerError: Error during submission, retval=1
stdout=
stderr=sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

06/03/2019 11:01:38 AM <23614> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [WARNING] maximum attempts 5 of calling do_submit, exceeded
06/03/2019 11:01:39 AM <23614> aiida.engine.processes.calcjobs.tasks: [WARNING] submitting CalcJob<832> failed

This should stop at the first submission attempt and return some kind of usable exit code that is passed along.
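A rough sketch of how the SLURM plugin's _parse_submit_output could flag this particular stderr message as irrecoverable, reusing the hypothetical IrrecoverableSchedulerError from above; the method body is illustrative and differs from the real plugin:

def _parse_submit_output(self, retval, stdout, stderr):
    # Sketch only: detect a submission error that no amount of retrying can fix.
    if retval != 0:
        if 'Invalid account or account/partition combination specified' in stderr:
            raise IrrecoverableSchedulerError(
                'invalid account or account/partition: fix the scheduler options and resubmit')
        raise SchedulerError(
            'Error during submission, retval={}\nstdout={}\nstderr={}'.format(retval, stdout, stderr))
    return stdout.strip()  # placeholder: the real plugin extracts the job id here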

espenfl avatar Jun 03 '19 09:06 espenfl

Also, when we fix this, we will most likely have to write a stdout and stderr scheduler parser of some sort. Since we need this for other purposes, it would be nice to make it modular. For error handlers and cluster monitoring, it would be useful to be able to request the status of the stdout and stderr from the scheduler at any given time (meaning we should leave open the possibility for this request to go through the transport and for the parsing to happen on the cluster side).

In fact, for Slurm there is an API, so no parsing is really needed; there is even PySlurm. This is probably not true for schedulers in general, but we should focus on at least having full support for PBS and Slurm.
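A possible shape for such a modular parser, with illustrative class names and message strings that are not taken from aiida-core:

import abc

class SchedulerOutputParser(abc.ABC):
    """Parse stdout/stderr produced by scheduler commands or by a running job."""

    @abc.abstractmethod
    def parse(self, stdout, stderr):
        """Return a dict of detected conditions, e.g. {'invalid_account': True}."""

class SlurmOutputParser(SchedulerOutputParser):

    def parse(self, stdout, stderr):
        detected = {}
        if 'Invalid account or account/partition combination' in stderr:
            detected['invalid_account'] = True
        if 'CANCELLED' in stderr and 'DUE TO TIME LIMIT' in stderr:
            detected['walltime_exceeded'] = True
        return detected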

espenfl avatar Jun 03 '19 09:06 espenfl

Regarding your second comment, we have this open issue. Note, however, that part of the required code for that issue is indeed shared with this one, i.e. a parser for the stderr and stdout content returned by scheduler commands. Acting on those things while the calculation job is running, however, is quite a different problem, and currently I am not sure how to implement this in the engine.

Essentially, when the calculation job is in the stage where it is querying the scheduler for a status update (the UPDATE transport task), it should also request the content of the stderr and stdout written on the cluster for that job, perform some parsing, and optionally kill the job. This means there needs to be some hook on the CalcJob class that can implement this logic. If this logic is implemented (and not optionally disabled with some input setting on the CalcJob), then the engine can also call this transport task in addition to the UPDATE one. I am not sure whether this should become a "second" transport task that runs at the same time as the UPDATE one, since that might be complicated from the engine's perspective, or whether the UPDATE task needs to be dynamically augmented with an additional step. To be discussed.
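One possible shape of such a hook, with a made-up method name and calling convention, purely to make the idea concrete:

from aiida.engine import CalcJob

class SomeCalculation(CalcJob):

    def inspect_scheduler_output(self, stdout, stderr):
        """Hypothetical hook called by the engine during the UPDATE task with the
        job's current stdout/stderr as retrieved over the transport.

        Return True to let the job keep running, False to ask the engine to kill it.
        """
        if 'NaN encountered' in stdout:
            return False
        return True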

sphuber avatar Jun 03 '19 09:06 sphuber

In fact, with this change: https://github.com/aiidateam/aiida-core/issues/2955#issuecomment-498181856 the calculation now stays in the Waiting state and finally pauses. Before, it failed after 5 tries, but now with the pause it just hangs, even though no recovery is possible (given the info at hand). With the pause mechanism it becomes even more important that we address #1925, as this error happens all the time if you switch between different systems and accounts. Maybe we should consider putting this in before the release is published, as I expect this issue to appear quite frequently.

espenfl avatar Oct 02 '19 09:10 espenfl

This is a duplicate of #2226, which was closed, but we can continue the discussion in this issue.

sphuber avatar Oct 07 '20 08:10 sphuber

+1 for adding parsing to the SLURM plugin to detect this error and return a corresponding exit code with instructions to provide a metadata.options.queue_name to fix it.
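For illustration, such an exit code could be declared on the process spec roughly like this; the number, label and message are made up here, and the wiring from the SLURM plugin to this exit code is not shown:

from aiida.engine import CalcJob

class SomeCalculation(CalcJob):

    @classmethod
    def define(cls, spec):
        super().define(spec)
        spec.exit_code(
            105, 'ERROR_SCHEDULER_INVALID_ACCOUNT',
            message='Job submission failed with an invalid account or account/partition '
                    'combination; set metadata.options.queue_name (and account) and resubmit.')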

ltalirz avatar Mar 16 '22 10:03 ltalirz

+1 Just ran into this again

ltalirz avatar Jul 29 '22 16:07 ltalirz