aiida-core
Scheduler plugins need to be able to distinguish recoverable from irrecoverable errors
The exponential back-off retry mechanism of the engine for calculation jobs has greatly improved robustness with respect to temporary, recoverable problems such as a loss of internet connection or a cluster being down. However, since it is applied indiscriminately to all errors during transport tasks, tasks are sometimes restarted even though they can never succeed. Take, for example, the case where wrong scheduler parameters are passed as inputs: no matter how often the task is restarted, it will always fail, so retrying is futile.
To solve this, the interface between the scheduler plugins and the engine needs to be improved, such that a distinction can be made between recoverable and irrecoverable errors. The default assumption of the engine will remain that an error is recoverable by retrying, but with this new option a scheduler plugin can instruct the engine not to bother retrying when it encounters certain errors. Since these errors are going to be scheduler and transport specific, their respective plugins should be responsible for classifying them.
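To illustrate the idea, here is a minimal sketch of how a plugin could signal irrecoverability through the exception type it raises, and how the engine-side retry loop could react. All names here (`IrrecoverableSchedulerError`, the exit-code strings, `submit_with_retries`) are hypothetical and not actual aiida-core API:

```python
# Hypothetical sketch: a scheduler plugin raises one of two exception types,
# and the engine only retries the recoverable kind. Names are illustrative.


class SchedulerError(Exception):
    """Base class for scheduler errors; assumed recoverable by default."""


class IrrecoverableSchedulerError(SchedulerError):
    """Retrying will never help (e.g. invalid scheduler parameters)."""


def parse_submit_output(retval, stdout, stderr):
    """Classify a failed `sbatch` submission instead of failing generically."""
    if retval == 0:
        return stdout.strip()  # the job id on success
    if 'Invalid account or account/partition combination' in stderr:
        raise IrrecoverableSchedulerError(stderr.strip())
    raise SchedulerError(f'submission failed: retval={retval}, stderr={stderr.strip()}')


def submit_with_retries(parse, retval, stdout, stderr, max_attempts=5):
    """Engine-side sketch: retry only when the error is recoverable."""
    for _attempt in range(max_attempts):
        try:
            return parse(retval, stdout, stderr)
        except IrrecoverableSchedulerError:
            # Give up immediately and surface a usable exit code.
            return 'EXIT_CODE_INVALID_SCHEDULER_PARAMETERS'
        except SchedulerError:
            continue  # the exponential back-off would go here
    return 'EXIT_CODE_SUBMISSION_FAILED'
```

With this split, the example below (an invalid account/partition combination) would stop at the first submission attempt instead of exhausting all five retries.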
Just to give an example of one specific case:
```
File "/home/efl/work/devel/aiida_core/aiida/schedulers/plugins/slurm.py", line 431, in _parse_submit_output
    "stdout={}\nstderr={}".format(retval, stdout, stderr))
aiida.schedulers.scheduler.SchedulerError: Error during submission, retval=1
stdout=
stderr=sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

06/03/2019 11:01:38 AM <23614> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [WARNING] maximum attempts 5 of calling do_submit, exceeded
06/03/2019 11:01:39 AM <23614> aiida.engine.processes.calcjobs.tasks: [WARNING] submitting CalcJob<832> failed
```
This should stop at the first submission attempt and return some kind of usable exit code that is passed along.
Also, when we fix this, we most likely have to write a `stdout` and `stderr` scheduler parser of some sort. Since we need this for other purposes, it would be nice to make it modular. For error handlers and cluster monitoring, it would be nice to be able to request the status of the `stdout` and `stderr` from the scheduler at any given time (meaning we should keep open the possibility for this request to go through the transport and be parsed on the cluster side).
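As a sketch of what "modular" could mean here: a small registry of pattern-to-diagnosis rules per scheduler, which could run either engine-side or be shipped through the transport and run cluster-side. The class and rule names below are hypothetical:

```python
# Sketch of a modular scheduler-output parser: a registry of
# (regex pattern -> diagnosis) rules. All names here are illustrative.

import re


class OutputParser:
    """Match known error patterns in scheduler stdout/stderr."""

    def __init__(self):
        self._rules = []  # list of (compiled pattern, diagnosis) tuples

    def register(self, pattern, diagnosis):
        self._rules.append((re.compile(pattern), diagnosis))

    def parse(self, stdout, stderr):
        """Return the first matching diagnosis, or None if the output looks clean."""
        for regex, diagnosis in self._rules:
            if regex.search(stderr) or regex.search(stdout):
                return diagnosis
        return None


# Example rules for SLURM; other scheduler plugins would register their own.
slurm_parser = OutputParser()
slurm_parser.register(r'Invalid account or account/partition', 'invalid-account-or-partition')
slurm_parser.register(r'Requested node configuration is not available', 'invalid-node-configuration')
```

Because the rules are plain data, the same table could be evaluated on the cluster side without shipping any engine code.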
In fact, for e.g. Slurm there is an API, so no parsing is really needed; there is even PySlurm. This is probably not true for schedulers in general, but we should focus on at least having full support for PBS and Slurm.
Regarding your second comment, we have this open issue. Note, however, that while part of the required code for that issue is indeed shared with this one, i.e. a parser for the `stderr` and `stdout` content returned by scheduler commands, allowing to act on those things while the calculation job is running is quite a different problem, and currently I am not sure how to implement this in the engine. Essentially, when the calculation job is in the stage where it is querying the scheduler for a status update (the `update` transport task), it should also request the content of the `stderr` and `stdout` written on the cluster for that job, perform some parsing, and optionally kill the job. This means there needs to be some hook on the `CalcJob` class that can implement this logic. If this logic is implemented (and not optionally disabled with some input setting on the `CalcJob`), then the engine can also call this transport task in addition to the `UPDATE` one. I am not sure whether this should become a "second" transport task that runs at the same time as the `UPDATE` one, since that might be complicated from the engine's perspective, or whether the `UPDATE` task needs to be dynamically augmented with an additional step. To be discussed.
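The hook idea above could look roughly like the following. To be clear, the hook name `on_scheduler_update` and everything else here is invented for illustration; no such hook exists in aiida-core:

```python
# Hypothetical sketch: during the update task the engine fetches the job's
# stdout/stderr via the transport and gives a CalcJob-level callback the
# chance to request a kill. All names are invented for illustration.


class MonitoredCalcJob:
    """Minimal stand-in for a CalcJob subclass with a monitoring hook."""

    def on_scheduler_update(self, stdout, stderr):
        """Return 'kill' to abort the job, or None to keep waiting."""
        if 'NaN encountered' in stdout:  # example plugin-specific check
            return 'kill'
        return None


def update_task(job, fetch_output):
    """Engine-side sketch of an UPDATE transport task augmented with monitoring."""
    # In reality fetch_output would go through the transport to the cluster.
    stdout, stderr = fetch_output()
    action = job.on_scheduler_update(stdout, stderr)
    if action == 'kill':
        return 'killed'
    return 'waiting'
```

Folding the monitoring step into the existing `UPDATE` task, as sketched here, avoids coordinating two concurrent transport tasks, at the cost of coupling the output fetch to the status-polling interval.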
In fact, this issue: https://github.com/aiidateam/aiida-core/issues/2955#issuecomment-498181856 now keeps the calculation in the `Waiting` state and finally pauses it. Before, it failed after 5 tries, but now with the pause it just hangs, even though no recovery is possible (given the information at hand). With the pause mechanism it becomes even more important that we address #1925, as this error happens all the time when moving between different systems and accounts. Maybe we should consider putting this in before the release is published, as I expect this issue to appear quite frequently.
This is a duplicate of #2226, which was closed, but we can continue the discussion in this issue.
+1 for adding parsing to the SLURM plugin to detect this error and return a corresponding exit code with instructions to provide a `metadata.options.queue_name` to fix this
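A minimal sketch of that suggestion, with a hypothetical exit status number and message (neither is an actual aiida-core exit code):

```python
# Sketch: map the known sbatch stderr message to an exit code whose message
# tells the user how to fix their inputs. The status number 110 is an
# arbitrary placeholder, not a real aiida-core exit code.

INVALID_ACCOUNT_MESSAGE = (
    'Submission failed: invalid account or account/partition combination. '
    'Try setting `metadata.options.queue_name` explicitly in the inputs.'
)


def diagnose_sbatch_stderr(stderr):
    """Return (exit_status, message) for a known sbatch error, else None."""
    if 'Invalid account or account/partition combination' in stderr:
        return (110, INVALID_ACCOUNT_MESSAGE)
    return None
```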
+1 Just ran into this again