
AiiDA scheduler plugin thinks job is finished while it is still running

Open · ltalirz opened this issue 3 years ago · 14 comments

From time to time, AiiDA will retrieve the files in a calculation folder before the calculation has finished. When checking the calculation folder using gotocomputer at a later point in time, one finds that the calculation actually finished fine.

From a user perspective, this issue is very similar to https://github.com/aiidateam/aiida-core/issues/3404, but we've now established (see below) that there are two different failure modes. In the case discussed here, AiiDA gets an incorrect job list (which no longer contains the job in question) and therefore starts retrieving it.

===========

@sphuber We ran into this issue on helvetios/fidis.

Now, following up on your suggestion to investigate the failure mode, I have looked at the historical job records of both clusters during that time using sacct. In particular, one can query for the job ids and work directories of past jobs as follows:

# get work dirs of jobs since 2020-07-01 
sacct -S 2020-07-01 -u ongari --format "jobid,jobname%20,workdir%70" > sacct-list
cat sacct-list | grep scratch | grep aiida_run | awk '{print $3}' > directories
cat directories | wc -l  # prints 4705
cat directories | sort -u | wc -l  # prints 4705

It turned out that the list of work directories is unique. If the reason for the failure were that AiiDA submitted the same job twice (causing one or both of them to fail), the corresponding work directories should appear twice in the list. In other words, I believe that in this particular case the issue occurs at the point where AiiDA queries the scheduler for the job status and gets an incomplete/incorrect answer.

As mentioned during our discussion, one solution to this issue could be to add a fail-safe mechanism for when AiiDA gets the information that a job has completed. This could be any of

  • running squeue again, just to be sure
  • using an alternative command that queries the status of that job specifically, such as
    • scontrol show job <id>
    • sacct -j <id>

and only accepting that the job is no longer running if the output of the second check agrees (a rough sketch of such a cross-check follows below).
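
For illustration, here is a minimal sketch of such a cross-check. This is not aiida-core code; the function name and the exact sacct invocation are my own assumptions of how one could double-check a single job before trusting its disappearance from squeue:

import subprocess

def job_really_finished(job_id: str) -> bool:
    """Hypothetical fail-safe: only trust that a job is gone if a second,
    job-specific query (here via sacct) also reports a terminal state."""
    result = subprocess.run(
        ['sacct', '-j', job_id, '--noheader', '--parsable2', '--format=State'],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        # If the cross-check itself fails, err on the side of "still running":
        # the job will simply be polled again at the next update cycle.
        return False
    states = {line.split('|')[0].strip() for line in result.stdout.splitlines() if line.strip()}
    active = {'PENDING', 'RUNNING', 'SUSPENDED', 'COMPLETING'}
    return bool(states) and not (states & active)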

Mentioning @danieleongari for info

Originally posted by @ltalirz in https://github.com/aiidateam/aiida-core/issues/3404#issuecomment-672951815

ltalirz avatar Aug 26 '20 20:08 ltalirz

Following a brief discussion with @giovannipizzi and @sphuber, here is an inspection of one of the failed calculation nodes:

$ verdi process report 896549
*** 896549 [CELL_OPT]: None
*** (empty scheduler output file)
*** (empty scheduler errors file)
*** 1 LOG MESSAGES:
+-> REPORT at 2020-08-10 21:16:13.126240+00:00
 | [896549|Cp2kCalculation|on_except]: Traceback (most recent call last):
 |   File "/home/daniele/anaconda3/envs/aiida1/lib/python3.6/site-packages/plumpy/process_states.py", line 225, in execute
 |     result = self.run_fn(*self.args, **self.kwargs)
 |   File "/home/daniele/aiida1/aiida_core/aiida/engine/processes/calcjobs/calcjob.py", line 286, in parse
 |     exit_code = execmanager.parse_results(self, retrieved_temporary_folder)
 |   File "/home/daniele/aiida1/aiida_core/aiida/engine/daemon/execmanager.py", line 439, in parse_results
 |     exit_code = parser.parse(**parse_kwargs)
 |   File "/home/daniele/aiida1/aiida-cp2k/aiida_cp2k/parsers/__init__.py", line 34, in parse
 |     exit_code = self._parse_stdout()
 |   File "/home/daniele/aiida1/aiida-lsmo/aiida_lsmo/parsers/__init__.py", line 62, in _parse_stdout
 |     raise OutputParsingError("CP2K did not finish properly.")
 | aiida.common.exceptions.OutputParsingError: CP2K did not finish properly.

And its attributes:

In [1]: c = load_node(896549)

In [2]: c.attributes
Out[2]:
{'job_id': '1219409',
 'sealed': True,
 'version': {'core': '1.3.0', 'plugin': '1.1.0'},
 'withmpi': True,
 'exception': 'aiida.common.exceptions.OutputParsingError: CP2K did not finish properly.\n',
 'resources': {'num_machines': 2, 'num_mpiprocs_per_machine': 36},
 'append_text': '',
 'parser_name': 'lsmo.cp2k_advanced_parser',
 'prepend_text': '',
 'last_job_info': {'title': 'aiida-896549',
  'job_id': '1219409',
  'raw_data': ['1219409',
   'R',
   'None',
   'h078',
   'ongari',
   '2',
   '72',
   'h[078,080]',
   'parallel',
   '3-00:00:00',
   '5:03:06',
   '2020-08-10T18:06:28',
   'aiida-896549',
   '2020-08-10T16:17:45'],
  'job_owner': 'ongari',
  'job_state': 'running',
  'annotation': 'None',
  'queue_name': 'parallel',
  'num_machines': 2,
  'num_mpiprocs': 72,
  'dispatch_time': {'date': '2020-08-10T18:06:28.000000', 'timezone': None},
  'submission_time': {'date': '2020-08-10T16:17:45.000000', 'timezone': None},
  'allocated_machines_raw': 'h[078,080]',
  'wallclock_time_seconds': 18186,
  'requested_wallclock_time_seconds': 259200},
 'process_label': 'Cp2kCalculation',
 'process_state': 'excepted',
 'retrieve_list': ['aiida.out',
  'aiida-1.restart',
  'aiida-pos-1.dcd',
  '_scheduler-stdout.txt',
  '_scheduler-stderr.txt'],
 'input_filename': 'aiida.inp',
 'remote_workdir': '/scratch/ongari/aiida_run/d4/11/b7ad-c962-4997-bab6-0788b3fdec49',
 'output_filename': 'aiida.out',
 'scheduler_state': 'done',
 'scheduler_stderr': '_scheduler-stderr.txt',
 'scheduler_stdout': '_scheduler-stdout.txt',
 'detailed_job_info': {'retval': 0,
  'stderr': '',
  'stdout': 'AllocCPUS|Account|AssocID|AveCPU|AvePages|AveRSS|AveVMSize|Cluster|Comment|CPUTime|CPUTimeRAW|DerivedExitCode|Elapsed|Eligible|End|ExitCode|GID|Group|JobID|JobName|MaxRSS|MaxRSSNode|MaxRSSTask|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|MinCPU|MinCPUNode|MinCPUTask|NCPUS|NNodes|NodeList|NTasks|Priority|Partition|QOSRAW|ReqCPUS|Reserved|ResvCPU|ResvCPURAW|Start|State|Submit|Suspended|SystemCPU|Timelimit|TotalCPU|UID|User|UserCPU|\n72|lsmo|882|||||helvetios||15-10:15:36|1332936|0:0|05:08:33|2020-08-10T16:17:45|Unknown|0:0|10697|lsmo|1219409|aiida-896549||||||||||72|2|h[078,080]||5838|parallel|1|72|01:48:43|5-10:27:36|469656|2020-08-10T18:06:28|RUNNING|2020-08-10T16:17:45|00:00:00|00:00:00|3-00:00:00|00:00:00|162182|ongari|00:00:00|\n36|lsmo|882|||||helvetios||7-17:07:48|666468||05:08:33|2020-08-10T18:06:28|Unknown|0:0|||1219409.batch|batch||||||||||36|1|h078|1||||36||||2020-08-10T18:06:28|RUNNING|2020-08-10T18:06:28|00:00:00|00:00:00||00:00:00|||00:00:00|\n72|lsmo|882|||||helvetios||15-10:15:36|1332936||05:08:33|2020-08-10T18:06:28|Unknown|0:0|||1219409.extern|extern||||||||||72|2|h[078,080]|2||||72||||2020-08-10T18:06:28|RUNNING|2020-08-10T18:06:28|00:00:00|00:00:00||00:00:00|||00:00:00|\n72|lsmo|882|||||helvetios||15-10:13:12|1332792||05:08:31|2020-08-10T18:06:30|Unknown|0:0|||1219409.0|cp2k.popt||||||||||72|2|h[078,080]|72||||72||||2020-08-10T18:06:30|RUNNING|2020-08-10T18:06:30|00:00:00|00:00:00||00:00:00|||00:00:00|\n'},
 'mpirun_extra_params': [],
 'environment_variables': {},
 'max_wallclock_seconds': 259200,
 'import_sys_environment': True,
 'submit_script_filename': '_aiidasubmit.sh',
 'retrieve_temporary_list': [],
 'scheduler_lastchecktime': '2020-08-10T21:12:31.029024+00:00',
 'custom_scheduler_commands': ''}

The "smoking gun" would seem to be in the detailed job info, which clearly specifies that, according to slurm, the job is still in state RUNNING:

AllocCPUS|Account|AssocID|AveCPU|AvePages|AveRSS|AveVMSize|Cluster|Comment|CPUTime|CPUTimeRAW|DerivedExitCode|Elapsed|Eligible|End|ExitCode|GID|Group|JobID|JobName|MaxRSS|MaxRSSNode|MaxRSSTask|MaxVMSize|MaxVMSizeNode|MaxVMSizeTask|MinCPU|MinCPUNode|MinCPUTask|NCPUS|NNodes|NodeList|NTasks|Priority|Partition|QOSRAW|ReqCPUS|Reserved|ResvCPU|ResvCPURAW|Start|State|Submit|Suspended|SystemCPU|Timelimit|TotalCPU|UID|User|UserCPU|
72|lsmo|882|||||helvetios||15-10:15:36|1332936|0:0|05:08:33|2020-08-10T16:17:45|Unknown|0:0|10697|lsmo|1219409|aiida-896549||||||||||72|2|h[078,080]||5838|parallel|1|72|01:48:43|5-10:27:36|469656|2020-08-10T18:06:28|RUNNING|2020-08-10T16:17:45|00:00:00|00:00:00|3-00:00:00|00:00:00|162182|ongari|00:00:00|
36|lsmo|882|||||helvetios||7-17:07:48|666468||05:08:33|2020-08-10T18:06:28|Unknown|0:0|||1219409.batch|batch||||||||||36|1|h078|1||||36||||2020-08-10T18:06:28|RUNNING|2020-08-10T18:06:28|00:00:00|00:00:00||00:00:00|||00:00:00|
72|lsmo|882|||||helvetios||15-10:15:36|1332936||05:08:33|2020-08-10T18:06:28|Unknown|0:0|||1219409.extern|extern||||||||||72|2|h[078,080]|2||||72||||2020-08-10T18:06:28|RUNNING|2020-08-10T18:06:28|00:00:00|00:00:00||00:00:00|||00:00:00|
72|lsmo|882|||||helvetios||15-10:13:12|1332792||05:08:31|2020-08-10T18:06:30|Unknown|0:0|||1219409.0|cp2k.popt||||||||||72|2|h[078,080]|72||||72||||2020-08-10T18:06:30|RUNNING|2020-08-10T18:06:30|00:00:00|00:00:00||00:00:00|||00:00:00|

For context: the detailed job info is recorded inside task_retrieve_job, right before the files of the calculation are retrieved (https://github.com/aiidateam/aiida-core/blob/855ae82e42b8a50e6c507fe9083187a22fe2cfea/aiida/engine/processes/calcjobs/tasks.py#L258), and it actually uses sacct for the slurm scheduler (https://github.com/aiidateam/aiida-core/blob/aa9a2cb519f96fef24746a7ffb8e5701107f2503/aiida/schedulers/plugins/slurm.py#L233).

Looking at the contents of the folder today shows that, despite the failed parsing step, the CP2K calculation on the cluster actually finished without issues.

This is compatible with my suspicion that the issue occurs when AiiDA checks squeue, with something going wrong either on the AiiDA side or, quite possibly, on the slurm side.

ltalirz avatar Aug 26 '20 20:08 ltalirz

From our discussion, possible explanations include:

  1. for whatever reason, the output of squeue is too long for the ssh buffer (~4 MB according to @giovannipizzi). This is possible but unlikely, since squeue is asked to query only the jobs of the given user: https://github.com/aiidateam/aiida-core/blob/aa9a2cb519f96fef24746a7ffb8e5701107f2503/aiida/schedulers/plugins/slurm.py#L206-L207
  2. perhaps more likely: the output from squeue is incomplete and is still parsed by AiiDA

Re 2.: It turns out that AiiDA gets the return value of the joblist command and forwards it to the _parse_joblist_output function of the scheduler plugin, but the slurm implementation ignores a non-zero exit status if stderr is empty: https://github.com/aiidateam/aiida-core/blob/aa9a2cb519f96fef24746a7ffb8e5701107f2503/aiida/schedulers/plugins/slurm.py#L485-L499

This seems like a dangerous practice to me, and it would probably be quite straightforward to improve.
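
To make the concern concrete, the pattern being described boils down to something like the following (a simplified paraphrase for illustration, not the actual aiida-core implementation):

def parse_joblist_output(retval, stdout, stderr):
    # Simplified paraphrase of the dangerous pattern: a non-zero exit code is
    # treated as an error only if stderr happens to be non-empty.
    if retval != 0 and stderr.strip():
        raise RuntimeError('Error during squeue parsing')
    # If squeue failed but printed nothing to stderr, a truncated or empty
    # stdout is parsed as the authoritative list of running jobs, so every
    # job missing from it is later considered terminated.
    return [line.split() for line in stdout.splitlines() if line.strip()]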

P.S. Unfortunately, we no longer have the complete daemon logs from the time of the failed calculation to check exactly what was printed there. Searching through the recent log history, however, it does seem that this part of the code catches socket timeouts on squeue from time to time, e.g.:

08/21/2020 09:12:45 AM <26803> aiida.scheduler.slurm: [WARNING] Warning in _parse_joblist_output, non-empty stderr='slurm_load_jobs error: Socket timed out on send/recv operation'
08/21/2020 09:12:45 AM <26803> aiida.engine.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
  File "/home/daniele/aiida1/aiida_core/aiida/engine/transports.py", line 103, in request_transport
    yield transport_request.future
  File "/home/daniele/aiida1/aiida_core/aiida/engine/processes/calcjobs/manager.py", line 106, in _get_jobs_from_scheduler
    scheduler_response = scheduler.get_jobs(**kwargs)
  File "/home/daniele/aiida1/aiida_core/aiida/schedulers/scheduler.py", line 340, in get_jobs
    joblist = self._parse_joblist_output(retval, stdout, stderr)
  File "/home/daniele/aiida1/aiida_core/aiida/schedulers/plugins/slurm.py", line 499, in _parse_joblist_output
    raise SchedulerError('Error during squeue parsing (_parse_joblist_output function)')
aiida.schedulers.scheduler.SchedulerError: Error during squeue parsing (_parse_joblist_output function)

In these cases, stderr was non-empty and the scheduler error was raised as designed - but perhaps under high load / special circumstances, stderr can remain empty?

ltalirz avatar Aug 26 '20 20:08 ltalirz

@giovannipizzi Can the logic be changed to something like:

if 'invalid job id specified' not in stderr:
    if stderr.strip():
        self.logger.warning("Warning in _parse_joblist_output, non-empty stderr='{}'".format(stderr.strip()))
    if retval != 0:
        raise SchedulerError('Error during squeue parsing (_parse_joblist_output function)')

From your comments in the code it is not entirely clear to me whether the error raised by squeue when specifying job ids of jobs that are no longer known to squeue also results in some identifiable trace in stderr. In my experiments on fidis (slurm 19.05.3-2), I cannot even get a non-zero exit code when providing non-existing job ids to squeue:

$ squeue -u ongari
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           5117691    serial aiida-95   ongari  R    2:10:49      1 f004
           5113933    serial aiida-95   ongari  R    3:46:09      1 f004
$ squeue -u ongari --jobs=5117691,5113933
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           5117691    serial aiida-95   ongari  R    2:11:04      1 f004
           5113933    serial aiida-95   ongari  R    3:46:24      1 f004
$ squeue -u ongari --jobs=51133333,5117691
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           5117691    serial aiida-95   ongari  R    2:11:42      1 f004
$ echo $?
0

Same with slurm 20.02.2 at CSCS

ltalirz avatar Aug 26 '20 20:08 ltalirz

@giovannipizzi I now also tested with slurm 17.11 on Quantum Mobile:

$ squeue --version
slurm-wlm 17.11.2
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
$ squeue --jobs=123
slurm_load_jobs error: Invalid job id specified
$ squeue --jobs=123,124
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
$ echo $?
0

I.e. so far I have encountered only two responses:

  1. single job id: returns error slurm_load_jobs error: Invalid job id specified when specifying an invalid one
  2. list of job ids: returns the ones it finds; never complains about invalid ones

Is there a slurm version that has a different behavior for 2.?

ltalirz avatar Aug 27 '20 09:08 ltalirz

I don't remember the exact details of the old behaviour (but in the header it's written which version it was).

Also, I don't know whether the answer you get also depends on whether the (completed) job has already been removed from the SLURM history.

On Daint/Fidis, is the logic the same (a single job gives error, multiple don't)? In any case, isn't this a problem anyway? The length of the list depends on the number of queued jobs, and it starts to be tricky to tackle these different cases (1 vs >1) differently...

Also, maybe what happens is that the string is printed in the output, and then it crashes? Could this have been the case? I think the old logic would have tried to parse it, with the consequences described. But your suggestion would also have the same issue. Is there a way to check if this is the case, e.g. by increasing the verbosity? How often does the problem appear?

giovannipizzi avatar Aug 27 '20 10:08 giovannipizzi

See also this thread https://github.com/mllg/batchtools/issues/201 for this and similar timeout issues.

Maybe we want to say that if there is any other stderr, we stop? Is this going to be safe enough? Or do we risk that things will remain queued forever because of some additional stderr on some machines?

giovannipizzi avatar Aug 27 '20 10:08 giovannipizzi

Also, I don't know whether the answer you get also depends on whether the (completed) job has already been removed from the SLURM history.

I just checked - it does not (would also be very surprising if it did).

On Daint/Fidis, is the logic the same (a single job gives error, multiple don't)?

Yes (I checked).

In any case, isn't this a problem anyway? The length of the list depends on the number of queued jobs, and it starts to be tricky to tackle these different cases (1 vs >1) differently...

Right. I checked that simply appending a comma does not change things. However, we could purposefully append the last jobid twice.

[talirz@fidis ~]$ squeue -u ongari --jobs=5125042
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           5125042    serial aiida-95   ongari  R    1:40:27      1 g202
[talirz@fidis ~]$ squeue -u ongari --jobs=512504
slurm_load_jobs error: Invalid job id specified
[talirz@fidis ~]$ squeue -u ongari --jobs=512504,
slurm_load_jobs error: Invalid job id specified
[talirz@fidis ~]$ squeue -u ongari --jobs=512504,512504
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[talirz@fidis ~]$ squeue -u ongari --jobs=5125042,5125042
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           5125042    serial aiida-95   ongari  R    1:47:45      1 g202

This results in the same slurm behavior independent of the length of the job list.
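
A hypothetical helper to build the squeue option this way could look like the following (just a sketch of the trick, not existing aiida-core code):

def build_squeue_jobs_option(job_ids):
    """Duplicate the last job id so that squeue always receives a list of at
    least two (possibly repeated) ids; in the slurm versions tested above this
    keeps the exit code at 0 even when some of the ids are unknown."""
    ids = [str(job_id) for job_id in job_ids]
    if ids:
        ids.append(ids[-1])
    return '--jobs=' + ','.join(ids)

# build_squeue_jobs_option(['5125042'])      -> '--jobs=5125042,5125042'
# build_squeue_jobs_option(['512504', '42']) -> '--jobs=512504,42,42'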

Alternatively, we could implement a simple logic: if the list of jobs we are querying has length 1 (I hope there is a way to access this from _parse_joblist_output), ignore the non-zero exit code if stderr contains the 'invalid job id' error. Otherwise, never ignore a non-zero exit code.
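
This alternative would amount to something like the following sketch (assuming the number of queried jobs could somehow be made available to the parsing code, which is exactly the open question):

def handle_squeue_result(retval, stderr, num_jobs_queried):
    # Only for a single-job query is the 'invalid job id' error expected and
    # safe to ignore; any other non-zero exit code means the output must not
    # be trusted or parsed.
    single_job_not_found = (
        num_jobs_queried == 1 and 'invalid job id specified' in stderr.lower()
    )
    if retval != 0 and not single_job_not_found:
        raise RuntimeError('squeue returned a non-zero exit code; refusing to parse its output')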

Also, maybe what happens is that the string is printed in the output, and then it crashes? Could this have been the case? I think the old logic would have tried to parse it, with the consequences described. But your suggestion would also have the same issue.

Well, I was assuming that the exit code in case of a crash is non-zero (hopefully?). In this case, my logic would not suffer from the same issue. In general, I think it is very dangerous to ignore a nonzero exit code unless we are absolutely sure it is safe to ignore.

Is there a way to check if this is the case, e.g. by increasing the verbosity? How often does the problem appear?

Looking at the squeue manpage, there is an option for verbose output, but I doubt it will help:

$ squeue -u ongari --jobs=5125042,5125042 -v
-----------------------------
all         = false
array       = false
federation  = false
format      = (null)
iterate     = 0
job_flag    = 1
jobs        = 5125042,5125042
licenses    = (null)
local       = false
names       = (null)
nodes       =
partitions  = (null)
priority    = false
reservation = (null)
sibling      = false
sort        = (null)
start_flag  = 0
states      = (null)
step_flag   = 0
steps       = (null)
users       = ongari
verbose     = 1
-----------------------------


Thu Aug 27 12:56:46 2020
squeue: Linear node selection plugin loaded with argument 20
squeue: Cray/Aries node selection plugin loaded
squeue: Consumable Resources (CR) Node Selection plugin loaded with argument 20
squeue: select/cons_tres loaded with argument 20
last_update_time=0 records=1
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           5125042    serial aiida-95   ongari  R    2:06:01      1 g202

There may be a record of squeue crashes at the operating system level (e.g. in /var/log/crashes or similar) but that is something to ask the slurm admins.

Maybe we want to say that if there is any other stderr, we stop? Is this going to be safe enough? Or do we risk that things will remain queued forever because of some additional stderr on some machines?

Well, first of all I think we should not be ignoring a non-zero exit code (irrespective of whether stderr is empty or not). Then, if the exit code is non-zero we can look at stderr and decide what to do (depending on how we decide to handle the case of single vs multiple jobs).

ltalirz avatar Aug 27 '20 11:08 ltalirz

However, we could purposefully append the last jobid twice.

If we go this route, make sure to include an extensive comment :sweat_smile:

sphuber avatar Aug 27 '20 11:08 sphuber

However, we could purposefully append the last jobid twice.

Very clever! I agree it's a bit of a trick, though. Maybe it's better to just do if/else then (and hope that our inference of slurm's behaviour is correct, also for other versions...).

I was assuming that the exit code in case of a crash is non-zero (hopefully?). In this case, my logic would not suffer from the same issue.

Are you sure? I think you wrote:

if 'invalid job id specified' not in stderr:
    if stderr.strip():
        self.logger.warning("Warning in _parse_joblist_output, non-empty stderr='{}'".format(stderr.strip()))
    if retval != 0:
        raise SchedulerError('Error during squeue parsing (_parse_joblist_output function)')

So, if stderr includes 'invalid job id specified' at any point, plus some other serious error, you will skip the whole block and never raise, independent of the retval. My suggestion would be to be more restrictive and only check stderr.strip() != 'invalid job id specified', or something like this (to be tested, though).

In general, I think it is very dangerous to ignore a nonzero exit code unless we are absolutely sure it is safe to ignore.

I agree. At the time of the first implementation, I couldn't find a better way.

There may be a record of squeue crashes at the operating system level (e.g. in /var/log/crashes or similar) but that is something to ask the slurm admins.

You might want to write to the admins: you have the job ID and a quite precise time when this happened. At least we can confirm the issue and prepare a proper fix.

Feel free to prepare a PR improving on the current situation. Ideally, however, it would be great if we manage to reproduce the problem at least once, with some extensive logging that shows the exact stderr and stdout, so we can confirm that we have a good fix. In the process, you would also be collecting a lot of 'valid' stdout/stderr, which will be good to double-check that we are not creating the opposite problem: being overzealous and ending up never updating the job state.

giovannipizzi avatar Aug 27 '20 12:08 giovannipizzi

Maybe it's better to just do if/else then (and hope that our inference of slurm's behaviour is correct, also for other versions...).

Concerning this, my question was whether I can access the original joblist from the _parse_joblist_output function. It seems to me that this information is not stored anywhere (?), which means one needs to parse the output without knowing what the input was.

That makes it a bit more fragile.

Are you sure? I think you wrote: ... So, if stderr includes 'invalid job id specified' at any point, plus some other serious error, you will skip the whole block and never raise, independent of the retval. My suggestion would be to be more restrictive and only check stderr.strip() != 'invalid job id specified', or something like this (to be tested, though).

Perhaps I misunderstood what you meant by "this issue" - I was referring to the fact that a non-zero exit code with empty stderr would previously go undetected (which would be fixed by the logic I wrote).

While it's true that it may still be problematic to ignore the exit code simply because we find invalid job id specified in stderr (which would also happen currently), I'm not sure it is better to enforce that stderr needs to exactly equal a certain string - for the reason you mention, other stuff might end up in stderr and I see no way for us to test this as it may depend on the supercomputer setup etc.

If you want to solve this remaining problem as well, I don't really see another solution than the modification I proposed for the joblist passed to squeue, i.e. making sure (in one way or another) that all the squeue commands we issue are expected to return exit code 0.

You might want to write to the admins: you have the job ID and a quite precise time when this happened. At least we can confirm the issue and prepare a proper fix.

I will contact them - let's see whether we can find something out.

ltalirz avatar Aug 27 '20 13:08 ltalirz

OK - I don't think you can access the list of jobs from _parse_joblist_output, but you can make assumptions about how it was called (i.e., with the corresponding method in the same class). We already do this, e.g. we set some environment variables to control the format, etc.

So I think it's OK to proceed with the trick of putting the last job id twice (and explaining why).

giovannipizzi avatar Aug 27 '20 16:08 giovannipizzi

Unfortunately, we need to reopen this issue since #4332 was evidently not enough to resolve it - we are still encountering this in AiiDA 1.5.2.

One new piece of information: I have encountered this issue on a local slurm installation when submitting a medium-sized number of jobs (say, 100), which should really not be enough to overload slurm to the point that it starts behaving outside its specifications.

Connection cooldown time was 0s (local computer); I did not touch the minimum scheduler poll interval, so I assume it is 10s.

I have access to the slurm logs, and I can see that:

  • 40 CalcJobs that were created within 1 minute of each other (Node.ctime) failed because of early retrieval
  • the Node.get_scheduler_lastchecktime() values of all failing jobs were within 0.5 s of each other (i.e. they used the same joblist)
  • in the cases I checked, get_scheduler_lastchecktime() is at least two minutes after the jobs were queued according to the slurm logs (i.e. sbatch ... should have completed), but 15 s to several minutes before the jobs started running according to the slurm logs

~~@sphuber To me, this poses a very basic question: Is there any logic in the engine ensuring that job submission (e.g. sbatch ...) has completed (not just been scheduled) before the first check of the queue is being done?~~

~~If not (and if all of this is happening in somewhat random asynchronous order) then I believe this is the source of this issue.~~

Edit: Sorry, I got a bit ahead of myself there, comparing different timings. I've edited the list now so that it correctly reflects what I understand now. In particular, I don't seem to have any cases of jobs being retrieved before submission to slurm has completed, so my original thought of the reason may have been wrong.

ltalirz avatar Feb 16 '21 18:02 ltalirz

So let's see if we can get the timeline correct here:

  • AiiDA calls sbatch
  • SLURM actually creates the job entry
  • AiiDA queries SLURM
  • AiiDA starts to retrieve
  • SLURM actually starts running the scheduled job

Even though you struck out your initial suspicion, just to answer your question: AiiDA decides whether the sbatch command was successful based on the return value. So it is possible that, if SLURM receives the call and returns 0 but then an error occurs somewhere within SLURM and the job gets dropped, AiiDA incorrectly assumes that the job has been scheduled and will run at some point in time. Conversely, if SLURM receives the call and schedules the job but then fails to send the response, AiiDA will assume the entire command failed and will try to resubmit. This latter issue is behavior that we have observed, where AiiDA submits a single job twice because the first time it incorrectly assumes it failed.

That being said, as you stated, none of these seem to apply in your case, because the jobs are confirmed to have been scheduled by SLURM before the last check time of AiiDA. Somehow though, at some query iteration, AiiDA still decides they are completed and starts to retrieve, even though they are actually still queued in SLURM. Here we should really just see what the exact response of the squeue command that AiiDA called in the update task was. AiiDA decides in only one single way that a job has terminated: it is no longer present in the list of job ids returned by squeue. If a still-queued job is somehow missing from that list (or an internal analysis bug incorrectly misses it), then it will incorrectly be considered terminated and retrieval will start. Do you have the debug statement of AiiDA in your logs from the update task where it decides those jobs are done? Because it has to be the case that the pks of the jobs are not there, even though they should be.
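
For reference, the decision criterion described here is essentially a set difference; a simplified illustration (not the actual engine code) of why a truncated squeue answer is enough to trigger retrieval of every monitored job:

def find_terminated_jobs(monitored_job_ids, squeue_reported_ids):
    """A job is considered terminated as soon as it no longer appears in the
    scheduler's answer, so an incomplete answer marks jobs as done too early."""
    return set(monitored_job_ids) - set(squeue_reported_ids)

# If squeue returns nothing (e.g. a truncated response parsed as an empty list),
# every monitored job is flagged as terminated and retrieval starts:
# find_terminated_jobs({'1219409', '1219410'}, set()) -> {'1219409', '1219410'}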

sphuber avatar Feb 17 '21 08:02 sphuber

As mentioned on the mailing list, I've been encountering some issues with the scheduler plugin mistaking jobs for completed while they are still queued or in progress. Note that it's somewhat difficult to isolate the issue, since it usually occurs only when queuing multiple processes. I'll try to paste the relevant information here:

$ verdi process report 55681

*** 55681: None                                                                                             
*** Scheduler output: N/A                                                                                   
*** Scheduler errors: N/A                                                                                   
*** 6 LOG MESSAGES:                                                                                         
+-> WARNING at 2022-01-28 00:48:17.327060+00:00                                                             
 | could not parse scheduler output: the `_scheduler-stderr.txt` file is missing                            
+-> WARNING at 2022-01-28 00:48:17.328889+00:00                                                             
 | could not parse scheduler output: the `_scheduler-stdout.txt` file is missing                            
+-> WARNING at 2022-01-28 00:48:17.409309+00:00                                                             
 | key 'symmetries' is not present in raw output dictionary                                                 
+-> ERROR at 2022-01-28 00:48:17.446123+00:00                                                               
 | ERROR_OUTPUT_STDOUT_INCOMPLETE                                                                           
+-> ERROR at 2022-01-28 00:48:17.448572+00:00                                                               
 | Both the stdout and XML output files could not be read or parsed.                                        
+-> WARNING at 2022-01-28 00:48:17.450076+00:00                                                             
 | output parser returned exit code<305>: Both the stdout and XML output files could not be read or parsed. 

The following is from the log file (verbosity = DEBUG), grepped by process PK:

01/27/2022 06:44:11 PM <1640958> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [INFO] Loaded process<55681> from saved state
01/27/2022 06:44:11 PM <1640958> aiida.engine.persistence: [DEBUG] Persisting process<55681>
01/27/2022 06:44:11 PM <1640958> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [INFO] Process<55681>: Broadcasting state change: state_changed.created.running
01/27/2022 06:44:11 PM <1640958> aiida.engine.persistence: [DEBUG] Persisting process<55681>
01/27/2022 06:44:11 PM <1640958> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [INFO] Process<55681>: Broadcasting state change: state_changed.running.waiting
01/27/2022 06:44:11 PM <1640958> aiida.engine.processes.calcjobs.tasks: [INFO] scheduled request to upload CalcJob<55681>
01/27/2022 06:45:09 PM <1640958> aiida.execmanager: [DEBUG] [submission of calculation 55681] copying file/folder pseudo...
01/27/2022 06:45:10 PM <1640958> aiida.execmanager: [DEBUG] [submission of calculation 55681] copying file/folder out...
01/27/2022 06:45:10 PM <1640958> aiida.execmanager: [DEBUG] [submission of calculation 55681] copying file/folder _aiidasubmit.sh...
01/27/2022 06:45:10 PM <1640958> aiida.execmanager: [DEBUG] [submission of calculation 55681] copying file/folder .aiida...
01/27/2022 06:45:10 PM <1640958> aiida.execmanager: [DEBUG] [submission of calculation 55681] copying file/folder aiida.in...
01/27/2022 06:45:10 PM <1640958> aiida.execmanager: [DEBUG] [submission of calculation 55681] copying file/folder environ.in...
01/27/2022 06:45:10 PM <1640958> aiida.engine.processes.calcjobs.tasks: [INFO] uploading CalcJob<55681> successful
01/27/2022 06:45:11 PM <1640958> aiida.engine.persistence: [DEBUG] Persisting process<55681>
01/27/2022 06:45:11 PM <1640958> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [INFO] Process<55681>: Broadcasting state change: state_changed.waiting.waiting
01/27/2022 06:45:11 PM <1640958> aiida.engine.processes.calcjobs.tasks: [INFO] scheduled request to submit CalcJob<55681>
01/27/2022 06:46:12 PM <1640958> aiida.engine.processes.calcjobs.tasks: [INFO] submitting CalcJob<55681> successful
01/27/2022 06:46:12 PM <1640958> aiida.engine.persistence: [DEBUG] Persisting process<55681>
01/27/2022 06:46:12 PM <1640958> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [INFO] Process<55681>: Broadcasting state change: state_changed.waiting.waiting
01/27/2022 06:46:13 PM <1640958> aiida.engine.processes.calcjobs.tasks: [INFO] scheduled request to update CalcJob<55681>
01/27/2022 06:47:13 PM <1640958> aiida.engine.processes.calcjobs.tasks: [INFO] updating CalcJob<55681> successful
01/27/2022 06:47:13 PM <1640958> aiida.engine.persistence: [DEBUG] Persisting process<55681>
01/27/2022 06:47:13 PM <1640958> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [INFO] Process<55681>: Broadcasting state change: state_changed.waiting.waiting
01/27/2022 06:47:13 PM <1640958> aiida.engine.processes.calcjobs.tasks: [INFO] scheduled request to retrieve CalcJob<55681>
01/27/2022 06:48:14 PM <1640958> aiida.execmanager: [DEBUG] Retrieving calc 55681
01/27/2022 06:48:14 PM <1640958> aiida.execmanager: [DEBUG] [retrieval of calc 55681] chdir /storage/scratch/ps0579/Solvation_work/f3/01/4dcc-6576-4885-8778-2795f55c4320
01/27/2022 06:48:14 PM <1640958> aiida.transport.SshTransport: [DEBUG] [retrieval of calc 55681] Trying to retrieve remote item 'aiida.out'
01/27/2022 06:48:14 PM <1640958> aiida.transport.SshTransport: [DEBUG] [retrieval of calc 55681] Trying to retrieve remote item './out/aiida.save/data-file-schema.xml'
01/27/2022 06:48:14 PM <1640958> aiida.transport.SshTransport: [DEBUG] [retrieval of calc 55681] Trying to retrieve remote item './out/aiida.save/data-file.xml'
01/27/2022 06:48:14 PM <1640958> aiida.transport.SshTransport: [DEBUG] [retrieval of calc 55681] Trying to retrieve remote item 'environ.debug'
01/27/2022 06:48:14 PM <1640958> aiida.transport.SshTransport: [DEBUG] [retrieval of calc 55681] Trying to retrieve remote item '_scheduler-stdout.txt'
01/27/2022 06:48:14 PM <1640958> aiida.transport.SshTransport: [DEBUG] [retrieval of calc 55681] Trying to retrieve remote item '_scheduler-stderr.txt'
01/27/2022 06:48:14 PM <1640958> aiida.execmanager: [DEBUG] [retrieval of calc 55681] Storing retrieved_files=None
01/27/2022 06:48:15 PM <1640958> aiida.engine.processes.calcjobs.tasks: [INFO] retrieving CalcJob<55681> successful
01/27/2022 06:48:17 PM <1640958> aiida.engine.persistence: [DEBUG] Persisting process<55681>
01/27/2022 06:48:17 PM <1640958> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [INFO] Process<55681>: Broadcasting state change: state_changed.waiting.running
01/27/2022 06:48:17 PM <1640958> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [INFO] Process<55681>: Broadcasting state change: state_changed.running.finished

The tracejob output for a similar failed job:

Job: 256025.cruntch3.chem.unt.edu
01/28/2022 15:03:19  S    enqueuing into route, state 1 hop 1
01/28/2022 15:03:19  S    dequeuing from route, state QUEUED
01/28/2022 15:03:19  S    enqueuing into share.20, state 1 hop 1
01/28/2022 15:03:19  A    queue=route
01/28/2022 15:03:19  A    queue=share.20
01/28/2022 15:06:50  S    Job Run at request of [email protected]
01/28/2022 15:06:50  S    Not sending email: job requested no e-mail
01/28/2022 15:06:50  A    user=ps0579 group=ps0579 jobname=aiida-61231 queue=share.20 ctime=1643403799 qtime=1643403799 etime=1643403799 start=1643404010 [email protected]
                          exec_host=c12-1-4/0+c12-1-4/1+c12-1-4/2+c12-1-4/3+c12-1-4/4+c12-1-4/5+c12-1-4/6+c12-1-4/7+c12-1-4/8+c12-1-4/9+c12-1-4/10+c12-1-4/11+c12-1-4/12+c12-1-4/13+c12-1-4/14+c12-1-4/15+c12-1-4/16+c12-1-4/17+c12-1-4/18+c12-1-4/19
                          Resource_List.neednodes=1:ppn=20 Resource_List.nodect=1 Resource_List.nodes=1:ppn=20
01/28/2022 15:08:03  S    Exit_status=0 resources_used.cput=00:24:10 resources_used.mem=7577060kb resources_used.vmem=16886168kb resources_used.walltime=00:01:13
01/28/2022 15:08:03  S    Not sending email: job requested no e-mail
01/28/2022 15:08:03  S    on_job_exit valid pjob: 256025.cruntch3.chem.unt.edu (substate=50)
01/28/2022 15:08:03  S    dequeuing from share.20, state COMPLETE
01/28/2022 15:08:03  A    user=ps0579 group=ps0579 jobname=aiida-61231 queue=share.20 ctime=1643403799 qtime=1643403799 etime=1643403799 start=1643404010 [email protected]
                          exec_host=c12-1-4/0+c12-1-4/1+c12-1-4/2+c12-1-4/3+c12-1-4/4+c12-1-4/5+c12-1-4/6+c12-1-4/7+c12-1-4/8+c12-1-4/9+c12-1-4/10+c12-1-4/11+c12-1-4/12+c12-1-4/13+c12-1-4/14+c12-1-4/15+c12-1-4/16+c12-1-4/17+c12-1-4/18+c12-1-4/19
                          Resource_List.neednodes=1:ppn=20 Resource_List.nodect=1 Resource_List.nodes=1:ppn=20 session=46317 end=1643404083 Exit_status=0 resources_used.cput=00:24:10
                          resources_used.mem=7577060kb resources_used.vmem=16886168kb resources_used.walltime=00:01:13

Log file excerpt surrounding an error:

01/27/2022 06:48:16 PM <1640958> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [INFO] Process<55645>: Broadcasting state change: state_changed.running.finished
01/27/2022 06:48:16 PM <1640958> aiida.engine.persistence: [DEBUG] Persisting process<55657>
01/27/2022 06:48:16 PM <1640958> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [INFO] Process<55657>: Broadcasting state change: state_changed.waiting.running
01/27/2022 06:48:16 PM <1640958> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [INFO] `TorqueScheduler` does not implement scheduler output parsing
01/27/2022 06:48:16 PM <1640958> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [INFO] Process<55657>: Broadcasting state change: state_changed.running.finished
01/27/2022 06:48:16 PM <1640958> aiida.engine.persistence: [DEBUG] Persisting process<55669>
01/27/2022 06:48:16 PM <1640958> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [INFO] Process<55669>: Broadcasting state change: state_changed.waiting.running
01/27/2022 06:48:16 PM <1640958> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [INFO] `TorqueScheduler` does not implement scheduler output parsing
01/27/2022 06:48:17 PM <1640958> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [INFO] Process<55669>: Broadcasting state change: state_changed.running.finished
01/27/2022 06:48:17 PM <1640958> aiida.engine.persistence: [DEBUG] Persisting process<55681>
01/27/2022 06:48:17 PM <1640958> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [INFO] Process<55681>: Broadcasting state change: state_changed.waiting.running
01/27/2022 06:48:17 PM <1640958> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [WARNING] could not parse scheduler output: the `_scheduler-stderr.txt` file is missing
01/27/2022 06:48:17 PM <1640958> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [WARNING] could not parse scheduler output: the `_scheduler-stdout.txt` file is missing
01/27/2022 06:48:17 PM <1640958> aiida.parser.EnvPwParser: [WARNING] key 'symmetries' is not present in raw output dictionary
01/27/2022 06:48:17 PM <1640958> aiida.parser.EnvPwParser: [ERROR] ERROR_OUTPUT_STDOUT_INCOMPLETE
01/27/2022 06:48:17 PM <1640958> aiida.parser.EnvPwParser: [ERROR] Both the stdout and XML output files could not be read or parsed.
01/27/2022 06:48:17 PM <1640958> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [WARNING] output parser returned exit code<305>: Both the stdout and XML output files could not be read or parsed.
01/27/2022 06:48:17 PM <1640958> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [INFO] Process<55681>: Broadcasting state change: state_changed.running.finished
01/27/2022 06:48:17 PM <1640958> aiida.engine.persistence: [DEBUG] Persisting process<55693>

@sphuber To address your comment posted to the AiiDA group: initial tests indicate that no extra jobs are spawned (the count of jobs on the scheduler matches the count of processes submitting jobs), but I'll have to investigate this further in order to get a better idea of what's going on.

matthew-truscott avatar Jan 28 '22 21:01 matthew-truscott