`PwBaseWorkChain`: Restart from scheduler exit code 120

Open mbercx opened this issue 3 years ago • 3 comments

Currently, a calculation that is cancelled because it hit the walltime, without QE exiting gracefully, isn't restarted by the PwBaseWorkChain because the returned exit code is 312. Looking at the report, however, the issue is caught by the scheduler parser, which returns exit code 120:

*** Scheduler errors:
slurmstepd: error: *** STEP 33830794.0 ON nid04797 CANCELLED AT 2021-09-15T21:58:56 DUE TO TIME LIMIT ***
slurmstepd: error: *** JOB 33830794 ON nid04797 CANCELLED AT 2021-09-15T21:58:56 DUE TO TIME LIMIT ***
srun: got SIGCONT
srun: forcing job termination

*** 4 LOG MESSAGES:
+-> WARNING at 2021-09-15 22:02:35.995520+02:00
 | scheduler parser returned exit code<120>: The job ran out of walltime.
+-> ERROR at 2021-09-15 22:02:37.966779+02:00
 | The stdout output file was incomplete probably because the calculation got interrupted.
+-> ERROR at 2021-09-15 22:02:37.963508+02:00
 | ERROR_OUTPUT_STDOUT_INCOMPLETE
+-> WARNING at 2021-09-15 22:02:37.970437+02:00
 | output parser returned exit code<312>: The stdout output file was incomplete probably because the calculation got interrupted.

If I remember correctly, we should check in the PwParser whether such an exit code is present on the node, and if so return it. We can then implement an error handler for the PwBaseWorkChain, or add it to the current out-of-walltime handler in case we agree that the restart mode should be the same (I think there is a difference in output files for these two cases? Or are the charge density and wave functions written at every electronic step?).

The exit code would have to be returned before the stdout is parsed, since parsing it leads to the ERROR_OUTPUT_STDOUT_INCOMPLETE exit code being returned, overriding the exit code set by the scheduler parser.
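To make the ordering concern concrete, here is a minimal sketch in plain Python. It does not use the real AiiDA or aiida-quantumespresso classes: `FakeNode` and `parse` are hypothetical stand-ins, and only the exit statuses 120 and 312 are taken from the report above.

```python
from dataclasses import dataclass
from typing import Optional

# Exit statuses quoted in the report above.
SCHEDULER_OUT_OF_WALLTIME = 120
OUTPUT_STDOUT_INCOMPLETE = 312


@dataclass
class FakeNode:
    """Hypothetical stand-in for a calculation node; not the AiiDA class."""

    exit_status: Optional[int] = None  # set by the scheduler parser, if at all
    stdout_complete: bool = True


def parse(node: FakeNode) -> int:
    """Return the exit status the output parser would report (0 means success)."""
    # Look for a status set by the scheduler parser *before* inspecting stdout,
    # so that it is not overridden by the incomplete-stdout status.
    if node.exit_status == SCHEDULER_OUT_OF_WALLTIME:
        return SCHEDULER_OUT_OF_WALLTIME
    if not node.stdout_complete:
        return OUTPUT_STDOUT_INCOMPLETE
    return 0
```

With this ordering, `parse(FakeNode(exit_status=120, stdout_complete=False))` reports 120 rather than 312, which is what a work chain handler would need to see in order to react to the walltime kill.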

mbercx, Sep 15 '21 20:09

I think it would indeed be good to detect this exit code. We can and should just add a line to the parser that checks whether an exit code has already been set and then returns it, after having done some other parsing that can still be interesting. However, I am not sure we can actually handle this as a normal timeout. The problem is that the files are most likely in an inconsistent state and the restart will fail. The only real fix is to increase the resources, but for now I don't think we should do this in the base workchain; the caller should take care of it.

sphuber, Sep 16 '21 15:09

The problem is that the files are most likely in an inconsistent state and the restart will fail.

Hmm, I'll have to check which files are consistent or not. Are, for example, the charge density and XML not written at every ionic step of a vc-relax? In that case it might be good to restart from these (structure from XML and startingpot = 'file'). Unfortunately it still happens quite often that running into the walltime during relaxations doesn't end the run gracefully, especially for larger structures, where the loss of computational time due to the lack of a restart is greater.
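As a rough illustration of the `startingpot = 'file'` part of that idea, the sketch below updates a nested dictionary of pw.x input parameters. The helper name `restart_from_charge_density` and the dictionary layout (namelist name mapping to keyword dictionary) are assumptions for illustration; `startingpot` itself belongs to the `ELECTRONS` namelist of pw.x. Taking the structure from the XML is not covered here.

```python
import copy


def restart_from_charge_density(parameters: dict) -> dict:
    """Return a copy of the inputs that reads the starting potential from file.

    Hypothetical helper: `parameters` is assumed to map namelist names to
    dictionaries of keywords, e.g. {'CONTROL': {'calculation': 'vc-relax'}}.
    """
    updated = copy.deepcopy(parameters)  # leave the caller's inputs untouched
    updated.setdefault('ELECTRONS', {})['startingpot'] = 'file'
    return updated
```

Whether this restart is safe still depends on the charge density on disk being complete, which is the open question in this thread.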

mbercx, Sep 16 '21 16:09

It would definitely be good to check whether there are cases in which we can still restart. The difficulty with this error is that it is really unpredictable: it can happen at any moment in time. For example, if it happens just after a checkpoint (if such a thing exists in QE) where all files have been written in a consistent format, then you could restart. However, what I have often seen happen is that the process is killed while it is writing the output, so parsing fails and a restart will fail as well. It would be good if the parser could detect the 120 and then inspect the state of the output files. If they are broken, we return 120, but if they are in a good state, we can set 400, or maybe a slightly different code to still indicate the difference with a proper 400.
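A minimal sketch of that decision logic, in plain Python, could look like the following. The names `xml_is_well_formed` and `resolve_exit_status` are hypothetical, and using XML well-formedness as the test for "files in a good state" is purely an assumption; a real check would likely need to look at the charge density and wave function files too. The status to return for a restartable run is left as a parameter, since the thread has not settled on 400 versus a separate code.

```python
import xml.etree.ElementTree as ET
from typing import Optional

SCHEDULER_OUT_OF_WALLTIME = 120  # from the scheduler parser in the report
OUT_OF_WALLTIME = 400            # the "proper" walltime status mentioned above


def xml_is_well_formed(text: str) -> bool:
    """Return True if the text parses as XML (a truncated file will not)."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


def resolve_exit_status(
    scheduler_status: Optional[int],
    xml_text: Optional[str],
    restartable_status: int = OUT_OF_WALLTIME,
) -> Optional[int]:
    """Choose the status to report after a scheduler walltime kill.

    Hypothetical logic: keep 120 if the output looks broken, otherwise
    report a status that a restart handler could act on.
    """
    if scheduler_status != SCHEDULER_OUT_OF_WALLTIME:
        return scheduler_status  # nothing to resolve
    if xml_text is not None and xml_is_well_formed(xml_text):
        return restartable_status
    return SCHEDULER_OUT_OF_WALLTIME
```

For example, `resolve_exit_status(120, "<root><a>")` keeps 120 because the XML is truncated, while a complete document yields the restartable status.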

sphuber, Sep 17 '21 07:09