aiida-quantumespresso icon indicating copy to clipboard operation
aiida-quantumespresso copied to clipboard

`PwBaseWorkChain`: revisit restarting from calculations

Open mbercx opened this issue 4 years ago • 9 comments

Currently the PwBaseWorkChain keeps track of a parent calculation to restart from in the restart_calc variable in the context. For example, when the user provides a parent_folder input to restart the calculation from, its creator is assigned to the restart_calc context variable here:

https://github.com/aiidateam/aiida-quantumespresso/blob/9b083e78c56dd3a658e74018e8fe62772e22499c/aiida_quantumespresso/workflows/pw/base.py#L245-L246

This context variable is later used to set the CONTROL.restart_mode input tag of the pw.x calculation:

https://github.com/aiidateam/aiida-quantumespresso/blob/9b083e78c56dd3a658e74018e8fe62772e22499c/aiida_quantumespresso/workflows/pw/base.py#L415-L420

However, there is a problem in case the parent_folder has no creator, for example when the user has stashed some calculation outputs like the charge density from which she/he later wants to restart.

I think we should adapt the code so the creator is no longer necessary.

The restart_mode input tag

EDIT: I've edited the message below because I had made a mistake during my tests on this issue.

The explanation on the restart_mode is a little confusing, and gives the impression that it can only be used to restart from an interrupted run. I've done a bit of testing on a simple SCF calculation on Si to see what happens when you set this to 'restart', compared to using startingpot or startingwfc:

  • restart_mode = 'restart': You find the following lines in the output file:

       The initial density is read from file :
       ./out/aiida.save/charge-density
    
       Starting wfcs from file
    

    The calculation finishes in 1 SCF step, and the total CPU time is 6.09s.

  • startingpot = 'file': You find the following lines in the output file:

       The initial density is read from file :
       ./out/aiida.save/charge-density
    
       Starting wfcs are    8 randomized atomic wfcs
    

    The calculation finishes in 1 SCF step, and the total CPU time is 11.99s.

  • startingwfc = 'file': You find the following lines in the output file:

       Initial potential from superposition of free atoms
    
       starting charge    7.99888, renormalised to    8.00000
       Starting wfcs from file
    
       total cpu time spent up to now is        0.5 secs
    
       Self-consistent Calculation
    

    The calculation finished in 7 SCF steps, and the total CPU time is 17.99s.

  • startingpot = 'file' AND startingwfc = 'file': Same as restart_mode = 'restart'

So, it seems that:

  • setting restart_mode to 'restart' does restart both interrupted and completed runs properly. In this case both the wave functions and charge density are read from file, which gives a similar result as setting both startingpot and startingwfcto'file'`.

  • For startingpot = 'file', the calculation only requires 1 SCF step, but it takes longer because the initial wave function coefficients are random. Which is to be expected.

  • For startingwfc = 'file', it seems that the initial potential (or charge density) is still taken from the superposition of atomic charge densities. Because of this, the calculation still needs 7 SCF steps to converge and this restart method takes the longest. I suppose the question here is why the potential is not constructed from the wave functions before starting the SCF.

TL;DR

In the end, it seems that using restart_mode = 'restart' is indeed the way to go for restarting pw.x calculations for both interrupted and completed runs.

mbercx avatar May 19 '21 06:05 mbercx

Thanks to @MackeEric for first drawing my attention to this! Also pinging @ramirezfranciscof and @qiaojunfeng in case they have comments.

mbercx avatar May 19 '21 07:05 mbercx

So, I was a little rushed in running these tests this morning, and messed up the one for restart_mode = 'restart'. I've updated the OP to correct this.

@MackeEric: didn't you mention during one of our meetings that QE doesn't properly restart from the charge density / wave functions when restart_mode = 'restart'? Based my little test runs for Si, it seems QE v6.7 does properly restart also completed runs with this input by loading both the charge density and the wave functions. In fact, it seems almost identical to setting both startingpot = 'file' and startingwfc = 'file'.

mbercx avatar May 19 '21 11:05 mbercx

A caveat to using restart_mode = 'restart' is in case e.g. the number of bands has been changed. Then trying to load the wave functions will fail:

     The initial density is read from file :
     ./out/aiida.save/charge-density


 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     Error in routine pw_restart - read_collected_wfc (1):
     The number of bands for this run is    12, but only     8 bands were read from file
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

In this case, it is interesting to use startingpot = 'file', since only 1 SCF will be needed. We should probably use this setting in the sanity_check_insufficient_bands error handler, which claims to restart at the moment:

https://github.com/aiidateam/aiida-quantumespresso/blob/9b083e78c56dd3a658e74018e8fe62772e22499c/aiida_quantumespresso/workflows/pw/base.py#L434-L441

But as far as I can see it doesn't 😅 , which is what is also reported:

https://github.com/aiidateam/aiida-quantumespresso/blob/9b083e78c56dd3a658e74018e8fe62772e22499c/aiida_quantumespresso/workflows/pw/base.py#L476

So we probably should still distinguish between different restart modes to deal with cases like this.

mbercx avatar May 19 '21 12:05 mbercx

@MackeEric: didn't you mention during one of our meetings that QE doesn't properly restart from the charge density / wave functions when restart_mode = 'restart'? Based my little test runs for Si, it seems QE v6.7 does properly restart also completed runs with this input by loading both the charge density and the wave functions. In fact, it seems almost identical to setting both startingpot = 'file' and startingwfc = 'file'.

So in my experience, I have seen a somehow inconsistent behaviour, at least with QE v6.6. In a test scenario I designed on my own, I also found that startingpot = 'file' and startingwfc = 'file' are basically equivalent to setting restart_mode = 'restart'.

However, I have frequently seen cases where QE crashed when I was restarting from a calulation where I had set occupations = 'smearing' and then used occupations = 'fixed' for the restarted run (even with nbnd and tot_magnetization set to the same values as outputted by the smearing run!). In principle, this should not cause any trouble but in reality it often does, yielding PW errors such as too many bands are not converged directly after the first iteration.

On the other hand, the very same restarts worked perfectly fine if I only used startingpot = 'file' while leaving restart_mode = 'restart'.

MackeEric avatar May 20 '21 15:05 MackeEric

Thanks @MackeEric!

I also found that startingpot = 'file' and startingwfc = 'file' are basically equivalent to setting restart_mode = 'restart'.

One difference I realised while doing some more testing is that restart_mode = 'restart' also reads the structure from the previous calculation:

     Atomic positions and unit cell read from directory:
     ./out/aiida.save/
     Atomic positions from file used, from input discarded

So, when e.g. restarting from a previous relax calculation, using restart_mode = 'restart' just needs one SCF and ionic step, whereas setting startingpot = 'file' and startingwfc = 'file' will use of course use the initial structure.

However, I have frequently seen cases where QE crashed when I was restarting from a calulation where I had set occupations = 'smearing' and then used occupations = 'fixed' for the restarted run

Right, you weren't just restarting the same run, but adapting inputs. I can imagine using the wave functions from the 'smearing' run for a 'fixed' run causes troubles, since suddenly there will be states with partial occupations. I imagine similar errors will occur when using startingwfc = 'file' in this case. (Coindidentally, I did just try this for my simple Si SCF case and there I didn't run into any issues.)

So, using restart_mode = 'restart' is fine in case the run has completed, provided you don't change any calculation settings that cause trouble when loading the wave functions of the previous run.

mbercx avatar May 20 '21 16:05 mbercx

Hi everyone,

This days I did find @MackeEric very same problem ("too many bands are not converged"), while trying to "restart" a pw calculation in the hubbard workflow. When you want to start from a density of an other pw calculation, but not use the wfcs (as in, eg, the hubbard workflows), it is then necessary to be able to start the calculation using restart_mode='from_scratch' and startingpot='file', since the "smeared" wfcs could be too much different.

Could a solution be having a restart_folder and a parent_folder? The former is specifically used for restarting, while the latter is used to just copy the out folder (where one may want also to copy only specific part of it as an option for particular purposes? ).

Anyway, I do believe that a solution is needed, especially when I think about the hubbard workflow.

bastonero avatar Jun 02 '21 13:06 bastonero

Thanks for the comment @bastonero! I indeed think a possible path forward could be keeping track of a restart_folder instead of a restart_calc and updating the logic where needed to use startingpot = file instead of restart_mode = 'restart'. I just haven't looked into this in much detail, so I'm not sure if it leads to issues somewhere.

I have a meeting on Friday at 2pm with @timrov on discussing these restart issues. Maybe you also want to join?

mbercx avatar Jun 02 '21 14:06 mbercx

I have a meeting on Friday at 2pm with @timrov on discussing these restart issues. Maybe you also want to join? Yes, I think it would be great, thanks!

bastonero avatar Jun 02 '21 15:06 bastonero

Notes from meeting with @ramirezfranciscof, @timrov, @MackeEric and @bastonero:

  • We should consider only copying some files when restarting from a calculation. This should be fixable at the plugin level, and could help with reducing restart times when e.g. the wave functions are very big.
  • Question: When you try restart_mode = 'restart' and some of the outputs (e.g. wfc) are missing, does it crash the calculation?
  • Question: Is the indeed output structure is read from xml (is this the restart file the documentation mentions?) by default even if restart_mode = 'from_scratch'? See ion_positions input tag.
  • TODO: write document on all this for other QE users, and as a guideline for the discussion on making possible changes to QE (see below). Also check recent QE school for material on the topic.

Issues to raise on QE GitLab:

  • The charge density (potential) should be constructed from the wave functions in case startingwfc = 'file'.
  • Is there a problem with restarting with restart_mode = 'restart' outside of two interrupt cases suggested in the documentation? For example, if you restart from a calculation that has hit the maximum number of ionic iterations, is this ok?
  • Change the way restarting works in QE:
    1. restart_mode = 'restart' should not fail if something is missing. I.e. it should just check what outputs are there and use them to restart. (check question above)
    2. The individual settings for restarting (startingpot, startingwfc and ion_positions) should override the outputs that restart_mode uses, not the other way around.
    3. For consistency, it might actually be nice to rename ion_positions to e.g. startingion or something like that. Now it's not super clear that this is regarding restarting, and it would make it more in line with startingpot and startingwfc.

Pinging @giovannipizzi for comments, considering his considerable QE experience. 😁

mbercx avatar Jun 04 '21 13:06 mbercx