aiida-quantumespresso
aiida-quantumespresso copied to clipboard
`PwBaseWorkChain`: revisit restarting from calculations
Currently the PwBaseWorkChain keeps track of a parent calculation to restart from in the restart_calc variable in the context. For example, when the user provides a parent_folder input to restart the calculation from, its creator is assigned to the restart_calc context variable here:
https://github.com/aiidateam/aiida-quantumespresso/blob/9b083e78c56dd3a658e74018e8fe62772e22499c/aiida_quantumespresso/workflows/pw/base.py#L245-L246
This context variable is later used to set the CONTROL.restart_mode input tag of the pw.x calculation:
https://github.com/aiidateam/aiida-quantumespresso/blob/9b083e78c56dd3a658e74018e8fe62772e22499c/aiida_quantumespresso/workflows/pw/base.py#L415-L420
However, there is a problem in case the parent_folder has no creator, for example when the user has stashed some calculation outputs like the charge density from which she/he later wants to restart.
I think we should adapt the code so the creator is no longer necessary.
The restart_mode input tag
EDIT: I've edited the message below because I had made a mistake during my tests on this issue.
The explanation on the restart_mode is a little confusing, and gives the impression that it can only be used to restart from an interrupted run. I've done a bit of testing on a simple SCF calculation on Si to see what happens when you set this to 'restart', compared to using startingpot or startingwfc:
-
restart_mode = 'restart': You find the following lines in the output file:The initial density is read from file : ./out/aiida.save/charge-density Starting wfcs from fileThe calculation finishes in 1 SCF step, and the total CPU time is 6.09s.
-
startingpot = 'file': You find the following lines in the output file:The initial density is read from file : ./out/aiida.save/charge-density Starting wfcs are 8 randomized atomic wfcsThe calculation finishes in 1 SCF step, and the total CPU time is 11.99s.
-
startingwfc = 'file': You find the following lines in the output file:Initial potential from superposition of free atoms starting charge 7.99888, renormalised to 8.00000 Starting wfcs from file total cpu time spent up to now is 0.5 secs Self-consistent CalculationThe calculation finished in 7 SCF steps, and the total CPU time is 17.99s.
-
startingpot = 'file'ANDstartingwfc = 'file': Same asrestart_mode = 'restart'
So, it seems that:
-
setting
restart_modeto'restart'does restart both interrupted and completed runs properly. In this case both the wave functions and charge density are read from file, which gives a similar result as setting bothstartingpotand startingwfcto'file'`. -
For
startingpot = 'file', the calculation only requires 1 SCF step, but it takes longer because the initial wave function coefficients are random. Which is to be expected. -
For
startingwfc = 'file', it seems that the initial potential (or charge density) is still taken from the superposition of atomic charge densities. Because of this, the calculation still needs 7 SCF steps to converge and this restart method takes the longest. I suppose the question here is why the potential is not constructed from the wave functions before starting the SCF.
TL;DR
In the end, it seems that using restart_mode = 'restart' is indeed the way to go for restarting pw.x calculations for both interrupted and completed runs.
Thanks to @MackeEric for first drawing my attention to this! Also pinging @ramirezfranciscof and @qiaojunfeng in case they have comments.
So, I was a little rushed in running these tests this morning, and messed up the one for restart_mode = 'restart'. I've updated the OP to correct this.
@MackeEric: didn't you mention during one of our meetings that QE doesn't properly restart from the charge density / wave functions when restart_mode = 'restart'? Based my little test runs for Si, it seems QE v6.7 does properly restart also completed runs with this input by loading both the charge density and the wave functions. In fact, it seems almost identical to setting both startingpot = 'file' and startingwfc = 'file'.
A caveat to using restart_mode = 'restart' is in case e.g. the number of bands has been changed. Then trying to load the wave functions will fail:
The initial density is read from file :
./out/aiida.save/charge-density
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Error in routine pw_restart - read_collected_wfc (1):
The number of bands for this run is 12, but only 8 bands were read from file
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
In this case, it is interesting to use startingpot = 'file', since only 1 SCF will be needed. We should probably use this setting in the sanity_check_insufficient_bands error handler, which claims to restart at the moment:
https://github.com/aiidateam/aiida-quantumespresso/blob/9b083e78c56dd3a658e74018e8fe62772e22499c/aiida_quantumespresso/workflows/pw/base.py#L434-L441
But as far as I can see it doesn't 😅 , which is what is also reported:
https://github.com/aiidateam/aiida-quantumespresso/blob/9b083e78c56dd3a658e74018e8fe62772e22499c/aiida_quantumespresso/workflows/pw/base.py#L476
So we probably should still distinguish between different restart modes to deal with cases like this.
@MackeEric: didn't you mention during one of our meetings that QE doesn't properly restart from the charge density / wave functions when
restart_mode = 'restart'? Based my little test runs for Si, it seems QE v6.7 does properly restart also completed runs with this input by loading both the charge density and the wave functions. In fact, it seems almost identical to setting bothstartingpot = 'file'andstartingwfc = 'file'.
So in my experience, I have seen a somehow inconsistent behaviour, at least with QE v6.6. In a test scenario I designed on my own, I also found that startingpot = 'file' and startingwfc = 'file' are basically equivalent to setting restart_mode = 'restart'.
However, I have frequently seen cases where QE crashed when I was restarting from a calulation where I had set occupations = 'smearing' and then used occupations = 'fixed' for the restarted run (even with nbnd and tot_magnetization set to the same values as outputted by the smearing run!). In principle, this should not cause any trouble but in reality it often does, yielding PW errors such as too many bands are not converged directly after the first iteration.
On the other hand, the very same restarts worked perfectly fine if I only used startingpot = 'file' while leaving restart_mode = 'restart'.
Thanks @MackeEric!
I also found that startingpot = 'file' and startingwfc = 'file' are basically equivalent to setting restart_mode = 'restart'.
One difference I realised while doing some more testing is that restart_mode = 'restart' also reads the structure from the previous calculation:
Atomic positions and unit cell read from directory:
./out/aiida.save/
Atomic positions from file used, from input discarded
So, when e.g. restarting from a previous relax calculation, using restart_mode = 'restart' just needs one SCF and ionic step, whereas setting startingpot = 'file' and startingwfc = 'file' will use of course use the initial structure.
However, I have frequently seen cases where QE crashed when I was restarting from a calulation where I had set occupations = 'smearing' and then used occupations = 'fixed' for the restarted run
Right, you weren't just restarting the same run, but adapting inputs. I can imagine using the wave functions from the 'smearing' run for a 'fixed' run causes troubles, since suddenly there will be states with partial occupations. I imagine similar errors will occur when using startingwfc = 'file' in this case. (Coindidentally, I did just try this for my simple Si SCF case and there I didn't run into any issues.)
So, using restart_mode = 'restart' is fine in case the run has completed, provided you don't change any calculation settings that cause trouble when loading the wave functions of the previous run.
Hi everyone,
This days I did find @MackeEric very same problem ("too many bands are not converged"), while trying to "restart" a pw calculation in the hubbard workflow. When you want to start from a density of an other pw calculation, but not use the wfcs (as in, eg, the hubbard workflows), it is then necessary to be able to start the calculation using restart_mode='from_scratch' and startingpot='file', since the "smeared" wfcs could be too much different.
Could a solution be having a restart_folder and a parent_folder?
The former is specifically used for restarting, while the latter is used to just copy the out folder (where one may want also to copy only specific part of it as an option for particular purposes? ).
Anyway, I do believe that a solution is needed, especially when I think about the hubbard workflow.
Thanks for the comment @bastonero! I indeed think a possible path forward could be keeping track of a restart_folder instead of a restart_calc and updating the logic where needed to use startingpot = file instead of restart_mode = 'restart'. I just haven't looked into this in much detail, so I'm not sure if it leads to issues somewhere.
I have a meeting on Friday at 2pm with @timrov on discussing these restart issues. Maybe you also want to join?
I have a meeting on Friday at 2pm with @timrov on discussing these restart issues. Maybe you also want to join? Yes, I think it would be great, thanks!
Notes from meeting with @ramirezfranciscof, @timrov, @MackeEric and @bastonero:
- We should consider only copying some files when restarting from a calculation. This should be fixable at the plugin level, and could help with reducing restart times when e.g. the wave functions are very big.
- Question: When you try
restart_mode = 'restart'and some of the outputs (e.g. wfc) are missing, does it crash the calculation? - Question: Is the indeed output structure is read from xml (is this the restart file the documentation mentions?) by default even if
restart_mode = 'from_scratch'? Seeion_positionsinput tag. - TODO: write document on all this for other QE users, and as a guideline for the discussion on making possible changes to QE (see below). Also check recent QE school for material on the topic.
Issues to raise on QE GitLab:
- The charge density (potential) should be constructed from the wave functions in case
startingwfc = 'file'. - Is there a problem with restarting with
restart_mode = 'restart'outside of two interrupt cases suggested in the documentation? For example, if you restart from a calculation that has hit the maximum number of ionic iterations, is this ok? - Change the way restarting works in QE:
restart_mode = 'restart'should not fail if something is missing. I.e. it should just check what outputs are there and use them to restart. (check question above)- The individual settings for restarting (
startingpot,startingwfcandion_positions) should override the outputs thatrestart_modeuses, not the other way around. - For consistency, it might actually be nice to rename
ion_positionsto e.g.startingionor something like that. Now it's not super clear that this is regarding restarting, and it would make it more in line withstartingpotandstartingwfc.
Pinging @giovannipizzi for comments, considering his considerable QE experience. 😁