Make Quacc compatible with Parsl's checkpointing capabilities
Summary of Changes
This PR aims to let users take advantage of Parsl's checkpointing features for simple jobs that do not require complex parameter types (most of them, I believe). To this aim:
- Added a function `get_atoms_id_parsl`, which returns the hash computed by `_encode_atoms` (previously named `get_atoms_id`, which still exists as well) in bytes form.
- This function is automatically registered if Parsl is selected as the workflow engine; this is done in `quacc.__init__`.
- Added comprehensive documentation about the feature in `docs.user.misc.restarts`.
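For context, a minimal sketch of what such a registration can look like is shown below, assuming Parsl's single-dispatch `id_for_memo` hook; the `_encode_atoms` body here is only a stand-in for quacc's internal encoder, not the actual implementation.

```python
# Sketch only: registering an Atoms-aware hash with Parsl's memoizer so that
# checkpointing can recognize identical Atoms inputs. `_encode_atoms` below is
# a placeholder, not quacc's real encoder.
import hashlib

from ase import Atoms
from parsl.dataflow.memoization import id_for_memo


def _encode_atoms(atoms: Atoms) -> str:
    # Placeholder: build a deterministic string representation of the Atoms object
    return str(sorted(atoms.todict().items()))


def get_atoms_id_parsl(atoms: Atoms, output_ref: bool = False) -> bytes:
    # Parsl's memoizer expects bytes, hence the final .encode()
    return hashlib.md5(_encode_atoms(atoms).encode()).hexdigest().encode()


# Registered automatically when Parsl is the chosen workflow engine
id_for_memo.register(Atoms, get_atoms_id_parsl)
```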
I am not sure how to deal with testing yet. From what I have seen on my HPC this seems to work ok.
Requirements
- [x] My PR is focused on a single feature addition or bugfix.
- [ ] My PR has relevant, comprehensive unit tests.
- [x] My PR is on a custom branch (i.e., it is not named `main`).
Note: If you are an external contributor, you will see a comment from @buildbot-princeton. This is solely for the maintainers.
Can one of the admins verify this patch?
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 97.32%. Comparing base (`cf9e8db`) to head (`e9525f2`). Report is 13 commits behind head on main.
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main    #2521      +/-   ##
==========================================
- Coverage   97.39%   97.32%   -0.08%
==========================================
  Files          85       85
  Lines        3538     3550      +12
==========================================
+ Hits         3446     3455       +9
- Misses         92       95       +3
==========================================
```
@tomdemeyere: Thank you for your contribution here! This looks quite nice. I do not have any particular comments other than that I would be pleased to merge this. Regarding testing, I guess you could run one of the tests/parsl test functions twice over? But in the short-term, a test to make sure that get_atoms_id_parsl is covered is probably sufficient. This should be trivial to implement. I can merge thereafter if you agree.
@Andrew-S-Rosen I need to think about tests a little; it would be nice if we could check that this is actually working (e.g., grep for something in the Parsl runinfo ...).
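For what it's worth, a rough sketch of a "run it twice" check with plain Parsl is below; the config, run directory, and counter file are illustrative and not quacc's actual test setup. The idea is that the second run should be served from the checkpoint, so the app body only executes once.

```python
# Rough sketch: run the same cached app twice across two Parsl sessions and
# use a side-effect file to detect whether the body re-executed.
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import ThreadPoolExecutor
from parsl.utils import get_all_checkpoints


def make_config(run_dir: str) -> Config:
    # Write checkpoints as tasks finish and reload any existing ones at startup
    return Config(
        executors=[ThreadPoolExecutor()],
        run_dir=run_dir,
        checkpoint_mode="task_exit",
        checkpoint_files=get_all_checkpoints(run_dir),
    )


@python_app(cache=True)
def add(a, b, counter_file):
    # Side effect lets us detect whether the app body actually ran
    with open(counter_file, "a") as f:
        f.write("ran\n")
    return a + b


def run_once(run_dir: str, counter_file: str) -> int:
    parsl.load(make_config(run_dir))
    result = add(1, 2, counter_file).result()
    parsl.dfk().cleanup()
    parsl.clear()
    return result


run_once("runinfo_ckpt", "counter.txt")
run_once("runinfo_ckpt", "counter.txt")
# If checkpointing worked, counter.txt should contain a single "ran" line:
# the second run was answered from the checkpoint instead of re-executing.
```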
This PR introduces restarts (for one workflow engine) for jobs that have already completed. So if you run, say, a workflow that is a series of single points and you get timed out, the workflow will not recompute the already-completed job(s) on restart: nice!
However, the job(s) that failed and did not complete will rerun from the beginning (no DFT-side restart is possible), potentially wasting resources. This problem cannot be entirely fixed by the workflow engine (I believe), since quacc also has a say in directory management. It would be nice to fix this issue at some point; what do you think?
> However, the job(s) that failed and did not complete will rerun from the beginning (no DFT-side restart is possible), potentially wasting resources. This problem cannot be entirely fixed by the workflow engine (I believe), since quacc also has a say in directory management. It would be nice to fix this issue at some point; what do you think?
Sort of, yes. Basically, even if the workflow engine supports native restarts (as many do), it will rerun the job from the beginning in a new directory. So if you were doing a geometry optimization and it timed out at 1000 steps, you will have to start again rather than pick up where you left off at step 1001. The naive restart can be implemented entirely on the workflow engine side (where it belongs), but the workflow engine knows nothing about the science. It does not know how to restart a DFT calculation, which might involve shuttling files around (e.g., in VASP, moving CONTCAR to POSCAR) and can depend on the underlying method. What is the best way to solve this? It is not immediately clear to me...
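To make the file-shuttling point concrete, here is the sort of code-specific step a DFT-side restart can involve (illustrative sketch for VASP; the function name and paths are hypothetical):

```python
# Illustrative only: the kind of code-specific file shuffling a DFT-side
# restart needs, which a generic workflow engine cannot know about.
import shutil
from pathlib import Path


def prepare_vasp_restart(prev_dir: str, new_dir: str) -> None:
    """Seed a new VASP run with the last geometry from a timed-out run."""
    prev, new = Path(prev_dir), Path(new_dir)
    new.mkdir(parents=True, exist_ok=True)
    contcar = prev / "CONTCAR"
    if contcar.exists() and contcar.stat().st_size > 0:
        # In VASP, the last written geometry (CONTCAR) becomes the next input (POSCAR)
        shutil.copy(contcar, new / "POSCAR")
    # Reusing the wavefunction/charge density can further speed up the restarted SCF
    for aux in ("WAVECAR", "CHGCAR"):
        if (prev / aux).exists():
            shutil.copy(prev / aux, new / aux)
```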
> It is not immediately clear to me...
Not clear to me either... I think I understood your various requirements and expectations on this matter through our various interactions about it, and I might come up with an idea at some point (I hope).
I keep coming back to the restart aspect because I think it is an important matter that is often completely neglected in community codes. It would be nice to provide a solution so that users and groups can focus entirely on the science without having to worry about things like this.
> Not clear to me either... I think I understood your various requirements and expectations on this matter through our various interactions about it, and I might come up with an idea at some point (I hope).
Open to suggestions!
> I keep coming back to the restart aspect because I think it is an important matter that is often completely neglected in community codes. It would be nice to provide a solution so that users and groups can focus entirely on the science without having to worry about things like this.
Agreed. It's a pretty annoying thing to have to deal with...
> Open to suggestions!
The most non-invasive way I can see would be to perform a similar kind of hashing based on the job parameters, replacing the current naming scheme from tmp-quacc-time-etc... to tmp-quacc-hash, so that the same calculation always uses the same temporary folder. Results dirs would then still be unique. This has multiple good points:
- Only the directory-management requirement is solved here: restarts themselves are then tackled where they belong, e.g., in the calculators, Custodian, etc...
- People do not have to modify their scripts for restarts (in fact, they should not): they just run the same script again, and the workflow engine plus quacc will make sure to restart exactly where needed.
The challenge will be jobs that take non-trivial parameter types (phonopy, ...).
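As a rough illustration of the idea (not quacc's implementation), deterministic scratch names could be derived from a hash of the job parameters; the function and parameter names below are hypothetical.

```python
# Sketch of the idea above: derive the scratch directory name from a hash of
# the job's parameters so that reruns of the same calculation land in the
# same tmp-quacc-<hash> folder.
import hashlib
import json
from pathlib import Path


def deterministic_tmpdir(scratch_root: str, job_name: str, params: dict) -> Path:
    """Return a tmp-quacc-<hash> path unique to this (job, parameters) pair."""
    # Non-trivial parameter types (e.g. Phonopy objects) would need a custom
    # encoder here; plain JSON with default=str only covers the simple cases.
    blob = json.dumps({"name": job_name, "params": params}, sort_keys=True, default=str)
    digest = hashlib.md5(blob.encode()).hexdigest()[:12]
    return Path(scratch_root) / f"tmp-quacc-{digest}"


# The same inputs always map to the same folder, so a rerun can find the
# files left behind by a job that was killed partway through.
print(deterministic_tmpdir("/scratch", "relax", {"xc": "pbe", "ecut": 500}))
```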
@Andrew-S-Rosen Here are the tests, including proper parsl restarting tests.
Very nice! Thank you, @tomdemeyere!!