quacc icon indicating copy to clipboard operation
quacc copied to clipboard

Make Quacc compatible with Parsl's checkpointing capabilities

Open tomdemeyere opened this issue 1 year ago • 8 comments

Summary of Changes

This PR aim to allow users to use the Parsl checkpointing features for simple jobs that do not require complex parameter types (most of them I believe). To this aim:

  • Added a function get_atoms_id_parsl which return the hash computed by _encode_atoms (previously named get_atoms_id, which still exist as well) in bytes form.
  • This function is automatically registered if Parsl is selected as the workflow engine, this is done in quacc.__init__.
  • I added a comprehensive documentation about the feature in docs.user.misc.restarts.

I am not sure how to deal with testing yet. From what I have seen on my HPC this seems to work ok.

Requirements

Note: If you are an external contributor, you will see a comment from @buildbot-princeton. This is solely for the maintainers.

tomdemeyere avatar Oct 31 '24 12:10 tomdemeyere

Can one of the admins verify this patch?

buildbot-princeton avatar Oct 31 '24 12:10 buildbot-princeton

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 97.32%. Comparing base (cf9e8db) to head (e9525f2). Report is 13 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2521      +/-   ##
==========================================
- Coverage   97.39%   97.32%   -0.08%     
==========================================
  Files          85       85              
  Lines        3538     3550      +12     
==========================================
+ Hits         3446     3455       +9     
- Misses         92       95       +3     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Oct 31 '24 12:10 codecov[bot]

@tomdemeyere: Thank you for your contribution here! This looks quite nice. I do not have any particular comments other than that I would be pleased to merge this. Regarding testing, I guess you could run one of the tests/parsl test functions twice over? But in the short-term, a test to make sure that get_atoms_id_parsl is covered is probably sufficient. This should be trivial to implement. I can merge thereafter if you agree.

Andrew-S-Rosen avatar Oct 31 '24 23:10 Andrew-S-Rosen

@Andrew-S-Rosen I need to think about tests a little, it would be nice if we could check if this is working (grep something in the parsl runinfo ...).

This PR introduces restart (for one workflow engine) for jobs that are completed. So, if you, let's say, run a workflow that is a series of singlepoints, and you get timed out, the workflow will not recompute everything on restart for job(s) already completed: nice!

However, the job(s) that failed, and did not complete will rerun from the beginning (no DFT-side restart possible), potentially wasting resources. This problem is not entirely fixable by workflow engines (I believe) since quacc has something to say in directory management. It would be nice to fix this issue at some point; what do you think?

tomdemeyere avatar Nov 01 '24 21:11 tomdemeyere

However, the job(s) that failed, and did not complete will rerun from the beginning (no DFT-side restart possible), potentially wasting resources. This problem is not entirely fixable by workflow engines (I believe) since quacc has something to say in directory management. It would be nice to fix this issue at some point; what do you think?

Sort of, yes. Basically, even if the workflow engine supports native restarts (as many do), it will rerun the job from the beginning in a new directory. But if you were doing a geometry optimization and it timed out at 1000 steps, this means that you will have to start again rather than pick up where you left off and do step 1001. This is because the naive restart can be implemented entirely on the workflow engine side (where it belongs), but the workflow engine knows nothing about the science. It does not know how to restart a DFT calculation, which might involve shuttling files around for instance (e.g. in VASP, move CONTCAR to POSCAR) and can be dependent on the underlying method. What is the best way to solve this? It is not immediately clear to me...

Andrew-S-Rosen avatar Nov 01 '24 23:11 Andrew-S-Rosen

It is not immediately clear to me...

To me as well... I think I understood your various requirements and expectations on this matter through our various interaction about it and might come up with an idea at some point (I hope).

I often charge back on the restart aspect because I think this is an important matter that is often completely discarded from community codes. It would be nice to provide a solution so that users and groups can completely focus on the science without having to worry about things like this.

tomdemeyere avatar Nov 01 '24 23:11 tomdemeyere

To me as well... I think I understood your various requirements and expectations on this matter through our various interaction about it and might come up with an idea at some point (I hope).

Open to suggestions!

I often charge back on the restart aspect because I think this is an important matter that is often completely discarded from community codes. It would be nice to provide a solution so that users and groups can completely focus on the science without having to worry about things like this.

Agreed. It's a pretty annoying thing to have to deal with...

Andrew-S-Rosen avatar Nov 02 '24 01:11 Andrew-S-Rosen

Open to suggestions!

The most non-invasive way I see might be to perform a similar kind of hashing based on jobs parameters to replace the current naming from tmp-quacc-time-etc... to tmp-quacc-hash, so that same calculations always use the same temporary folder? Results dirs are then still unique. This has multiple good points:

  • The directory management is the only requirement being solved: restarts are then tackled where they belong, e.g., in calculators or custodian etc...
  • People don't have to modify their script for restarts, in fact they should not, they just run the same script again, and the workflow engine + quacc will make sure to restart exactly where needed.

The challenge will be for jobs that take non-trivial types in parameters (phonopy, ...).

tomdemeyere avatar Nov 02 '24 11:11 tomdemeyere

@Andrew-S-Rosen Here are the tests, including proper parsl restarting tests.

tomdemeyere avatar Nov 07 '24 14:11 tomdemeyere

Very nice! Thank you, @tomdemeyere!!

Andrew-S-Rosen avatar Nov 07 '24 15:11 Andrew-S-Rosen