openfe
Checkpoint restoration
Dear OpenFE team developers,
Thanks for the great work on OpenFE. I have noticed that, while the package saves checkpoints by default, the documentation gives no indication of how to resume an interrupted job from a checkpoint, and I could not find instructions in the code either. Am I missing something?
If I am correct that there is no standard checkpoint restoration system, I suspect it wouldn't be too hard to implement one. The samplers could use the from_storage methods of the corresponding OpenMMTools samplers when a checkpoint is found. To make it more seamless, one could also modify the hash generation so that hashes are deterministic (i.e. so that rerunning the quickrun command generates identical shared/scratch folders). Are there any plans to do something like this in the future, or can I have a go?
Best wishes, Carlos
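For illustration, the idea above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not how OpenFE works or plans to work: `stable_run_dir` and `load_or_create_sampler` are hypothetical helpers, the JSON-based hash is a stand-in for OpenFE's real (more complex) key generation, and the only library call relied on is the `from_storage` classmethod of the OpenMMTools multistate samplers.

```python
import hashlib
import json
from pathlib import Path


def stable_run_dir(base: Path, settings: dict) -> Path:
    """Hypothetical helper: derive a directory name from the run inputs.

    Assumption: the inputs can be serialised to JSON. Sorting the keys
    makes the digest independent of dictionary ordering, so rerunning
    with the same inputs yields the same folder.
    """
    payload = json.dumps(settings, sort_keys=True).encode()
    digest = hashlib.sha256(payload).hexdigest()[:16]
    return base / f"run_{digest}"


def load_or_create_sampler(storage_path: Path, create):
    """Hypothetical helper: resume if a storage file exists, else start fresh.

    `create` is a zero-argument callable that builds a new sampler.
    """
    if storage_path.exists():
        # Imported lazily so the rest of the sketch runs without OpenMMTools.
        from openmmtools.multistate import MultiStateSampler

        # from_storage restores the sampler state from its storage file.
        return MultiStateSampler.from_storage(str(storage_path))
    return create()
```

A restart would then call `load_or_create_sampler(stable_run_dir(base, settings) / "simulation.nc", create)`, where the file name `simulation.nc` is again only an assumed example.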
@couteiral
Restarts are something the OpenFE team is looking at. Unfortunately, the solution is a lot more complicated than this. Part of the issue is that "hash generation" isn't a straightforward thing; it relies (for good reasons) on a complex mechanism.
The team is currently undergoing a few changes, but we are hoping to fix this a bit later in the year.
Hi @IAlibay,
Thanks for the quick reply. Sounds good, glad to hear this is in the roadmap -- is there any way I can help (e.g. implementing the restarting logic in the ProtocolUnit.run() methods)?
@couteiral the offer is appreciated and I will need to check in with the team. I do suspect that we might have to say "no for now": this will need to be done in a very specific manner (especially as we are looking to move things to FEFlow), and I'm not sure we have the staff capacity right now. This will likely change in about a month, though.