Making it easier to restart interrupted runs

Open segasai opened this issue 3 years ago • 4 comments

This issue is to track and discuss the development of functionality that would make it easier to restart the dynesty sampler after a crash or a timeout in an HPC environment (see e.g. #371). Some documentation explaining how to do that would also be useful.

Possible ways of dealing with this include

  • Implement some sort of checkpointing that would allow restarts even when sampling with a simple run_nested
  • Still require iterator-based sampling for restarts, but make restarting easier, e.g. by implementing a set_pool() method that correctly sets the pool without having to manually set sampler.pool=X and sampler.loglikelihood.pool=X
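
As a rough illustration of the second option, the manual fix-up could be wrapped in a small helper. This is only a sketch: `set_pool` here is the hypothetical helper proposed in this issue, written as a free function rather than an existing dynesty method, and the stand-in objects mimic nothing beyond the `sampler.pool` and `sampler.loglikelihood.pool` attributes mentioned above.

```python
from types import SimpleNamespace


def set_pool(sampler, pool):
    """Attach `pool` to a restored sampler (hypothetical helper).

    Assumes the sampler exposes `sampler.pool` and
    `sampler.loglikelihood.pool`, as described in this issue.
    """
    sampler.pool = pool
    sampler.loglikelihood.pool = pool
    return sampler


# Stand-in objects so the sketch runs without dynesty or a real pool.
restored = SimpleNamespace(pool=None, loglikelihood=SimpleNamespace(pool=None))
new_pool = object()  # placeholder for e.g. a multiprocessing.Pool
set_pool(restored, new_pool)
```

Keeping both assignments in one place means a restart script cannot update one attribute and forget the other.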

segasai avatar May 20 '22 10:05 segasai

This would be a great feature. Are there existing ways to continue an interrupted run (due to HPC timeout) on jobs without a pool? Specifically in the case where I've used the new HDF5 feature to save history...thanks!

runburg avatar May 23 '22 23:05 runburg

The simplest way is to use a generator (https://dynesty.readthedocs.io/en/latest/quickstart.html?highlight=generator#running-externally) and pickle the sampler periodically. The HDF5 history won't help much here, though, since it only saves the function evaluations and not the state of the sampler.
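
A minimal sketch of that pattern, using only the standard library: `ToySampler` is a made-up stand-in for an iterator-based sampler (it is not dynesty code), and `save_checkpoint`, `run`, `every` and `stop_at` are illustrative names. The point is that the sampler object, not the generator, is what gets pickled, and that the checkpoint is written atomically so a kill in the middle of a write cannot corrupt it.

```python
import os
import pickle
import random
import tempfile


class ToySampler:
    """Stand-in for an iterator-based sampler; all state lives on the object."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.niter = 0
        self.total = 0.0

    def sample(self, maxiter):
        """Yield after every iteration so the caller can checkpoint."""
        while self.niter < maxiter:
            self.total += self.rng.random()
            self.niter += 1
            yield self.niter, self.total


def save_checkpoint(sampler, path):
    """Pickle to a temporary file, then rename it over the old checkpoint,
    so a crash mid-write never leaves a truncated file behind."""
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    with os.fdopen(fd, "wb") as fh:
        pickle.dump(sampler, fh)
    os.replace(tmp, path)


def run(path, maxiter, every=10, stop_at=None):
    """Resume from `path` if it exists; checkpoint every `every` iterations.

    `stop_at` simulates the job being killed at that iteration.
    """
    if os.path.exists(path):
        with open(path, "rb") as fh:
            sampler = pickle.load(fh)
    else:
        sampler = ToySampler()
    for niter, _ in sampler.sample(maxiter):
        if niter % every == 0:
            save_checkpoint(sampler, path)
        if stop_at is not None and niter == stop_at:
            return None  # simulated timeout; work since the last checkpoint is lost
    return sampler.total
```

Calling `run("state.pkl", 100, stop_at=37)` and then `run("state.pkl", 100)` picks up from the checkpoint written at iteration 30, because the random generator state is pickled along with everything else. With a real sampler that uses a pool, the pool would need to be reattached after unpickling.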

segasai avatar May 23 '22 23:05 segasai

Ahhhh got it. Okay I will periodically pickle thank you!

runburg avatar May 23 '22 23:05 runburg

@runburg @segasai FWIW, in PyCBC we checkpoint dynesty by writing the pickle data to our hdf files; see line 371 here, which eventually calls this function to dump the pickle data to the hdf file. We use this in an HPC environment all the time. We're not using dynesty's hdf history functionality as this predates that (and we have our own format for hdf files anyway), but I would think you could do something similar to those files.

cdcapano avatar May 24 '22 15:05 cdcapano

For people following this issue: I have a test implementation of resuming interrupted runs in PR #386. Any feedback on the implementation/interface would be helpful.

segasai avatar Aug 21 '22 12:08 segasai

PR #386 has been merged, so this issue is resolved.

segasai avatar Sep 07 '22 10:09 segasai