dynesty Making it easier to restart interrupted runs

This issue is to track and discuss the development of functionality that would make it easier to restart the dynesty sampler after a crash (e.g. a timeout in an HPC environment); see e.g. #371. Some docs explaining how to do that may also be useful.

Possible ways of dealing with this include

Implement some sort of checkpointing that would allow restarts even when sampling with a simple run_nested
Still require iterator-based sampling for restarts, but make restarting easier, e.g. by implementing a set_pool() method that would correctly set the pool without having to manually assign sampler.pool=X and sampler.loglikelihood.pool=X
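
To make the second option concrete, here is a minimal sketch of what such a helper might do. It only mirrors the two manual assignments mentioned above; `reattach_pool` is a hypothetical name, not an existing dynesty function, and a real sampler may hold other pool-dependent attributes that this sketch does not touch. Stand-in objects are used so the example runs without dynesty installed.

```python
from types import SimpleNamespace


def reattach_pool(sampler, pool):
    """Point a restored sampler at a new pool (hypothetical helper).

    Performs only the two assignments described in this issue; any other
    pool-dependent attributes of a real sampler are not handled here.
    """
    sampler.pool = pool
    sampler.loglikelihood.pool = pool
    return sampler


# Stand-in objects (assumption: they only mimic the two attributes above).
restored = SimpleNamespace(pool=None, loglikelihood=SimpleNamespace(pool=None))
new_pool = object()  # in practice this would be e.g. a multiprocessing.Pool
reattach_pool(restored, new_pool)
```

After unpickling a sampler in a new job, a call like this would replace the stale pool reference before sampling continues.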

May 20 '22 10:05 segasai

This would be a great feature. Are there existing ways to continue an interrupted run (due to HPC timeout) on jobs without a pool? Specifically in the case where I've used the new HDF5 feature to save history...thanks!

May 23 '22 23:05 runburg

The simplest way is to use the generator interface (https://dynesty.readthedocs.io/en/latest/quickstart.html?highlight=generator#running-externally) and pickle the sampler periodically. The HDF5 history won't help much here, though, since it only saves the function evaluations, not the state of the sampler.
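
A rough sketch of the periodic-pickling pattern follows. It is generic rather than dynesty-specific: `save_checkpoint`, `load_or_create`, `run_with_checkpoints` and `toy_steps` are hypothetical names, and the dict-based toy state only stands in for a real sampler. With dynesty, the assumption is that `state` would be the sampler object and `steps` would be the generator returned by `sampler.sample()`. The checkpoint is written to a temporary file and then renamed, so a job killed mid-write does not leave a truncated pickle behind.

```python
import os
import pickle
import tempfile


def save_checkpoint(obj, path):
    """Pickle obj to path atomically: write a temp file, then rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
        os.replace(tmp_path, path)  # atomic when on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_or_create(path, factory):
    """Return the object pickled at path, or a fresh one from factory()."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return factory()


def run_with_checkpoints(state, steps, path, every=100):
    """Consume `steps` (an iterator that mutates `state`), pickling `state`
    every `every` iterations and once more when the iterator is exhausted."""
    for n, _ in enumerate(steps, start=1):
        if n % every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)
    return state


def toy_steps(state, maxiter):
    """Hypothetical stand-in for a sampler's iteration generator."""
    while state["it"] < maxiter:
        state["it"] += 1
        yield state["it"]
```

On restart, the job calls `load_or_create` first, so it picks up the last saved state if a checkpoint exists and starts fresh otherwise. At most `every` iterations are lost to a timeout.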

May 23 '22 23:05 segasai

Ahhhh got it. Okay I will periodically pickle thank you!

May 23 '22 23:05 runburg

@runburg @segasai FWIW, in PyCBC we checkpoint dynesty by writing the pickle data to our hdf files; see line 371 here, which eventually calls this function to dump the pickle data to the hdf file. We use this in an HPC environment all the time. We're not using dynesty's hdf history functionality as this predates that (and we have our own format for hdf files anyway), but I would think you could do something similar to those files.

May 24 '22 15:05 cdcapano

For people following this issue: I have a test implementation of resuming interrupted runs in PR #386. Any feedback on the implementation/interface would be helpful.

Aug 21 '22 12:08 segasai

PR #386 has been merged, so this issue is resolved.

Sep 07 '22 10:09 segasai