Making it easier to restart interrupted runs
This issue is to track/discuss the development of functionality that would allow easier restarting of the dynesty sampler in the case of a crash (e.g. a timeout in an HPC environment); see e.g. #371. Some documentation explaining how to do that would also be useful.
Possible ways of dealing with this include:
- Implement some sort of checkpointing that would allow restarts even when sampling with a plain run_nested() call
- Still require iterator-based sampling for restarts, but make it easier to restart, e.g. implement a set_pool() method that would correctly set the pool without having to manually do sampler.pool = X and sampler.loglikelihood.pool = X
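To illustrate the second option, here is a minimal sketch of what a `set_pool()` helper could look like. Note that `DummySampler` and `_LogLikelihood` are hypothetical stand-ins for the real dynesty objects, and `set_pool()` itself is only the proposal above, not an existing dynesty API.

```python
class _LogLikelihood:
    """Hypothetical stand-in for dynesty's wrapped loglikelihood object."""
    def __init__(self):
        self.pool = None


class DummySampler:
    """Hypothetical stand-in for a restored (unpickled) dynesty sampler."""
    def __init__(self):
        self.pool = None
        self.loglikelihood = _LogLikelihood()

    def set_pool(self, pool):
        # Proposed convenience method: attach the pool in both places
        # that currently have to be set by hand after unpickling.
        self.pool = pool
        self.loglikelihood.pool = pool


if __name__ == "__main__":
    sampler = DummySampler()
    pool = object()  # any pool-like object (e.g. multiprocessing.Pool)
    sampler.set_pool(pool)
    assert sampler.pool is sampler.loglikelihood.pool is pool
```

The point of the helper is simply that a user who unpickles a sampler after a crash has one documented call to make, instead of needing to know both internal attributes.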
This would be a great feature. Are there existing ways to continue an interrupted run (due to an HPC timeout) for jobs without a pool? Specifically in the case where I've used the new HDF5 feature to save the history... thanks!
The simplest way is to use the generator interface (https://dynesty.readthedocs.io/en/latest/quickstart.html?highlight=generator#running-externally) while periodically pickling the sampler. The HDF5 history won't help much here, though, as it only saves the function evaluations, not the state of the sampler.
Ahhhh got it. Okay, I will periodically pickle. Thank you!
@runburg @segasai FWIW, in PyCBC we checkpoint dynesty by writing the pickle data to our hdf files; see line 371 here, which eventually calls this function to dump the pickle data to the hdf file. We use this in an HPC environment all the time. We're not using dynesty's hdf history functionality, as our code predates it (and we have our own format for hdf files anyway), but I would think you could do something similar with your files.
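The essence of that approach is treating the pickled sampler as an opaque byte blob that can live inside any container file. A minimal sketch of the pattern (all names here are illustrative, not PyCBC's actual API; the h5py call is shown only as a comment so the runnable part stays stdlib-only):

```python
import pickle


def sampler_to_blob(sampler):
    """Serialize any picklable sampler state to an opaque byte blob.

    With h5py installed, the blob could then be stored in an HDF5 file,
    e.g. (illustrative only):
        f.create_dataset("sampler_pickle", data=numpy.void(blob))
    """
    return pickle.dumps(sampler)


def blob_to_sampler(blob):
    """Inverse: rebuild the sampler from the stored byte blob."""
    return pickle.loads(blob)


# Toy stand-in for a sampler's state, to show the round trip.
state = {"niter": 1234, "live_points": [0.1, 0.2, 0.3]}
blob = sampler_to_blob(state)
assert blob_to_sampler(blob) == state
```

Storing the blob alongside the run's other outputs keeps checkpointing in a single file, which is convenient on clusters where scratch space is cleaned between jobs.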
For people watching this issue, I have a test implementation of resuming interrupted runs in PR #386. Any feedback on the implementation/interface would be helpful.
PR #386 has been merged, so this issue is resolved.