[FEA]: Adding Restart Functionality?
Is this a new feature, an improvement, or a change to existing functionality?
New Feature
How would you describe the priority of this feature request
Low (would be nice)
Please provide a clear description of the problem you would like to solve.
For simulations that have a large memory footprint (e.g., high-resolution, large ensemble, and/or long lead time), it would be useful to be able to save and restart simulations. Some users may appreciate this since GPU memory tends to be scarce.
Hi, is anyone currently working on this issue? If not, I'd like to take it on.
My rough plan is to add optional checkpointing to the workflows (ensemble, diagnostic, deterministic) in `run.py`. The user-facing API could look like:
- `save_path: Optional[str]`: location to store checkpoints
- `resume_from: Optional[str]`: path to a checkpoint to load from
- `save_every: Optional[int]`: how often to save (in steps)
On each loop iteration, if `step % save_every == 0`, the workflow would write a checkpoint dict with:
- current step
- latest state (`x`, `coords`)
- model weights if applicable (for future training support)
If `resume_from` is provided, the workflow would load the checkpoint, resume at that step, and continue.
I'd start with just the `run.py` loops and expand to integrate with Zarr or other IO backends later if useful. A rough sketch of the save/load helpers I'm imagining (names and the checkpoint layout are placeholders, not an existing API in the codebase):
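```python
# Rough checkpointing sketch; helper names and the checkpoint layout are
# placeholders, not an existing API in this repo.
from pathlib import Path

import torch


def save_checkpoint(save_path: str, step: int, x: torch.Tensor, coords: dict) -> None:
    """Write the current step and prognostic state to disk."""
    path = Path(save_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    torch.save({"step": step, "x": x.detach().cpu(), "coords": coords}, path)


def load_checkpoint(resume_from: str, device: torch.device) -> tuple[int, torch.Tensor, dict]:
    """Load a saved state so the inference loop can continue from it."""
    # weights_only=False because coords may hold arbitrary (non-tensor) objects
    ckpt = torch.load(resume_from, map_location=device, weights_only=False)
    return ckpt["step"], ckpt["x"].to(device), ckpt["coords"]


# Inside the run.py loop (illustrative):
# if save_every is not None and step % save_every == 0:
#     save_checkpoint(save_path, step, x, coords)
```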
Does this approach align with what you had in mind?
@CldStlkr Thanks! The design will be helpful for VRAM-constrained users, especially if combined with more flexible I/O and VRAM management.
My use case involves periodic saving of simulation data and occasional restarting. My testing suggests frequent I/O (e.g., every time step) isn't good for performance. A better behavior may be to let the model run for some steps, check whether VRAM is running short, and then save the simulation output and restart data. The efficiency issue is secondary at this point, though.
I agree. The MVP I will work on will start off with user-defined step intervals at which state will be saved, but that can be changed to be based on remaining VRAM down the line. Same with IO, where I can integrate with Zarr, etc. in the future. For the VRAM-based trigger, something like the rough check below could work (the threshold and where it sits in the loop are just assumptions; `torch.cuda.mem_get_info` is the standard PyTorch call):
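```python
# Rough idea only: gate checkpoint saves on free GPU memory.
# torch.cuda.mem_get_info is a real PyTorch call; the threshold is arbitrary.
import torch


def vram_low(threshold_fraction: float = 0.1) -> bool:
    """Return True when free GPU memory falls below the given fraction."""
    if not torch.cuda.is_available():
        return False
    free, total = torch.cuda.mem_get_info()
    return free / total < threshold_fraction


# In the loop, instead of (or in addition to) a fixed interval:
# if vram_low() or (save_every is not None and step % save_every == 0):
#     save_checkpoint(save_path, step, x, coords)
```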
I'll take some time this weekend to work on it and update this issue thread as needed.
We are planning to add an example that shows how one could build their own workflow that checkpoints the model's state, allowing inference to be restarted from that state. It's in our tracking for this release cycle but remains to be fully prioritized, so no timeline estimate at the moment.
The general pattern we are thinking of is to have two IO outputs: one that writes the outputs needed for the forecast, and another that dumps the whole prognostic state, which can then be used to restart/continue as needed.
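A generic sketch of that two-output pattern (class names and signatures are illustrative only, not an existing interface):

```python
# Illustrative only: two separate writers fed from the same loop, one for the
# forecast product and one for the full prognostic state used for restarts.
import torch


class ForecastWriter:
    """Writes only the variables needed for the forecast output."""

    def __init__(self, variables: list[str]):
        self.variables = variables
        self.records: list[dict] = []

    def write(self, step: int, x: torch.Tensor, coords: dict) -> None:
        # In practice this would target Zarr/NetCDF; here we just keep records.
        self.records.append({"step": step, "data": x.cpu()})


class RestartWriter:
    """Dumps the entire prognostic state so inference can be resumed."""

    def __init__(self, path: str):
        self.path = path

    def write(self, step: int, x: torch.Tensor, coords: dict) -> None:
        torch.save({"step": step, "x": x.cpu(), "coords": coords}, self.path)


# Both writers receive the same state inside the inference loop:
# forecast_io.write(step, x, coords)
# restart_io.write(step, x, coords)  # e.g. only every N steps
```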