earth2studio
Add checkpointing functionality to workflows
Add checkpointing functionality for workflow resume capability
Description
- Implemented restart functionality for workflows.
- While this adds significant functionality, much of the implementation follows consistent patterns across the three workflow types.
- This implementation does not implement the IO pattern that @NickGeneva mentioned in the issue thread, but this should be a good base to expand upon.
Core Changes:
- New Checkpoint Module (earth2studio/utils/checkpoint.py)
- save_checkpoint() - Saves simulation state, coordinates, RNG states
- load_checkpoint() - Restores saved state with device handling
- validate_checkpoint_compatibility() - Validates that checkpoint works with current model
- should_checkpoint() - Decision logic for when to save
- Enhanced Workflows (earth2studio/run.py)
- Added 3 optional parameters to all workflow functions:
- checkpoint_path - Where to save/load checkpoints
- checkpoint_interval - Save every N steps
- resume_from_step - Resume from specified step
- Dual execution paths:
- Normal: Uses existing iterators (exact same behavior)
- Resume: Manual time-stepping to account for mid-simulation restart
- Comprehensive Testing (test/utils/test_checkpoint.py)
- 25 tests covering save/load, validation, error handling
- 90% code coverage on the checkpoint utilities; this can be increased if needed
- Tested CPU/CUDA compatibility edge cases
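To make the interval logic and the resume path concrete, here is a minimal, self-contained sketch. It is not the code in this PR: `toy_model_step`, the `step_N.pkl` file naming, and the use of `pickle` in place of PyTorch's save/load are all assumptions made so the example runs with the standard library only. It only illustrates the idea that a run resumed from a saved step should end in the same state as an uninterrupted run.

```python
import pickle
import tempfile
from pathlib import Path


def should_checkpoint(step, checkpoint_path, checkpoint_interval):
    """Return True when a checkpoint is due after completing `step` (1-based)."""
    if checkpoint_path is None or checkpoint_interval is None:
        return False
    return step > 0 and step % checkpoint_interval == 0


def toy_model_step(state):
    """Hypothetical stand-in for one prognostic model step."""
    return [0.5 * x + 1.0 for x in state]


def run(x0, nsteps, checkpoint_path=None, checkpoint_interval=None,
        resume_from_step=None):
    state, start = list(x0), 0
    if resume_from_step is not None:
        # Resume path: restore the saved state and continue from that step.
        ckpt_file = Path(checkpoint_path) / f"step_{resume_from_step}.pkl"
        with open(ckpt_file, "rb") as f:
            ckpt = pickle.load(f)
        state, start = ckpt["state"], ckpt["step"]
    # Manual time-stepping, so the loop can start mid-simulation.
    for step in range(start + 1, nsteps + 1):
        state = toy_model_step(state)
        if should_checkpoint(step, checkpoint_path, checkpoint_interval):
            with open(Path(checkpoint_path) / f"step_{step}.pkl", "wb") as f:
                pickle.dump({"step": step, "state": state}, f)
    return state


with tempfile.TemporaryDirectory() as tmp:
    # Uninterrupted run that saves every 2 steps.
    full = run([1.0, 2.0], nsteps=6, checkpoint_path=tmp, checkpoint_interval=2)
    # Resume-only run: the initial condition is ignored in favor of the checkpoint.
    resumed = run([0.0, 0.0], nsteps=6, checkpoint_path=tmp, resume_from_step=4)
```

In this sketch the second call uses resume-only mode (no `checkpoint_interval`), which mirrors the point that the three parameters can be used independently.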
Notes
- No breaking changes: existing calls behave identically because all checkpoint parameters are optional
- Maintains reproducibility through RNG state preservation
- Used PyTorch's save/load for file-based checkpointing
- Prevents incompatible resumes via Coordinate System validation
- Parameters are independent and can be combined freely (save-only, resume-only, or both)
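As a rough illustration of the coordinate system check mentioned above, the sketch below compares the coordinates stored in a checkpoint with those the current model expects. The function name `validate_coords` and the use of plain lists for coordinate values are assumptions for this example; the actual `validate_checkpoint_compatibility()` in this PR may compare different fields.

```python
from collections import OrderedDict


def validate_coords(saved, expected):
    """Return a list of mismatch messages; an empty list means compatible."""
    problems = []
    # Dimension names must match, in the same order.
    if list(saved) != list(expected):
        problems.append(
            f"dimension order differs: {list(saved)} vs {list(expected)}"
        )
        return problems
    # Coordinate values along each dimension must match as well.
    for dim in expected:
        if list(saved[dim]) != list(expected[dim]):
            problems.append(f"values differ along '{dim}'")
    return problems


model_coords = OrderedDict(
    variable=["t2m", "u10m"], lat=[90.0, 0.0, -90.0], lon=[0.0, 180.0]
)
good_ckpt = OrderedDict(
    variable=["t2m", "u10m"], lat=[90.0, 0.0, -90.0], lon=[0.0, 180.0]
)
bad_ckpt = OrderedDict(
    variable=["t2m"], lat=[90.0, 0.0, -90.0], lon=[0.0, 180.0]
)

ok = validate_coords(good_ckpt, model_coords)
issues = validate_coords(bad_ckpt, model_coords)
```

A resume would proceed only when the returned list is empty; otherwise the messages can be surfaced in the raised error.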
This PR closes #446
Checklist
- [x] I am familiar with the Contributing Guidelines.
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes. (Added docstring comments)
- [ ] The CHANGELOG.md is up to date with these changes. (Will add in a new commit once changes have been reviewed)
- [x] An issue is linked to this pull request.
Dependencies
None - Uses existing PyTorch save/load functionality
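For reviewers less familiar with the PyTorch calls involved, the following sketch shows how `torch.save` / `torch.load` together with `torch.get_rng_state` / `torch.set_rng_state` can preserve reproducibility across a restart. The helper names `save_state` and `load_state` and the checkpoint layout are hypothetical and do not reflect the exact signatures in `earth2studio/utils/checkpoint.py`. Only the CPU generator is handled here; CUDA generators have analogous `torch.cuda.get_rng_state_all` / `torch.cuda.set_rng_state_all` calls.

```python
import os
import tempfile

import torch


def save_state(path, step, x):
    """Save the step index, the tensor state, and the CPU RNG state."""
    ckpt = {"step": step, "x": x.cpu(), "cpu_rng": torch.get_rng_state()}
    torch.save(ckpt, path)


def load_state(path, device="cpu"):
    """Restore the RNG state and return (step, x) with x moved to `device`."""
    # Load onto CPU first so a checkpoint written on a GPU machine
    # can still be read on a CPU-only machine.
    ckpt = torch.load(path, map_location="cpu")
    torch.set_rng_state(ckpt["cpu_rng"])
    return ckpt["step"], ckpt["x"].to(device)


torch.manual_seed(0)
x = torch.randn(3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "ckpt.pt")
    save_state(path, step=1, x=x)
    # The draw an uninterrupted run would make next.
    expected = torch.randn(3)
    # Scramble the generator, as a fresh process would.
    torch.manual_seed(123)
    step, x_restored = load_state(path)
    # After restoring, the next draw matches the uninterrupted run.
    resumed = torch.randn(3)
```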