earth2studio

Add checkpointing functionality to workflows

Open CldStlkr opened this issue 3 months ago • 0 comments

Add checkpointing functionality for workflow resume capability

Description

  • Implemented restart functionality for workflows.
  • While this adds significant functionality, much of the implementation follows consistent patterns across the three workflow types.
  • This does not yet implement the IO pattern that @NickGeneva mentioned in the issue thread, but it should be a good base to expand upon.

Core Changes:

  1. New Checkpoint Module (earth2studio/utils/checkpoint.py)
    • save_checkpoint() - Saves simulation state, coordinates, RNG states
    • load_checkpoint() - Restores saved state with device handling
    • validate_checkpoint_compatibility() - Validates that checkpoint works with current model
    • should_checkpoint() - Decision logic for when to save
  2. Enhanced Workflows (earth2studio/run.py)
    • Added 3 optional parameters to all workflow functions:
      • checkpoint_path - Where to save/load checkpoints
      • checkpoint_interval - Save every N steps
      • resume_from_step - Resume from specified step
    • Dual execution paths:
      • Normal: Uses existing iterators (exact same behavior)
      • Resume: Manual time-stepping to account for mid-simulation restart
  3. Comprehensive Testing (test/utils/test_checkpoint.py)
    • 25 tests covering save/load, validation, error handling
    • 90% code coverage on the checkpoint utilities; this can be increased if needed
    • Tested CPU/CUDA compatibility edge cases
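
To illustrate the save decision and the resume path described above, here is a minimal sketch in plain Python. It is not the code in this PR: `run_steps`, the `advance` callable, and the in-memory `store` dict (standing in for checkpoint files) are hypothetical, and the signature of `should_checkpoint` is an assumption based only on the parameter names listed above.

```python
from typing import Callable, Dict, Optional


def should_checkpoint(
    step: int,
    checkpoint_path: Optional[str],
    checkpoint_interval: Optional[int],
) -> bool:
    """Assumed decision rule: save every `checkpoint_interval` steps when a path is set."""
    if checkpoint_path is None or checkpoint_interval is None or checkpoint_interval <= 0:
        return False
    return step > 0 and step % checkpoint_interval == 0


def run_steps(
    advance: Callable[[int], int],
    state: int,
    nsteps: int,
    checkpoint_path: Optional[str] = None,
    checkpoint_interval: Optional[int] = None,
    resume_from_step: Optional[int] = None,
    store: Optional[Dict[int, int]] = None,
) -> int:
    """Hypothetical manual time-stepping loop that can start mid-simulation."""
    # A fresh run starts at step 0; a resumed run continues after the saved step.
    start = 0 if resume_from_step is None else resume_from_step
    for step in range(start + 1, nsteps + 1):
        state = advance(state)
        if store is not None and should_checkpoint(step, checkpoint_path, checkpoint_interval):
            store[step] = state  # stand-in for writing a checkpoint file
    return state


def toy_model(x: int) -> int:
    """Toy deterministic 'model' step used only for this illustration."""
    return 2 * x + 1


# Full run that saves every 4 steps.
store: Dict[int, int] = {}
full = run_steps(toy_model, 0, 10, checkpoint_path="ckpt", checkpoint_interval=4, store=store)

# Resume-only run that restarts from the state saved at step 4.
resumed = run_steps(toy_model, store[4], 10, resume_from_step=4)
```

In this toy setup the resumed run reaches the same final state as the uninterrupted one, which is the property the resume path is meant to preserve.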

Notes

  • Zero breaking changes: existing calls behave identically because the checkpoint parameters are optional
  • Maintains reproducibility through RNG state preservation
  • Used PyTorch's save/load for file-based checkpointing
  • Prevents incompatible resumes via coordinate system validation
  • Parameters are independent and can be used flexibly (save-only, resume-only, or both)
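
As a rough illustration of the RNG-preservation note, the sketch below saves and restores the CPU RNG state alongside a tensor using `torch.save` / `torch.load`. The function names and the payload keys are hypothetical and are not the implementation in `earth2studio/utils/checkpoint.py`. Only the CPU generator is handled here; CUDA generators would need the analogous `torch.cuda` RNG calls.

```python
import os
import tempfile

import torch


def save_checkpoint_sketch(path: str, step: int, state: torch.Tensor) -> None:
    """Hypothetical helper: write step, state, and CPU RNG state to one file."""
    payload = {
        "step": step,
        "state": state.detach().cpu(),            # store on CPU so any device can load it
        "torch_rng_state": torch.get_rng_state(),  # CPU generator state (a ByteTensor)
    }
    torch.save(payload, path)


def load_checkpoint_sketch(path: str, device: str = "cpu"):
    """Hypothetical helper: restore RNG state and move the saved tensor to `device`."""
    payload = torch.load(path, map_location="cpu")
    torch.set_rng_state(payload["torch_rng_state"])
    return payload["step"], payload["state"].to(device)


with tempfile.TemporaryDirectory() as tmp:
    ckpt_file = os.path.join(tmp, "ckpt.pt")

    torch.manual_seed(0)
    state = torch.randn(3)
    save_checkpoint_sketch(ckpt_file, 5, state)

    # Random draw an uninterrupted run would make right after the save point.
    expected_noise = torch.randn(3)

    # Restoring the checkpoint rewinds the generator to the save point,
    # so the next draw repeats the one above.
    step, restored = load_checkpoint_sketch(ckpt_file)
    resumed_noise = torch.randn(3)
```

Saving the generator state together with the simulation state is what lets a resumed run draw the same random numbers as an uninterrupted one.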

This PR closes #446

Checklist

  • [x] I am familiar with the Contributing Guidelines.
  • [x] New or existing tests cover these changes.
  • [x] The documentation is up to date with these changes. (Added docstring comments)
  • [ ] The CHANGELOG.md is up to date with these changes. (Will add in a new commit once changes have been reviewed)
  • [x] An issue is linked to this pull request.

Dependencies

None - Uses existing PyTorch save/load functionality

CldStlkr, Sep 15 '25 00:09