Checkpointer not supported for coupled ClimaOcean workflows

Open taimoorsohail opened this issue 1 year ago • 4 comments

I am trying to add a checkpointer to one_degree_simulation.jl so that I can run longer simulations. My implementation is:

output_dir = "/g/data/v46/txs156/ClimaOcean.jl/examples/"
prefix = "one_deg_tripolar_checkpoint"

ocean.output_writers[:checkpoint] = Checkpointer(ocean.model;
                                                  schedule = TimeInterval(1days),
                                                  prefix = prefix,
                                                  cleanup = true,
                                                  dir = output_dir,
                                                  verbose = true,
                                                  overwrite_existing = true)

# Check whether a checkpoint file already exists; if not, run the initial spin-up

using Glob  # provides glob

pattern = prefix * "*"
checkpoint_file = glob(pattern, output_dir)

if !isempty(checkpoint_file)
    # If checkpoint exists, load the simulation state
    println("Checkpoint found, resuming the simulation from the checkpoint.")
    simulation.Δt = 20minutes
    simulation.stop_time = 360days

    run!(simulation, pickup=true)
else
    println("Checkpoint not found, spinning up simulation from scratch.")
    run!(simulation)
    simulation.Δt = 20minutes
    simulation.stop_time = 360days

    run!(simulation)
end

However, the run!(simulation, pickup=true) call fails with the error:

ERROR: LoadError: No checkpointers found: cannot pickup simulation!

This happens even though a checkpoint file named one_deg_tripolar_checkpoint_iteration1656.jld2 was saved successfully in output_dir (which is the same as "." in this case). The docs suggest that pickup=true should look in the directory for checkpoints, but it does not appear to do so.
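
As a side note, the existence check in the script matches any file that starts with the prefix. A minimal sketch of a stricter lookup is below, assuming checkpoint files follow the `<prefix>_iteration<N>.jld2` naming seen in the file reported here. `latest_checkpoint` is a hypothetical helper written for illustration, not an Oceananigans or ClimaOcean function.

```julia
# Hypothetical helper (not part of Oceananigans or ClimaOcean).
# Returns the path of the checkpoint with the largest iteration number,
# or `nothing` if no matching file exists in `dir`.
# Assumed naming scheme: <prefix>_iteration<N>.jld2
function latest_checkpoint(dir::AbstractString, prefix::AbstractString)
    best_iteration = -1
    best_path = nothing

    for name in readdir(dir)
        # Split the file name into "<something>_iteration<digits>.jld2"
        m = match(r"^(.*)_iteration(\d+)\.jld2$", name)

        # Skip files that do not match the scheme or belong to another prefix
        (m === nothing || m[1] != prefix) && continue

        iteration = parse(Int, m[2])
        if iteration > best_iteration
            best_iteration = iteration
            best_path = joinpath(dir, name)
        end
    end

    return best_path
end
```

With this helper, the `if !isempty(checkpoint_file)` test could become `latest_checkpoint(output_dir, prefix) !== nothing`, and the returned path identifies which file to restore from.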

taimoorsohail, Mar 05 '25 00:03

Hmm right, I don't think pickup=true works with ClimaOcean yet. What you can do now is manually restore the state. We have to think about how this should work. Somehow when we are picking up, the coupled model needs to know to look for checkpoints for all of its components?

glwagner, Mar 05 '25 00:03

I think we need to design a Checkpointer for the coupled simulation which checkpoints all component models at the same time.

Until then, one can use a Checkpointer for just one component (like the ocean), as in the script above. To restore from a checkpoint, you need to use JLD2 to open the checkpoint file and load the data by hand. I can prototype this workflow and come up with some sample code (if you figure it out @taimoorsohail, please post here!)
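
A rough sketch of the "load the data by hand" step, using only JLD2's `jldopen` and indexing by address. `load_array!` is a hypothetical helper, and the address passed to it is an assumption about the file layout that should be checked by inspecting `keys(file)` on a real checkpoint first.

```julia
using JLD2

# Hypothetical helper: copy one stored array from a JLD2 file into a
# preallocated destination array of the same size.
# `address` is the path of the dataset inside the file, e.g. "Group/name/data".
function load_array!(destination::AbstractArray, path::AbstractString, address::AbstractString)
    jldopen(path, "r") do file
        stored = file[address]  # throws if the address does not exist

        # Refuse to copy if the shapes differ (for example, halo sizes changed)
        size(stored) == size(destination) ||
            throw(DimensionMismatch("stored $(size(stored)) vs destination $(size(destination))"))

        copyto!(destination, stored)
    end

    return destination
end
```

For the ocean component, the destination would presumably be the parent array of a prognostic field, and the clock (time and iteration) would need to be restored as well; neither step is shown here. If your Oceananigans version provides `set!(model, filepath)` for checkpoint files, that may be a simpler route for a single component.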

glwagner, Mar 05 '25 00:03

Thanks!

taimoorsohail, Mar 05 '25 00:03

Noting that this seems similar to, or a duplicate of, #303.

navidcy, Mar 12 '25 21:03