Checkpoint enhancements

Open pbehne opened this issue 10 months ago • 17 comments

This PR provides enhancements to the existing automatic wall time checkpoint system. Specifically,

  • The Outputs/wall_time_checkpoint shortcut syntax is introduced as an easy way to turn off automatic checkpoint output.
  • An informative block is output to the MOOSE header describing the configuration of the checkpoint system for each simulation.
  • A new integration test is added to verify that subapps are not auto-checkpointed if not requested.
  • A new regression test is added that sets the checkpoint wall time interval to a value smaller than the time required to write a checkpoint to verify that no issues are encountered.
  • The default checkpoint wall time interval is adjusted from 10 minutes to 1 hour based on community feedback.
  • The checkpoint format is changed from binary (.cpr) to compressed ascii (.cpa.gz) so it will run on systems without XDR support.
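The wall time interval behavior, including the case exercised by the new regression test where the interval is shorter than the time needed to write a checkpoint, can be sketched as follows. This is a minimal illustration, not MOOSE code: the class `WallTimeTrigger`, its methods, and the choice to restart the interval only after a write finishes are all assumptions made for the example.

```python
import time


class WallTimeTrigger:
    """Decide when a wall time checkpoint is due (hypothetical helper, not MOOSE code)."""

    def __init__(self, interval_s, clock=time.monotonic):
        self.interval_s = interval_s
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self.last_write = clock()

    def due(self):
        # A checkpoint is due once interval_s seconds have passed since the last write.
        return self.clock() - self.last_write >= self.interval_s

    def mark_written(self):
        # Assumption: the interval restarts when the write completes, so a write
        # slower than the interval cannot build up a backlog of overdue checkpoints.
        self.last_write = self.clock()


# Simulated clock: a 0.02 s interval with a write that takes 0.45 s.
now = [0.0]
trigger = WallTimeTrigger(0.02, clock=lambda: now[0])
due_at_start = trigger.due()
now[0] = 0.05
due_after_interval = trigger.due()
now[0] = 0.50  # the slow write finishes here
trigger.mark_written()
due_right_after_write = trigger.due()
now[0] = 0.53
due_again = trigger.due()
```

With these assumptions, a very short interval simply means a checkpoint is written at every opportunity rather than the system falling behind.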

Closes #27240, #27205, #26682

Related PRs to update apps:

  • Griffin: https://github.inl.gov/ncrc/griffin/pull/1876
  • Bison: https://github.inl.gov/ncrc/bison/pull/5933

pbehne avatar Apr 03 '24 02:04 pbehne

Job Documentation on 6dcfe1f wanted to post the following:

View the site here

This comment will be updated on new commits.

moosebuild avatar Apr 03 '24 04:04 moosebuild

Job Coverage on 6dcfe1f wanted to post the following:

Framework coverage

          68a347     #27268 6dcfe1
          Total      Total      +/-       New
Rate      85.06%     85.06%     +0.00%    96.88%
Hits      104403     104423     +20       93
Misses    18335      18334      -1        3

Diff coverage report

Full coverage report

Modules coverage

Coverage did not change

Full coverage reports

Reports

This comment will be updated on new commits.

moosebuild avatar Apr 03 '24 18:04 moosebuild

Tag when needed for a review, silencing until then

loganharbour avatar Apr 11 '24 17:04 loganharbour

@loganharbour, ready for review. I'll work on fixes for the failing tests in Bison and Griffin in the meantime. Should be trivial.

pbehne avatar Apr 12 '24 19:04 pbehne

Not going all the way to .cpa.gz yet?

roystgnr avatar Apr 12 '24 19:04 roystgnr

> Not going all the way to .cpa.gz yet?

please clarify?

loganharbour avatar Apr 12 '24 19:04 loganharbour

The missing context here was some discussion on Slack - https://moosedevelopers.slack.com/archives/C01054VRUEM/p1712693072807749?thread_ts=1711666351.143939&cid=C01054VRUEM

The summary is that IMHO there's no reason to prefer binary .cpr over ASCII+gzip'ed .cpa.gz - the latter is usually a little smaller, it's more compatible, and it's easier to debug with. But .cpr vs .cpa has tradeoffs in both directions; the .cpa files will generally be significantly larger. So right now I'm wondering if this PR (which on first skim is just switching to .cpa?) is one step in the .cpa.gz direction, or if there's some reason to avoid gzip that I hadn't thought about, or (most seriously) if Patrick tried gzip but something didn't work correctly.

roystgnr avatar Apr 12 '24 19:04 roystgnr

Are all wall time checkpoints kept? When I do manual checkpointing I usually only keep the last 2 checkpoints. For long-running simulations the checkpoints can occupy massive amounts of disk space, especially if we move to uncompressed ASCII checkpoints. I would not want my simulations to fail because they run out of disk space.

dschwen avatar Apr 12 '24 20:04 dschwen

By default, only the last 2 wall time checkpoints are kept, just like the default for other checkpoints. This can be changed through the same num_files parameter.
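A minimal sketch of the "keep only the last `num_files` checkpoints" policy is shown below. Only the `num_files` name and its default of 2 come from the thread; the function `write_checkpoint`, the file naming, and the empty placeholder files are hypothetical and do not reflect how MOOSE lays out checkpoint data on disk.

```python
import tempfile
from collections import deque
from pathlib import Path


def write_checkpoint(directory, step, kept, num_files=2):
    """Write a placeholder checkpoint, then delete the oldest ones beyond num_files."""
    path = Path(directory) / f"{step:04d}.cpa.gz"  # hypothetical naming scheme
    path.write_bytes(b"")  # stand-in for the real checkpoint data
    kept.append(path)
    while len(kept) > num_files:
        kept.popleft().unlink()  # remove the oldest checkpoint from disk
    return path


with tempfile.TemporaryDirectory() as tmp:
    kept = deque()
    for step in range(1, 6):
        write_checkpoint(tmp, step, kept)
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(remaining)  # ['0004.cpa.gz', '0005.cpa.gz']
```

Disk usage is bounded by `num_files` checkpoints regardless of how long the simulation runs.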

pbehne avatar Apr 15 '24 14:04 pbehne

> Not going all the way to .cpa.gz yet?

I misunderstood our discussion on Slack: I thought that compression and decompression are performed automatically when using the cpa format. What library should I use to do this, zlib?

pbehne avatar Apr 15 '24 15:04 pbehne

It's "automatic" but based on filename. Ask for a .cpa extension and you get plain ASCII; ask for .cpa.gz and you get gzipped ASCII.
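The idea of choosing the on-disk format from the filename alone can be illustrated with a short Python analogue. This is not the library's implementation; the helper `open_checkpoint` and the file names are hypothetical, and Python's standard `gzip` module stands in for whatever compression layer is actually used.

```python
import gzip
import os
import tempfile


def open_checkpoint(path, mode="rt"):
    """Open gzipped ASCII if the name ends in .gz, plain ASCII otherwise."""
    if path.endswith(".gz"):
        return gzip.open(path, mode, encoding="ascii")
    return open(path, mode, encoding="ascii")


with tempfile.TemporaryDirectory() as tmp:
    results = {}
    for name in ("0001.cpa", "0001.cpa.gz"):
        path = os.path.join(tmp, name)
        with open_checkpoint(path, "wt") as f:
            f.write("checkpoint data\n")
        with open(path, "rb") as raw:
            # Gzip streams start with the magic bytes 0x1f 0x8b.
            is_gzip = raw.read(2) == b"\x1f\x8b"
        with open_checkpoint(path) as f:
            results[name] = (is_gzip, f.read())
```

The caller writes and reads the same text either way; only the extension decides whether the bytes on disk are compressed.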

roystgnr avatar Apr 15 '24 15:04 roystgnr

> It's "automatic" but based on filename. Ask for a .cpa extension and you get plain ASCII; ask for .cpa.gz and you get gzipped ASCII.

@pbehne was this change made? This is pretty significant and I'm not sure it'll work 100% with all of the automated file searching we do when specifying cp directories

loganharbour avatar Apr 22 '24 15:04 loganharbour

> > It's "automatic" but based on filename. Ask for a .cpa extension and you get plain ASCII; ask for .cpa.gz and you get gzipped ASCII.
>
> @pbehne was this change made? This is pretty significant and I'm not sure it'll work 100% with all of the automated file searching we do when specifying cp directories

Yes, the change has been made.

pbehne avatar Apr 22 '24 15:04 pbehne

Can you give examples of the header output? It's a bit more difficult to follow

loganharbour avatar Apr 24 '24 20:04 loganharbour

> Can you give examples of the header output? It's a bit more difficult to follow

See attached for more examples.

outputs/checkpoint.interval/test_files: Current Time: Wed Apr 24 20:44:09 2024
outputs/checkpoint.interval/test_files: Executable Timestamp: Tue Apr 23 10:30:46 2024
outputs/checkpoint.interval/test_files:
outputs/checkpoint.interval/test_files: Checkpoint:
outputs/checkpoint.interval/test_files: Wall Time Interval: Every 3600.000000 s
outputs/checkpoint.interval/test_files: User Checkpoint: Outputs/out
outputs/checkpoint.interval/test_files: # Checkpoints Kept: 2
outputs/checkpoint.interval/test_files: Execute On: TIMESTEP_END
outputs/checkpoint.interval/test_files:
outputs/checkpoint.interval/test_files: Parallelism:
outputs/checkpoint.interval/test_files: Num Processors: 1
outputs/checkpoint.interval/test_files: Num Threads: 1
outputs/checkpoint.interval/test_files:
outputs/checkpoint.interval/test_files: Mesh:
outputs/checkpoint.interval/test_files: Parallel Type: replicated
outputs/checkpoint.interval/test_files: Mesh Dimension: 2
outputs/checkpoint.interval/test_files: Spatial Dimension: 2
outputs/checkpoint.interval/test_files: Nodes: 121
outputs/checkpoint.interval/test_files: Elems: 100
outputs/checkpoint.interval/test_files: Num Subdomains: 1

outputs/checkpoint.default/recover: Executable Timestamp: Tue Apr 23 10:30:46 2024
outputs/checkpoint.default/recover:
outputs/checkpoint.default/recover: Checkpoint:
outputs/checkpoint.default/recover: Wall Time Interval: Every 3600.000000 s
outputs/checkpoint.default/recover: User Checkpoint: Disabled
outputs/checkpoint.default/recover: # Checkpoints Kept: 2
outputs/checkpoint.default/recover: Execute On: TIMESTEP_END
outputs/checkpoint.default/recover:
outputs/checkpoint.default/recover: Parallelism:
outputs/checkpoint.default/recover: Num Processors: 1
outputs/checkpoint.default/recover: Num Threads: 1

outputs/checkpoint.default/wall_time_interval:
outputs/checkpoint.default/wall_time_interval: Checkpoint:
outputs/checkpoint.default/wall_time_interval: Wall Time Interval: Every 0.020000 s
outputs/checkpoint.default/wall_time_interval: User Checkpoint: Disabled
outputs/checkpoint.default/wall_time_interval: # Checkpoints Kept: 2
outputs/checkpoint.default/wall_time_interval: Execute On: TIMESTEP_END
outputs/checkpoint.default/wall_time_interval:
outputs/checkpoint.default/wall_time_interval: Parallelism:

outputs/checkpoint.default/wall_time_interval_disabled: Current Time: Wed Apr 24 20:44:10 2024
outputs/checkpoint.default/wall_time_interval_disabled: Executable Timestamp: Tue Apr 23 10:30:46 2024
outputs/checkpoint.default/wall_time_interval_disabled:
outputs/checkpoint.default/wall_time_interval_disabled: Checkpoint:
outputs/checkpoint.default/wall_time_interval_disabled: Wall Time Interval: Disabled
outputs/checkpoint.default/wall_time_interval_disabled: User Checkpoint: Disabled
outputs/checkpoint.default/wall_time_interval_disabled:
outputs/checkpoint.default/wall_time_interval_disabled: Parallelism:
outputs/checkpoint.default/wall_time_interval_disabled: Num Processors: 1
outputs/checkpoint.default/wall_time_interval_disabled: Num Threads: 1

output.txt

pbehne avatar Apr 25 '24 02:04 pbehne

Copying and pasting into the code blocks above really messes up the indentation. I recommend downloading output.txt and running cat on it.

pbehne avatar Apr 25 '24 02:04 pbehne

@loganharbour, I think we are good now. Bison is passing at https://github.inl.gov/ncrc/bison/pull/5933 and I'm currently working on Griffin.

pbehne avatar May 02 '24 15:05 pbehne

Based on my changes to address @roystgnr's review, I pushed the latest changes to my BISON and griffin PRs. These are now failing. I'll look more into them tomorrow to see what implications find vs rfind have.
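The thread does not say which string is being searched, so the following is only an illustration of why the choice matters, assuming the search is for the "." that starts a file extension. Python's `str.find` and `str.rfind` are used here as stand-ins for the first-match and last-match searches; the path is hypothetical.

```python
name = "out_cp/0010.cpa.gz"  # hypothetical checkpoint file name

first_dot = name.find(".")   # index of the first "."
last_dot = name.rfind(".")   # index of the last "."

stem_from_find = name[:first_dot]    # strips the whole ".cpa.gz" suffix
stem_from_rfind = name[:last_dot]    # strips only ".gz"

print(stem_from_find)   # out_cp/0010
print(stem_from_rfind)  # out_cp/0010.cpa
```

With a single-part extension such as .cpr both searches agree, which is one way a switch to a two-part extension like .cpa.gz can change results in downstream code.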

pbehne avatar Jun 19 '24 03:06 pbehne

All good @roystgnr ?

loganharbour avatar Jun 19 '24 21:06 loganharbour

Okay, everything is ready on my end.

pbehne avatar Jun 19 '24 21:06 pbehne

Happy with the changes for the issues I caught.

roystgnr avatar Jun 24 '24 13:06 roystgnr

https://civet.inl.gov/job/2281545/ this is a valid failure I believe

loganharbour avatar Jun 24 '24 13:06 loganharbour

> https://civet.inl.gov/job/2281545/ this is a valid failure I believe

"It runs on my machine" :)

I'll invalidate it, and if it keeps failing, investigate more.

pbehne avatar Jun 24 '24 14:06 pbehne

> > https://civet.inl.gov/job/2281545/ this is a valid failure I believe
>
> "It runs on my machine" :)
>
> I'll invalidate it, and if it keeps failing, investigate more.

Odds are you're missing a dependency in the tests. A lot of skips exist on the installed tests, which leads to a significantly different dependency tree. Regardless, invalidating over and over isn't the solution. Unfortunately you need to actually dig into it.

loganharbour avatar Jun 24 '24 14:06 loganharbour

@loganharbour, I think we're finally good. I updated my bison and griffin PRs to reflect the latest changes, and those are both passing.

pbehne avatar Jun 26 '24 16:06 pbehne

I still worry about foo.bar.e.gz sorts of filenames, but I don't think there's any way to disambiguate "this is the extension" vs "this is just a period in the middle of the filename" without using a predefined set of allowed extensions.

We can actually do something akin to the file command in Linux to get more information about the file, but that's for another day.
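A minimal sketch of the "predefined set of allowed extensions" approach is given below. The `KNOWN_EXTENSIONS` list and the function `split_known_extension` are hypothetical and chosen only to cover the names mentioned in this thread; they are not taken from MOOSE or libMesh.

```python
# Hypothetical list of recognized extensions, for illustration only.
KNOWN_EXTENSIONS = (".cpa.gz", ".cpa", ".cpr", ".e.gz", ".e")


def split_known_extension(filename):
    """Split filename into (stem, extension) using only recognized extensions."""
    # Try the longest extensions first so ".cpa.gz" wins over any shorter match.
    for ext in sorted(KNOWN_EXTENSIONS, key=len, reverse=True):
        if filename.endswith(ext):
            return filename[: -len(ext)], ext
    return filename, ""  # no recognized extension: leave the name untouched


print(split_known_extension("foo.bar.e.gz"))  # ('foo.bar', '.e.gz')
print(split_known_extension("0010.cpa.gz"))   # ('0010', '.cpa.gz')
print(split_known_extension("notes.txt"))     # ('notes.txt', '')
```

Because only listed suffixes are treated as extensions, a period in the middle of the name (as in foo.bar) is never mistaken for one. The cost is that the list has to be maintained as formats are added.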

loganharbour avatar Jun 26 '24 17:06 loganharbour

@lindsayad NEAMS MP project

pbehne avatar Jul 08 '24 15:07 pbehne