moose
Checkpoint enhancements
This PR provides enhancements to the existing automatic wall time checkpoint system. Specifically,
- The `Outputs/wall_time_checkpoint` shortcut syntax is introduced as an easy way to turn off automatic checkpoint output.
- An informative block is output to the MOOSE header describing the configuration of the checkpoint system for each simulation.
- A new integration test is added to verify that subapps are not auto-checkpointed if not requested.
- A new regression test is added that sets the checkpoint wall time interval to a value smaller than the time required to write a checkpoint to verify that no issues are encountered.
- The default checkpoint wall time interval is adjusted from 10 minutes to 1 hour based on community feedback.
- The checkpoint format is changed from binary (.cpr) to compressed ASCII (.cpa.gz) so that checkpointing works on systems without XDR support.
Closes #27240, #27205, #26682
Related PRs to update apps: Griffin: https://github.inl.gov/ncrc/griffin/pull/1876 Bison: https://github.inl.gov/ncrc/bison/pull/5933
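To make the wall time trigger in the list above concrete, here is a minimal sketch of the idea, not MOOSE's actual implementation. The class name `WallTimeCheckpointer` and its `clock` argument are hypothetical. It assumes the check runs at the end of each time step and that the interval is measured from the moment the previous checkpoint finished writing.

```python
class WallTimeCheckpointer:
    """Hypothetical illustration of a wall time checkpoint trigger."""

    def __init__(self, interval_seconds, clock):
        self.interval = interval_seconds  # e.g. 3600.0 for the new 1 hour default
        self.clock = clock                # callable returning wall time in seconds
        self.last = clock()               # time the previous checkpoint finished
        self.written = 0                  # number of checkpoints "written" so far

    def end_of_timestep(self):
        """Write a checkpoint if at least one interval has elapsed."""
        if self.clock() - self.last >= self.interval:
            self.written += 1             # stand-in for writing the files
            self.last = self.clock()      # restart the timer after the write
            return True
        return False
```

Under these assumptions, an interval shorter than the time needed to write a checkpoint simply means every time step triggers a write, which is the situation the new regression test exercises.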
Job Coverage on 6dcfe1f wanted to post the following:
Framework coverage
| | 68a347 | #27268 6dcfe1 | | |
|---|---|---|---|---|
| | Total | Total | +/- | New |
| Rate | 85.06% | 85.06% | +0.00% | 96.88% |
| Hits | 104403 | 104423 | +20 | 93 |
| Misses | 18335 | 18334 | -1 | 3 |
Modules coverage
Coverage did not change
Full coverage reports
Reports
- framework
- chemical_reactions
- combined
- contact
- electromagnetics
- external_petsc_solver
- fluid_properties
- fsi
- functional_expansion_tools
- geochemistry
- heat_transfer
- level_set
- misc
- navier_stokes
- optimization
- peridynamics
- phase_field
- porous_flow
- ray_tracing
- rdg
- reactor
- richards
- scalar_transport
- solid_mechanics
- solid_properties
- stochastic_tools
- thermal_hydraulics
- xfem
This comment will be updated on new commits.
Tag when needed for a review, silencing until then
@loganharbour, ready for review. I'll work on fixes for the failing tests in Bison and Griffin in the meantime. Should be trivial.
Not going all the way to .cpa.gz yet?
please clarify?
The missing context here was some discussion on Slack - https://moosedevelopers.slack.com/archives/C01054VRUEM/p1712693072807749?thread_ts=1711666351.143939&cid=C01054VRUEM
The summary is that IMHO there's no reason to prefer binary .cpr over ASCII+gzip'ed .cpa.gz - the latter is usually a little smaller, it's more compatible, and it's easier to debug with. But .cpr vs .cpa has tradeoffs in both directions; the .cpa files will generally be significantly larger. So right now I'm wondering if this PR (which on first skim is just switching to .cpa?) is one step in the .cpa.gz direction, or if there's some reason to avoid gzip that I hadn't thought about, or (most seriously) if Patrick tried gzip but something didn't work correctly.
Are all wall time checkpoints kept? When I do manual checkpointing I usually only keep the last 2 checkpoints. For long-running simulations the checkpoints can occupy massive amounts of disk space, especially if we move to uncompressed ASCII checkpoints. I would not want my simulations to fail because they run out of disk space.
By default, only the last 2 wall time checkpoints are kept, just like the default for other checkpoints. This can be changed through the same `num_files` parameter.
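As a rough illustration of the "keep only the last 2" behavior described above, the sketch below prunes a directory down to its newest entries. This is not MOOSE's implementation. The function `prune_checkpoints` is hypothetical, and it assumes checkpoint names start with a zero-padded step number so that sorting by name matches chronological order. Only the parameter name `num_files` comes from the thread.

```python
import os


def prune_checkpoints(directory, num_files=2):
    """Delete all but the newest `num_files` entries in `directory`.

    Assumes zero-padded step prefixes (e.g. 0004.cpa.gz), so a plain
    lexicographic sort orders the files from oldest to newest.
    """
    names = sorted(os.listdir(directory))
    stale = names[:-num_files] if num_files > 0 else names
    for name in stale:
        os.remove(os.path.join(directory, name))
    return sorted(os.listdir(directory))
```

With this kind of rotation, disk usage is bounded by `num_files` checkpoints regardless of how long the simulation runs.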
> Not going all the way to .cpa.gz yet?
I misunderstood our discussion on Slack: I thought that compression and decompression are performed automatically when using the .cpa format. What library should I use to do this, zlib?
It's "automatic" but based on filename. Ask for a `.cpa` extension and you get plain ASCII; ask for `.cpa.gz` and you get gzipped ASCII.
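The filename-driven behavior described above can be pictured with a small sketch. This is an analogy written with Python's standard `gzip` module, not the library code the checkpoint writer actually calls. The helper names `open_checkpoint_for_write` and `write_demo` are hypothetical.

```python
import gzip
import os


def open_checkpoint_for_write(path):
    """Pick the writer from the filename alone (hypothetical helper)."""
    if path.endswith(".cpa.gz"):
        # Gzipped ASCII: same text, compressed on the fly.
        return gzip.open(path, "wt", encoding="ascii")
    if path.endswith(".cpa"):
        # Plain ASCII.
        return open(path, "w", encoding="ascii")
    raise ValueError("unrecognized checkpoint extension: " + path)


def write_demo(directory, name, payload):
    """Write `payload` under `directory/name` and return the full path."""
    path = os.path.join(directory, name)
    with open_checkpoint_for_write(path) as handle:
        handle.write(payload)
    return path
```

The caller never requests compression explicitly; choosing the `.cpa.gz` name is the request.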
@pbehne was this change made? This is pretty significant and I'm not sure it'll work 100% with all of the automated file searching we do when specifying cp directories
Yes, the change has been made.
Can you give examples of the header output? It's a bit more difficult to follow
See attached for more examples.
outputs/checkpoint.interval/test_files: Current Time: Wed Apr 24 20:44:09 2024
outputs/checkpoint.interval/test_files: Executable Timestamp: Tue Apr 23 10:30:46 2024
outputs/checkpoint.interval/test_files:
outputs/checkpoint.interval/test_files: Checkpoint:
outputs/checkpoint.interval/test_files: Wall Time Interval: Every 3600.000000 s
outputs/checkpoint.interval/test_files: User Checkpoint: Outputs/out
outputs/checkpoint.interval/test_files: # Checkpoints Kept: 2
outputs/checkpoint.interval/test_files: Execute On: TIMESTEP_END
outputs/checkpoint.interval/test_files:
outputs/checkpoint.interval/test_files: Parallelism:
outputs/checkpoint.interval/test_files: Num Processors: 1
outputs/checkpoint.interval/test_files: Num Threads: 1
outputs/checkpoint.interval/test_files:
outputs/checkpoint.interval/test_files: Mesh:
outputs/checkpoint.interval/test_files: Parallel Type: replicated
outputs/checkpoint.interval/test_files: Mesh Dimension: 2
outputs/checkpoint.interval/test_files: Spatial Dimension: 2
outputs/checkpoint.interval/test_files: Nodes: 121
outputs/checkpoint.interval/test_files: Elems: 100
outputs/checkpoint.interval/test_files: Num Subdomains: 1

outputs/checkpoint.default/recover: Executable Timestamp: Tue Apr 23 10:30:46 2024
outputs/checkpoint.default/recover:
outputs/checkpoint.default/recover: Checkpoint:
outputs/checkpoint.default/recover: Wall Time Interval: Every 3600.000000 s
outputs/checkpoint.default/recover: User Checkpoint: Disabled
outputs/checkpoint.default/recover: # Checkpoints Kept: 2
outputs/checkpoint.default/recover: Execute On: TIMESTEP_END
outputs/checkpoint.default/recover:
outputs/checkpoint.default/recover: Parallelism:
outputs/checkpoint.default/recover: Num Processors: 1
outputs/checkpoint.default/recover: Num Threads: 1

outputs/checkpoint.default/wall_time_interval:
outputs/checkpoint.default/wall_time_interval: Checkpoint:
outputs/checkpoint.default/wall_time_interval: Wall Time Interval: Every 0.020000 s
outputs/checkpoint.default/wall_time_interval: User Checkpoint: Disabled
outputs/checkpoint.default/wall_time_interval: # Checkpoints Kept: 2
outputs/checkpoint.default/wall_time_interval: Execute On: TIMESTEP_END
outputs/checkpoint.default/wall_time_interval:
outputs/checkpoint.default/wall_time_interval: Parallelism:

outputs/checkpoint.default/wall_time_interval_disabled: Current Time: Wed Apr 24 20:44:10 2024
outputs/checkpoint.default/wall_time_interval_disabled: Executable Timestamp: Tue Apr 23 10:30:46 2024
outputs/checkpoint.default/wall_time_interval_disabled:
outputs/checkpoint.default/wall_time_interval_disabled: Checkpoint:
outputs/checkpoint.default/wall_time_interval_disabled: Wall Time Interval: Disabled
outputs/checkpoint.default/wall_time_interval_disabled: User Checkpoint: Disabled
outputs/checkpoint.default/wall_time_interval_disabled:
outputs/checkpoint.default/wall_time_interval_disabled: Parallelism:
outputs/checkpoint.default/wall_time_interval_disabled: Num Processors: 1
outputs/checkpoint.default/wall_time_interval_disabled: Num Threads: 1
Copying and pasting into the code blocks above really messes with the indentation. I recommend downloading output.txt and running `cat` on it.
@loganharbour, I think we are good now. Bison is passing at https://github.inl.gov/ncrc/bison/pull/5933 and I'm currently working on Griffin.
Based on my changes to address @roystgnr's review, I pushed the latest changes to my Bison and Griffin PRs. These are now failing. I'll look into them more tomorrow to see what implications `find` vs `rfind` have.
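The thread does not say which strings `find` and `rfind` were applied to. Assuming they were used to locate the extension in a checkpoint filename, the sketch below shows how much the choice matters once a name contains more than one period. Python's `str.find` and `str.rfind` stand in for the C++ `std::string` methods here, and both helper functions are hypothetical.

```python
def split_at_first_dot(name):
    """Treat everything from the FIRST period onward as the extension."""
    index = name.find(".")
    return (name, "") if index == -1 else (name[:index], name[index:])


def split_at_last_dot(name):
    """Treat everything from the LAST period onward as the extension."""
    index = name.rfind(".")
    return (name, "") if index == -1 else (name[:index], name[index:])


# The two searches disagree on a double extension:
print(split_at_first_dot("0004-mesh.cpa.gz"))  # ('0004-mesh', '.cpa.gz')
print(split_at_last_dot("0004-mesh.cpa.gz"))   # ('0004-mesh.cpa', '.gz')
```

Searching from the front captures the full `.cpa.gz` suffix but misfires when the base name itself contains a period; searching from the back has the opposite problem.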
All good @roystgnr ?
Okay, everything is ready on my end.
Happy with the changes for the issues I caught.
https://civet.inl.gov/job/2281545/ this is a valid failure I believe
"It runs on my machine" :)
I'll invalidate it, and if it keeps failing, investigate more.
Odds are you're missing a dependency in the tests. A lot of skips exist on the installed tests, which leads to a significantly different dependency tree. Regardless, invalidating over and over isn't the solution. Unfortunately you need to actually dig into it.
@loganharbour, I think we're finally good. I updated my bison and griffin PRs to reflect the latest changes, and those are both passing.
I still worry about foo.bar.e.gz sorts of filenames, but I don't think there's any way to disambiguate "this is the extension" vs "this is just a period in the middle of the filename" without using a predefined set of allowed extensions.
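A minimal sketch of the "predefined set of allowed extensions" idea mentioned above might look like the following. The extension list and the function name `split_known_extension` are assumptions made for illustration, not what the PR implements.

```python
# Hypothetical allow-list; a real one would come from the supported formats.
KNOWN_EXTENSIONS = (".cpa.gz", ".cpa", ".cpr", ".e.gz", ".e")


def split_known_extension(name):
    """Split `name` into (base, extension) using only known extensions.

    Longer extensions are tried first so that ".e.gz" wins over ".e",
    and a period that is not part of a known suffix stays in the base.
    """
    for ext in sorted(KNOWN_EXTENSIONS, key=len, reverse=True):
        if name.endswith(ext) and len(name) > len(ext):
            return name[: -len(ext)], ext
    return name, ""


print(split_known_extension("foo.bar.e.gz"))  # ('foo.bar', '.e.gz')
print(split_known_extension("foo.bar"))       # ('foo.bar', '')
```

The cost of this approach is that the list must be kept in sync with every format the code is expected to recognize.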
We can actually do something akin to the `file` command in Linux to get more information about the file, but that's for another day.
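One simple form of that `file`-style inspection is to look at a file's leading bytes instead of its name. The sketch below checks only for the two-byte gzip magic number (`1f 8b`, as defined by the gzip format); the function name `looks_gzipped` is hypothetical, and this is far narrower than what `file` itself does.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip member


def looks_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC
```

A content check like this does not depend on the filename at all, so a period in the middle of a name cannot confuse it.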
@lindsayad NEAMS MP project