Wall time checkpoint
Automatic checkpoints based on wall time
Overview:
This pull request enhances the checkpoint system: it introduces wall-time based checkpoints, standardizes the checkpoint file structure, patches logic bugs, renames parameters for clarity, and optimizes existing logic for efficiency. It also adds comprehensive testing to ensure the reliability and functionality of the updated checkpoint system.
Changes Included:

- **Wall Time Checkpoints and Code Cleanup:**
  - Implemented automatic wall time checkpoints, allowing checkpoints to be written based on wall time intervals. This behavior can be modified through the new `Outputs/Checkpoint/wall_time_interval` input parameter.
  - Refactored common code from the child `Checkpoint` class into the parent `Output` class, enhancing code maintainability.
  - Renamed variables and deprecated input parameter names for improved clarity and consistency across tests: `interval` was renamed to `time_step_interval`, and `minimum_time_interval` was renamed to `min_simulation_time_interval`.
  - Renamed the CLI parameter `--half-transient` to `--test-checkpoint-half-transient`.
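A minimal input-file sketch of how the new parameter might be used (the block layout and the 600-second value are illustrative, not taken from this PR):

```
[Outputs]
  [checkpoint]
    type = Checkpoint
    # Write a checkpoint every 600 s of elapsed wall time (illustrative value),
    # independent of how many time steps have completed in that window.
    wall_time_interval = 600
  []
[]
```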
- **Standardizing Checkpoint File Structure:**
  - Modified the signal-based checkpoint folder naming convention to align with the user-specified checkpoint folder naming convention. This change enhances consistency and simplifies management.
- **Logic Bug Patches:**
  - Patched a logic bug related to the update of `_last_output_time` (now renamed to `_last_output_simulation_time`), ensuring the variable is updated when it is supposed to be.
  - Patched a bug that accessed `FileOutput::_file_base` before it was set to its final (correct) value. MOOSE is now able to find the correct checkpoint directory when recovering a simulation.
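The `_file_base` bug is an instance of a general use-before-final-value hazard. A minimal Python sketch of the pattern (hypothetical names, not MOOSE code):

```python
class FileOutputSketch:
    """Illustrates reading a member before setup assigns its final value."""

    def __init__(self):
        self._file_base = "default"  # provisional value at construction

    def finalize_file_base(self, base):
        self._file_base = base       # final value, assigned later during setup

    def checkpoint_dir(self):
        return self._file_base + "_cp"

out = FileOutputSketch()
too_early = out.checkpoint_dir()       # bug: "default_cp" -- a directory that never existed
out.finalize_file_base("recover_out")
correct = out.checkpoint_dir()         # queried only after setup: "recover_out_cp"
```

The fix amounts to deferring the query until after `_file_base` holds its final value.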
- **Override Default Time Step Interval:**
  - Enabled the override of the default `time_step_interval` when `wall_time_interval` is specified but `time_step_interval` is not, ensuring the intended output frequency is respected.
- **Cleanup and Optimization:**
  - Removed the `delete_output_folder` parameter from `RunApp`, as it was unused.
  - Optimized logic in `SignalTester`, particularly in `send_signal`, to improve efficiency and reliability during testing.
  - Only one checkpoint may now be defined per multiapp level.
- **Test Coverage Enhancement:**
  - Added comprehensive test cases to validate the functionality of wall time checkpoints and to detect multiple checkpoint objects at a given multi-app level.
  - Introduced a command-line parameter (`output-wall-time-interval`) and corresponding code to facilitate automatic wall-time checkpoint testing without explicitly defining checkpoints in the input file.
  - Renamed the `--half-transient` command-line parameter to `--test-checkpoint-half-transient`.
Impact:
These changes significantly improve the robustness, flexibility, and maintainability of the checkpoint system. By addressing logic bugs, introducing new features, and refining existing functionalities, this PR lays the groundwork for a more efficient and reliable checkpoint system.
Related PRs:
- Bison: https://github.inl.gov/ncrc/bison/pull/5803 (merged)
- Griffin: https://github.inl.gov/ncrc/griffin/pull/1723 (merged), https://github.inl.gov/ncrc/griffin/pull/1767
References:
- Closes idaholab#26588
Job Documentation on 6851424 wanted to post the following:
View the site here
This comment will be updated on new commits.
Job Coverage on 6851424 wanted to post the following:
Framework coverage
| | c7df31 (Total) | #26588 685142 (Total) | +/- | New |
|---|---|---|---|---|
| Rate | 85.29% | 85.30% | +0.01% | 100.00% |
| Hits | 100942 | 100985 | +43 | 108 |
| Misses | 17409 | 17408 | -1 | 0 |
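As a quick sanity check, the Rate rows above follow from hits / (hits + misses); a short script reproduces the framework numbers:

```python
# Framework coverage figures from the table above.
base_hits, base_misses = 100942, 17409   # c7df31
head_hits, head_misses = 100985, 17408   # 685142

base_rate = round(100 * base_hits / (base_hits + base_misses), 2)
head_rate = round(100 * head_hits / (head_hits + head_misses), 2)
print(base_rate, head_rate)  # 85.29 85.3
```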
Modules coverage
Optimization
| | c7df31 (Total) | #26588 685142 (Total) | +/- | New |
|---|---|---|---|---|
| Rate | 89.05% | 89.05% | - | 100.00% |
| Hits | 1943 | 1943 | - | 1 |
| Misses | 239 | 239 | - | 0 |
Full coverage reports
Reports
- framework
- chemical_reactions
- combined
- contact
- electromagnetics
- external_petsc_solver
- fluid_properties
- fsi
- functional_expansion_tools
- geochemistry
- heat_transfer
- level_set
- misc
- navier_stokes
- optimization
- peridynamics
- phase_field
- porous_flow
- ray_tracing
- rdg
- reactor
- richards
- scalar_transport
- solid_mechanics
- solid_properties
- stochastic_tools
- thermal_hydraulics
- xfem
@milljm, @loganharbour:
The test harness currently has the ability to delete old output files before re-running tests, which helps prevent false successes; this is done through the `delete_output_before_running` parameter. However, it only applies to files that are directly involved in determining whether a test passes (i.e., the exodus file in Exodiff tests, the `check_files` in CheckFiles tests, etc.). Currently, there is no check that deletes old checkpoint folders before running a test.

I would like to incorporate a new boolean test harness parameter that, if set to true, looks for and deletes all `*_cp` directories in the test directory. I was talking to @permcody about what the default behavior should be. I am leaning towards having checkpoint folders deleted by default and making the user responsible for setting the new parameter to false for tests that require the presence of checkpoints for restart/recover. Thoughts?
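A sketch of what the proposed cleanup could look like (the function name and its placement in the harness are hypothetical; only directories matching `*_cp` directly under the test directory would be removed):

```python
import glob
import os
import shutil

def delete_checkpoint_dirs(test_dir):
    """Delete every *_cp checkpoint directory directly under test_dir
    and return the names of the directories that were removed."""
    removed = []
    for path in glob.glob(os.path.join(test_dir, "*_cp")):
        if os.path.isdir(path):  # skip plain files that happen to match
            shutil.rmtree(path)
            removed.append(os.path.basename(path))
    return sorted(removed)
```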
Instead of a bool to delete wildcard data, can we instead do something where if the key is populated, the value becomes what gets deleted?

```
[test]
  type = Exodiff
  exodiff = 'recover_out.e'
  delete_checkpoints = checkpoint_directory
  delete_output_before_running = true
[]
```

will delete `recover_out.e` and directories named `checkpoint_directory`, but leave any directories/files named `*_cp` alone.
I think that I would prefer instead that we keep a registry of files that are created during the run, and put that in a common output file that we can read. That way we don't need to keep adding things to delete to the harness, but simply need to read the result.

This would tie in with @dschwen's idea of keeping a pid file around, but with more info.
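A minimal sketch of such a registry (class name and JSON layout are assumptions, not an existing MOOSE/TestHarness API): the application records each output it creates and writes the list, plus the pid, to one file the harness can read back.

```python
import json
import os

class OutputRegistry:
    """Hypothetical run-output registry: the app records the files it
    creates, and the harness reads the registry to know what to clean up."""

    def __init__(self, registry_path):
        self.registry_path = registry_path
        self.outputs = []

    def record(self, path):
        # Called wherever the app creates an output file or directory.
        self.outputs.append(path)

    def save(self):
        # Including the pid ties in with the pid-file idea mentioned above.
        with open(self.registry_path, "w") as f:
            json.dump({"pid": os.getpid(), "outputs": self.outputs}, f)

    @staticmethod
    def load(registry_path):
        with open(registry_path) as f:
            return json.load(f)
```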
> Instead of a bool to delete wildcard data, can we instead do something where if the key is populated, the value becomes what gets deleted?
>
> ```
> [test]
>   type = Exodiff
>   exodiff = 'recover_out.e'
>   delete_checkpoints = checkpoint_directory
>   delete_output_before_running = true
> []
> ```
>
> will delete `recover_out.e` and directories named `checkpoint_directory`, but leave any directories/files named `*_cp` alone.
@milljm I had a similar idea. The only annoying part about it is that it would require changing A LOT of test files without an easy way to automate the change process (have to manually determine the cp folder name for each test, etc). However, the amount of pain required to implement the change does not mean that the change itself is a bad idea.
We are already creating a `.previous_test_results.json` file, can we use that (for @dschwen's work as well)? And if that's not possible, we should probably talk about all these 'registration files' that are populated during a TestHarness run, and try to consolidate them...
> I think that I would prefer instead that we keep a registry of files that are created during the run, and put that in a common output file that we can read. That way we don't need to keep adding things to delete to the harness, but simply need to read the result.
>
> This would tie in with @dschwen's idea of keeping a pid file around, but with more info.
@loganharbour I like this idea because of its robustness. The implementation will probably be a separate PR, though. Tagging issue #9304.
> Instead of a bool to delete wildcard data, can we instead do something where if the key is populated, the value becomes what gets deleted? [...]
>
> @milljm I had a similar idea. The only annoying part about it is that it would require changing A LOT of test files without an easy way to automate the change process (have to manually determine the cp folder name for each test, etc). However, the amount of pain required to implement the change does not mean that the change itself is a bad idea.
My main concern is that the hard-coded wildcard pattern could have the unintentional side effect of deleting data we did not intend to delete.
> We are already creating a `.previous_test_results.json` file, can we use that (for @dschwen's work as well)? And if that's not possible, we should probably talk about all these 'registration files' that are populated during a TestHarness run. Try to consolidate them...
No, this won't have anything to do with the test harness
> We are already creating a `.previous_test_results.json` file, can we use that (for @dschwen's work as well)? [...]
>
> No, this won't have anything to do with the test harness
Sorry, no to what? We're not going to use `.previous_test_results.json`, or `moose_test.yaml`? That leaves us with Daniel's registry methods. I am fine with that. But won't this start to have things to do with the test harness? My concern is that that makes 3 different "registry files" that control TestHarness behavior.
@loganharbour, @permcody: This PR affects tests in Bison and Griffin (see links in PR description). I have draft PRs up for these apps and all tests are passing, but I have not yet updated deprecated parameter names there. I would like reviews and approvals here before doing that.
> I think that I would prefer instead that we keep a registry of files that are created during the run, and put that in a common output file that we can read. That way we don't need to keep adding things to delete to the harness, but simply need to read the result.
How would we enforce this, though? What's to stop somebody from opening a file in their custom UserObject and dumping out data? Sure, this isn't a bad idea to capture most of what is produced by people using our API, but it still misses anyone going outside of that system, which may impact test results. We could always use the pedantic checker, though, to see if files are appearing outside of our interfaces...
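One way such a pedantic check could work, sketched here as a hypothetical helper rather than existing TestHarness code: snapshot the directory tree before and after the run and diff the two sets, which catches files created outside any registered interface.

```python
import os

def snapshot(root):
    """Return the set of file paths under root, relative to root."""
    paths = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.add(os.path.relpath(os.path.join(dirpath, name), root))
    return paths

def new_files(before, after):
    """Files that appeared between two snapshots, however they were created."""
    return sorted(after - before)
```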
Job Precheck on 8bcf005 wanted to post the following:

Your code requires style changes. A patch was auto generated and copied here.

You can directly apply the patch by running, in the top level of your repository:

```shell
curl -s https://mooseframework.inl.gov/docs/PRs/26588/clang_format/style.patch | git apply -v
```

Alternatively, with your repository up to date and in the top level of your repository:

```shell
git clang-format 796cb8aa68a15d61eb82eb353bb6582b01c6634f
```
@permcody, @loganharbour: Ready to merge?
@loganharbour, Griffin calls `_app.halfTransient`, which has been renamed to `_app.testCheckpointHalfTransient`. We need to break next to get this Griffin PR in: https://github.inl.gov/ncrc/griffin/pull/1767
> @loganharbour, Griffin calls `_app.halfTransient`, which has been renamed to `_app.testCheckpointHalfTransient`. We need to break next to get this Griffin PR in: https://github.inl.gov/ncrc/griffin/pull/1767
Never mind, I added a deprecation option for `MooseApp::halfTransient`, so we should be able to pass Griffin tests now.
@loganharbour, it is ready.
Please resolve the conversations that you explicitly followed and made changes for. This looks good.

Let's check with Cody if `--test-checkpoint-half-transient` is too long, but I think I'm fine with it.
> Please resolve the conversations that you explicitly followed and made changes for. This looks good.
>
> Let's check with Cody if `--test-checkpoint-half-transient` is too long, but I think I'm fine with it.
Per conversation with Cody, the `--test-checkpoint-half-transient` name is fine.
Just need to rebase and deal with the conflicts. Good to go after.
Should be good now.
Any chance the changes here are responsible for the new libMesh MOOSE-test CI failures? E.g. https://civet.inl.gov/job/2127543/ - 3 of our recipes are giving "unknown dependency" failures from the non-skipped `misc/signal_handler.test_signal_*recover*` tests.

The only thing they have in common is this `delete_output_before_running` parameter.
I'll try to replicate the `signal_handler` failure in a container on rod and report back.
I spoke with @milljm, and he says there really isn't a way to easily replicate the environment used in the CIVET job. He can comment more on the specifics.
I am looking a little more closely at this, and I think this failing target was never moved over to use containers? I am seeing this target loading the following modules:

```
5) moose-petsc-0b2deac-gcc-9.4.0-4wj7ppk <aL>
6) moose-petsc/0b2deac <aL>
7) moose-petsc/libmesh_ci
8) python-3.7.8-gcc-9.4.0-g
```

and I am scratching my head as to how this works. I don't see PETSc being built, so we must be using some old version indeed. I think this target needs to be switched over to use our containers.
The full list of checks that are failing with this error is:

- Test 32bit
- Test MOOSE min clang
- Test debug

They are all checks that run on libMesh PRs, so they might have been overlooked.
I'm working on these slowly.