moose icon indicating copy to clipboard operation
moose copied to clipboard

Wall time checkpoint

Open pbehne opened this issue 1 year ago • 22 comments

Automatic checkpoints based on wall time

Overview:

This pull request aims to enhance the checkpoint system. The changes encompass several key improvements, including introducing wall-time based checkpoints, standardizing checkpoint file structures, addressing logic bugs, refining parameter names for clarity, and optimizing logic for efficiency. Additionally, this PR includes comprehensive testing to ensure the reliability and functionality of the updated checkpoint system.

Changes Included:

  1. Wall Time Checkpoints and Code Cleanup:

    • Implemented automatic wall time checkpoints, allowing checkpoints to be written based on wall time intervals. This behavior can be modified through the new Outputs/Checkpoint/wall_time_interval input parameter.
    • Refactored common code from the child Checkpoint class into the parent Output class, enhancing code maintainability.
    • Renamed variables and deprecated input parameter names for improved clarity and consistency across tests. interval was renamed to time_step_interval and minimum_time_interval was renamed to min_simulation_time_interval.
    • Renamed the CLI parameter --half-transient to --test-checkpoint-half-transient.
  2. Standardizing Checkpoint File Structure:

    • Modified the signal-based checkpoint folder naming convention to align with the user-specified checkpoint folder naming convection. This change enhances consistency and simplifies management.
  3. Logic Bug Patch:

    • Patched a logic bug related to the update of _last_output_time (now renamed to _last_output_simulation_time), ensuring the variable is updated when it is supposed to.
    • Patched a bug that accessed FileOutput::_file_base before it was set to it's final (correct) value. MOOSE is now able to find the correct checkpoint directory when recovering a simulation.
  4. Override Default Time Step Interval:

    • Enabled the override of the default time_step_interval when the wall_time_interval is specified but not the time_step_interval, ensuring flexibility in intended output frequency.
  5. Cleanup and Optimization:

    • Removed the delete_output_folder parameter from RunApp as it was unused.
    • Optimized logic in SignalTester, particularly in send_signal, to improve efficiency and reliability during testing.
    • Only one checkpoint may now be defined per multiapp level.
  6. Test Coverage Enhancement:

    • Added comprehensive test cases to validate the functionality of wall time checkpoints and to detect multiple checkpoint objects at a given multi-app level.
    • Introduced a command-line parameter (output-wall-time-interval) and corresponding code to facilitate automatic wall-time checkpoint testing without explicitly defining checkpoints in the input file.
    • Renamed the --half-transient command line parameter to --test-checkpoint-half-transient.

Impact:

These changes significantly improve the robustness, flexibility, and maintainability of the checkpoint system. By addressing logic bugs, introducing new features, and refining existing functionalities, this PR lays the groundwork for a more efficient and reliable checkpoint system.

Related PRs:

Bison: https://github.inl.gov/ncrc/bison/pull/5803 (merged) Griffin: https://github.inl.gov/ncrc/griffin/pull/1723 (merged), https://github.inl.gov/ncrc/griffin/pull/1767

References:

  • Closes idaholab#26588

pbehne avatar Jan 18 '24 16:01 pbehne

Job Documentation on 6851424 wanted to post the following:

View the site here

This comment will be updated on new commits.

moosebuild avatar Jan 18 '24 19:01 moosebuild

Job Coverage on 6851424 wanted to post the following:

Framework coverage

c7df31 #26588 685142
Total Total +/- New
Rate 85.29% 85.30% +0.01% 100.00%
Hits 100942 100985 +43 108
Misses 17409 17408 -1 0

Diff coverage report

Full coverage report

Modules coverage

Optimization

c7df31 #26588 685142
Total Total +/- New
Rate 89.05% 89.05% - 100.00%
Hits 1943 1943 - 1
Misses 239 239 - 0

Diff coverage report

Full coverage report

Full coverage reports

Reports

This comment will be updated on new commits.

moosebuild avatar Jan 19 '24 02:01 moosebuild

@milljm, @loganharbour:

The test harness currently has the ability to delete old output file before re-running tests. This helps to prevent false successes and is done through the delete_output_before_running parameter. However, this only applies to files that are directly involved in determining whether a test passes (i.e., the exodus file in Exodiff tests, the check_files in CheckFiles tests, etc). Currently, there is no check for old checkpoint folders to delete them before running a test. I would like to incorporate a new boolean test harness parameter that, if set to true, looks for and deletes all *_cp directories in the test directory. I was talking to @permcody about what the default behavior should be. I think I am leaning towards having checkpoint folders deleted by default and making the user responsible for setting the new parameter to false for tests that require the presence of checkpoints for restart/recover. Thoughts?

pbehne avatar Jan 23 '24 16:01 pbehne

Instead of a bool to delete wildcard data, can we instead do something where if the key is populated, the value becomes what gets deleted?

[test]
     type = Exodiff
     exodiff = 'recover_out.e'
     delete_checkpoints = checkpoint_directory
     delete_output_before_running = true
[]

will delete recover_out.e and directories named checkpoint_directory, but leave any directories/files named *_cp alone.

milljm avatar Jan 23 '24 16:01 milljm

I think that I would prefer instead that we keep a registry of files that are created during the run, and put that in a common output file that we can read. that way we don't need to keep adding things to delete to the harness, but simply need to read the result.

this would tie in with @dschwen's idea of keeping a pid file around. but with more info

loganharbour avatar Jan 23 '24 16:01 loganharbour

Instead of a bool to delete wildcard data, can we instead do something where if the key is populated, the value becomes what gets deleted?

[test]
     type = Exodiff
     exodiff = 'recover_out.e'
     delete_checkpoints = checkpoint_directory
     delete_output_before_running = true
[]

will delete recover_out.e and directories named checkpoint_directory, but leave any directories/files named *_cp alone.

@milljm I had a similar idea. The only annoying part about it is that it would require changing A LOT of test files without an easy way to automate the change process (have to manually determine the cp folder name for each test, etc). However, the amount of pain required to implement the change does not mean that the change itself is a bad idea.

pbehne avatar Jan 23 '24 16:01 pbehne

We are already creating a .previous_test_results.json file, can we use that (for @dschwen work as well)? And if that's not possible, we should probably talk about all these 'registration files' that are populating information during a TestHarness run. Try to consolidate them...

milljm avatar Jan 23 '24 16:01 milljm

I think that I would prefer instead that we keep a registry of files that are created during the run, and put that in a common output file that we can read. that way we don't need to keep adding things to delete to the harness, but simply need to read the result.

this would tie in with @dschwen's idea of keeping a pid file around. but with more info

@loganharbour I like this idea because of it's robustness. The implementation will probably be a separate PR, though. Tagging issue #9304.

pbehne avatar Jan 23 '24 16:01 pbehne

Instead of a bool to delete wildcard data, can we instead do something where if the key is populated, the value becomes what gets deleted?

[test]
     type = Exodiff
     exodiff = 'recover_out.e'
     delete_checkpoints = checkpoint_directory
     delete_output_before_running = true
[]

will delete recover_out.e and directories named checkpoint_directory, but leave any directories/files named *_cp alone.

@milljm I had a similar idea. The only annoying part about it is that it would require changing A LOT of test files without an easy way to automate the change process (have to manually determine the cp folder name for each test, etc). However, the amount of pain required to implement the change does not mean that the change itself is a bad idea.

My main concern is with the hard-coded wild card regular expression having an unintentional side effect of deleting data we did not intend.

milljm avatar Jan 23 '24 16:01 milljm

We are already creating a .previous_test_results.json file, can we use that (for @dschwen work as well)? And if that's not possible, we should probably talk about all these 'registration files' that are populating information during a TestHarness run. Try to consolidate them...

No, this won't have anything to do with the test harness

loganharbour avatar Jan 23 '24 17:01 loganharbour

We are already creating a .previous_test_results.json file, can we use that (for @dschwen work as well)? And if that's not possible, we should probably talk about all these 'registration files' that are populating information during a TestHarness run. Try to consolidate them...

No, this won't have anything to do with the test harness

Sorry, no to what? We're not going to use .previous_test_results.json, or moose_test.yaml. That leaves us with Daniel's registry methods. I am fine with that. But won't this start to have things to do with the test harness? My concern is that that makes 3 different "registry files" that control TestHarness behavior.

milljm avatar Jan 23 '24 19:01 milljm

@loganharbour, @permcody: This PR affects tests in Bison and Griffin (see links in PR description). I have draft PRs up for these apps and all tests are passing, but I have not yet updated deprecated parameter names there. I would like reviews and approvals here before doing that.

pbehne avatar Jan 29 '24 19:01 pbehne

I think that I would prefer instead that we keep a registry of files that are created during the run, and put that in a common output file that we can read. that way we don't need to keep adding things to delete to the harness, but simply need to read the result.

How would we enforce this though? What's to stop somebody from opening a file in their custom UserObject and dumping out data? Sure this isn't a bad idea to capture most of what is produced by people using our API, but it still misses anyone going outside of that system, which may impact test results. We could always use the pedantic checker though to see if files are appearing outside of our interfaces...

permcody avatar Jan 29 '24 20:01 permcody

Job Precheck on 8bcf005 wanted to post the following:

Your code requires style changes.

A patch was auto generated and copied here
You can directly apply the patch by running, in the top level of your repository:

curl -s https://mooseframework.inl.gov/docs/PRs/26588/clang_format/style.patch | git apply -v

Alternatively, with your repository up to date and in the top level of your repository:

git clang-format 796cb8aa68a15d61eb82eb353bb6582b01c6634f

moosebuild avatar Jan 29 '24 22:01 moosebuild

@permcody, @loganharbour: Ready to merge?

pbehne avatar Feb 02 '24 19:02 pbehne

@loganharbour, Griffin calls _app.halfTransient, which has been renamed to _app.testCheckpointHalfTransient. We need to break next to get this Griffin PR in: https://github.inl.gov/ncrc/griffin/pull/1767

pbehne avatar Feb 14 '24 15:02 pbehne

@loganharbour, Griffin calls _app.halfTransient, which has been renamed to _app.testCheckpointHalfTransient. We need to break next to get this Griffin PR in: https://github.inl.gov/ncrc/griffin/pull/1767

Never mind, I added a deprecation option for MooseApp::halfTransient, so we should be able to pass griffin tests now.

pbehne avatar Feb 19 '24 16:02 pbehne

@loganharbour, it is ready.

pbehne avatar Feb 19 '24 19:02 pbehne

Please resolve the conversations that you explicitly followed and made changes for. This looks good

Let's check with cody if --test-checkpoint-half-transient is too long, but I think I'm fine with it.

loganharbour avatar Feb 25 '24 22:02 loganharbour

Please resolve the conversations that you explicitly followed and made changes for. This looks good

Let's check with cody if --test-checkpoint-half-transient is too long, but I think I'm fine with it.

Per conversation with Cody, --test-checkpoint-half-transient name is fine.

pbehne avatar Feb 26 '24 15:02 pbehne

Just need to rebase and deal with the conflicts. Good to go after.

loganharbour avatar Mar 05 '24 19:03 loganharbour

Should be good now.

pbehne avatar Mar 11 '24 02:03 pbehne

Any chance the changes here are responsible for the new libMesh MOOSE-test CI failures? E.g. https://civet.inl.gov/job/2127543/ - 3 of our recipes are giving "unknown dependency" failures from the non-skipped misc/signal_handler.test_signal_*recover* tests.

roystgnr avatar Mar 19 '24 16:03 roystgnr

only thing is common is this delete_output_before_running parameter

GiudGiud avatar Mar 19 '24 16:03 GiudGiud

I'll try to replicate the signal_handler failure in a container on rod and report back.

pbehne avatar Mar 19 '24 17:03 pbehne

I spoke with @milljm, and he says there really isn't a way to easily replicate the environment used in the CIVET job. He can comment more on the specifics.

pbehne avatar Mar 19 '24 17:03 pbehne

I am looking a little more closely at this and I think this failing target was neglected to move over to use containers? I am seeing this target loading the following modules:

 5) moose-petsc-0b2deac-gcc-9.4.0-4wj7ppk <aL>      
 6) moose-petsc/0b2deac <aL>                        
 7) moose-petsc/libmesh_ci                          
 8) python-3.7.8-gcc-9.4.0-g

and I am scratching my head as to how this works. I don't see PETSc being built, so we must be using some old version indeed.

I think this target needs to be switched over to use our containers.

milljm avatar Mar 19 '24 17:03 milljm

The full list of checks that are failing with this error is:

Test 32bit
Test MOOSE min clang
Test debug

They are all checks that run on libmesh PRs, so they might have been overlooked.

jwpeterson avatar Mar 19 '24 20:03 jwpeterson

I'm working on these slowly.

loganharbour avatar Mar 19 '24 21:03 loganharbour