Feature: Checkpointing for T8codeMesh

Open jmark opened this issue 1 year ago • 13 comments

This PR adds checkpointing for T8codeMesh. By this, routines like save_mesh and load_mesh are supported.

Closes #2044

jmark avatar Jun 13 '24 16:06 jmark

Review checklist

This checklist is meant to assist creators of PRs (to let them know what reviewers will typically look for) and reviewers (to guide them in a structured review process). Items do not need to be checked explicitly for a PR to be eligible for merging.

Purpose and scope

  • [ ] The PR has a single goal that is clear from the PR title and/or description.
  • [ ] All code changes represent a single set of modifications that logically belong together.
  • [ ] No more than 500 lines of code are changed or there is no obvious way to split the PR into multiple PRs.

Code quality

  • [ ] The code can be understood easily.
  • [ ] Newly introduced names for variables etc. are self-descriptive and consistent with existing naming conventions.
  • [ ] There are no redundancies that can be removed by simple modularization/refactoring.
  • [ ] There are no leftover debug statements or commented code sections.
  • [ ] The code adheres to our conventions and style guide, and to the Julia guidelines.

Documentation

  • [ ] New functions and types are documented with a docstring or top-level comment.
  • [ ] Relevant publications are referenced in docstrings (see example for formatting).
  • [ ] Inline comments are used to document longer or unusual code sections.
  • [ ] Comments describe intent ("why?") and not just functionality ("what?").
  • [ ] If the PR introduces a significant change or new feature, it is documented in NEWS.md with its PR number.

Testing

  • [ ] The PR passes all tests.
  • [ ] New or modified lines of code are covered by tests.
  • [ ] New or modified tests run in less than 10 seconds.

Performance

  • [ ] There are no type instabilities or memory allocations in performance-critical parts.
  • [ ] If the PR intent is to improve performance, before/after time measurements are posted in the PR.

Verification

  • [ ] The correctness of the code was verified using appropriate tests.
  • [ ] If new equations/methods are added, a convergence test has been run and the results are posted in the PR.

Created with :heart: by the Trixi.jl community.

github-actions[bot] avatar Jun 13 '24 16:06 github-actions[bot]

Codecov Report

Attention: Patch coverage is 97.88136% with 5 lines in your changes missing coverage. Please review.

Project coverage is 95.44%. Comparing base (16c6f17) to head (aca9742). Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/auxiliary/t8code.jl | 75.00% | 3 Missing :warning: |
| src/callbacks_step/save_restart_dg.jl | 66.67% | 1 Missing :warning: |
| src/meshes/t8code_mesh.jl | 99.38% | 1 Missing :warning: |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1980      +/-   ##
==========================================
- Coverage   96.37%   95.44%   -0.93%     
==========================================
  Files         480      480              
  Lines       37855    38028     +173     
==========================================
- Hits        36482    36295     -187     
- Misses       1373     1733     +360     

| Flag | Coverage Δ |
|---|---|
| unittests | 95.44% <97.88%> (-0.93%) :arrow_down: |

codecov[bot] avatar Jun 13 '24 17:06 codecov[bot]

> Do you think the MPI failures are really unrelated?

No, I am not sure. I just know that we had stalling CI jobs before. And, looking through the recent failures, I think it was not always the same elixir.

benegee avatar Jul 03 '24 19:07 benegee

I get the feeling that the MPI tests are too big now and take too long. We probably have to split them up similar to the serial tests.

jmark avatar Jul 04 '24 09:07 jmark

Yes, could be related to OOM issues, cf. https://github.com/trixi-framework/Trixi.jl/issues/1471.

JoshuaLampert avatar Jul 04 '24 09:07 JoshuaLampert

I could narrow it down. It has something to do with Julia 1.10.4. With Julia 1.10.2 it does not stall. Investigating ...

jmark avatar Jul 05 '24 09:07 jmark

Are you able to reproduce the problem locally?

JoshuaLampert avatar Jul 05 '24 09:07 JoshuaLampert

Yes! With Julia 1.10.2 the t8code MPI tests run successfully. However, with Julia 1.10.4 the MPI test for elixir_advection_restart.jl 2D stalls for whatever reason. Running the elixir with MPI directly (not wrapped in a test set) does not stall. No idea, right now, what's going on ...

jmark avatar Jul 05 '24 13:07 jmark

Are you sure it's related to the patch version bump? Are you using an identical Manifest.toml for both tests?

sloede avatar Jul 05 '24 13:07 sloede

Yes! Working from the exact same project folder. Just pointing the Julia binary to either 1.10.2 or 1.10.4.

jmark avatar Jul 05 '24 13:07 jmark

So it consistently stalls with Julia 1.10.4, but consistently works with Julia 1.10.2 in multiple runs? Did you monitor RAM usage during the simulation?

JoshuaLampert avatar Jul 06 '24 13:07 JoshuaLampert

Yes! RAM usage is not out of the ordinary.

jmark avatar Jul 08 '24 09:07 jmark

I think I found the bug causing the stalls in the MPI runs. It was a silent memory leak/segfault. I added the fixes in the last commit. Furthermore, I changed the t8code C interface a tiny bit to simplify the code on the Trixi side. This PR has to wait for the next breaking t8code release, and specifically for the merge of this PR: https://github.com/DLR-AMR/t8code/pull/1115.

I'll try to push for a major t8code release by the end of next week.

jmark avatar Jul 09 '24 15:07 jmark

t8code 3.0.0 has been released and @jmark already updated t8code_jll.jl. Does T8code.jl need an update as well?

benegee avatar Nov 04 '24 07:11 benegee

Yes, indeed! Working on that.

jmark avatar Nov 04 '24 09:11 jmark

Concerning the failing invalidations check, see https://github.com/timholy/SnoopCompile.jl/issues/397

benegee avatar Nov 05 '24 08:11 benegee

@benegee As expected, several tests fail since some routines became obsolete with the breaking t8code 3.0.0 release. We have to iron out these issues first.

I suggest we first fix these problems in https://github.com/trixi-framework/Trixi.jl/pull/1939, since the changes on the Trixi side are minimal, and then merge that PR into the checkpointing PR.

jmark avatar Nov 05 '24 09:11 jmark