Feature: Checkpointing for T8codeMesh
This PR adds checkpointing for T8codeMesh. With this, routines such as save_mesh and load_mesh are supported for this mesh type.
Closes #2044
Review checklist
This checklist is meant to assist creators of PRs (to let them know what reviewers will typically look for) and reviewers (to guide them in a structured review process). Items do not need to be checked explicitly for a PR to be eligible for merging.
Purpose and scope
- [ ] The PR has a single goal that is clear from the PR title and/or description.
- [ ] All code changes represent a single set of modifications that logically belong together.
- [ ] No more than 500 lines of code are changed or there is no obvious way to split the PR into multiple PRs.
Code quality
- [ ] The code can be understood easily.
- [ ] Newly introduced names for variables etc. are self-descriptive and consistent with existing naming conventions.
- [ ] There are no redundancies that can be removed by simple modularization/refactoring.
- [ ] There are no leftover debug statements or commented code sections.
- [ ] The code adheres to our conventions and style guide, and to the Julia guidelines.
Documentation
- [ ] New functions and types are documented with a docstring or top-level comment.
- [ ] Relevant publications are referenced in docstrings (see example for formatting).
- [ ] Inline comments are used to document longer or unusual code sections.
- [ ] Comments describe intent ("why?") and not just functionality ("what?").
- [ ] If the PR introduces a significant change or new feature, it is documented in NEWS.md with its PR number.
Testing
- [ ] The PR passes all tests.
- [ ] New or modified lines of code are covered by tests.
- [ ] New or modified tests run in less than 10 seconds.
Performance
- [ ] There are no type instabilities or memory allocations in performance-critical parts.
- [ ] If the PR intent is to improve performance, before/after time measurements are posted in the PR.
Verification
- [ ] The correctness of the code was verified using appropriate tests.
- [ ] If new equations/methods are added, a convergence test has been run and the results are posted in the PR.
Created with :heart: by the Trixi.jl community.
Codecov Report
Attention: Patch coverage is 97.88136% with 5 lines in your changes missing coverage. Please review.
Project coverage is 95.44%. Comparing base (16c6f17) to head (aca9742). Report is 1 commit behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #1980 +/- ##
==========================================
- Coverage 96.37% 95.44% -0.93%
==========================================
Files 480 480
Lines 37855 38028 +173
==========================================
- Hits 36482 36295 -187
- Misses 1373 1733 +360
| Flag | Coverage Δ | |
|---|---|---|
| unittests | 95.44% <97.88%> (-0.93%) | :arrow_down: |
Do you think the MPI failures are really unrelated?
No, I am not sure. I just know that we had stalling CI jobs before. And, looking through the recent failures, I think it was not always the same elixir.
I get the feeling that the MPI tests are too big now and take too long. We probably have to split them up similar to the serial tests.
Yes, could be related to OOM issues, cf. https://github.com/trixi-framework/Trixi.jl/issues/1471.
I could narrow it down. It has something to do with Julia 1.10.4. With Julia 1.10.2 it does not stall. Investigating ...
Are you able to reproduce the problem locally?
Yes! With Julia 1.10.2 the t8code MPI tests run successfully. However, with Julia 1.10.4 the MPI test for elixir_advection_restart.jl 2D stalls for whatever reason. Running the elixir with MPI directly (not wrapped in a test set) does not stall. No idea, right now, what's going on ...
Are you sure it's related to the patch version bump? Are you using an identical Manifest.toml for both tests?
Yes! Working from the exact same project folder. Just pointing the Julia binary to either 1.10.2 or 1.10.4.
So it consistently stalls with Julia 1.10.4, but consistently works with Julia 1.10.2 in multiple runs? Did you monitor RAM usage during the simulation?
Yes! RAM usage is not out of the ordinary.
I think I found the bug causing the stalls in the MPI runs. It was a silent memory leak/segfault. I added the fixes in the last commit. Furthermore, I changed the t8code C interface a tiny bit to simplify the code on the Trixi side. This PR has to wait for the next breaking t8code release, and specifically for the merge of this PR: https://github.com/DLR-AMR/t8code/pull/1115.
I'll try to push for a major t8code release by the end of next week.
t8code 3.0.0 has been released and @jmark already updated t8code_jll.jl. Does T8code.jl need an update as well?
Yes, indeed! Working on that.
Concerning the failing invalidations check, see https://github.com/timholy/SnoopCompile.jl/issues/397
@benegee As expected, several tests fail since some routines became obsolete with the breaking t8code 3.0.0 release. We have to iron out these issues first.
I suggest we fix these problems in https://github.com/trixi-framework/Trixi.jl/pull/1939 first, since the changes are minimal on the Trixi side. Then we can merge that PR into the checkpointing PR.