EAMxx: update EKAT submodule and adapt EAMxx

Open bartgol opened this issue 7 months ago • 6 comments

Update the EKAT submodule and change EAMxx to conform to the new version of EKAT.


A (very disruptive) PR in EKAT (not yet integrated) will break EKAT into sub-packages, to facilitate its use in other applications without the need to bring in every single ekat dependency.

This PR adapts to those changes, which can be summarized in the following points:

  • There is no longer a single ekat library, but a set of ekat::XYZ libraries, including an ekat::AllLibs one (for convenience).
  • No more paths in includes: customers should just include <ekat_blah.hpp>, without any path (they should not care how ekat files are organized in the ekat repo).
  • ekat_kokkos_utils.hpp has been split in two: ekat_reduction_utils.hpp and ekat_team_policy_utils.hpp.
  • ExeSpaceUtils is no longer present. In its place, use TeamPolicyFactory (from ekat_team_policy_utils.hpp) for team policy creation, and ReductionUtils (from ekat_reduction_utils.hpp) for reduction utilities; both are templated on exec space, just like ExeSpaceUtils was.
  • ScalarTraits no longer provides invalid() and quiet_NaN(); it only provides type info. The functions quiet_NaN() and finite_max() have been added in ekat_math_utils.hpp, but may disappear once we have C++20 (see the comment in ekat_pack.hpp about "introspective" constexpr).
  • ekat_file_utils.hpp has been purged, and all tests now simply use the standard library ifstream/ofstream capabilities.
  • There is no longer an "ekat" session. Instead, ekat::KokkosUtils provides initialize_kokkos_session (and the matching finalize).
  • Minor changes related to how we print the current session configuration (some are in ekat::Core, some in ekat::KokkosUtils).
  • EkatCreateUnitTestExec no longer has EXCLUDE_TEST_SESSION, but instead has USER_DEFINED_TEST_SESSION, for more expressivity.
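
Two of the items above lean directly on the standard library, so here is a rough sketch of what downstream code might look like. Everything in the `sketch` namespace is hypothetical: the real quiet_NaN() and finite_max() live in ekat_math_utils.hpp, and their exact signatures are not shown in this thread, so the versions below only assume they behave like thin wrappers over std::numeric_limits. The file helpers just show the plain ifstream/ofstream pattern that can stand in for the removed ekat_file_utils.hpp.

```cpp
#include <cassert>
#include <fstream>
#include <limits>
#include <string>
#include <type_traits>
#include <vector>

namespace sketch {  // hypothetical namespace, NOT part of EKAT

// Assumed stand-in for a quiet_NaN() free function (floating-point types only).
template <typename T>
constexpr T quiet_NaN() {
  static_assert(std::is_floating_point<T>::value,
                "quiet_NaN requires a floating-point type");
  return std::numeric_limits<T>::quiet_NaN();
}

// Assumed stand-in for a finite_max() free function: largest finite value of T.
template <typename T>
constexpr T finite_max() {
  static_assert(std::is_arithmetic<T>::value,
                "finite_max requires an arithmetic type");
  return std::numeric_limits<T>::max();
}

// Write one value per line with std::ofstream; returns false on I/O failure.
inline bool write_values(const std::string& path,
                         const std::vector<double>& vals) {
  assert(!path.empty());
  std::ofstream out(path);
  if (!out) return false;
  // max_digits10 significant digits are enough to round-trip a double.
  out.precision(std::numeric_limits<double>::max_digits10);
  for (double v : vals) out << v << '\n';
  return static_cast<bool>(out);
}

// Read whitespace-separated values back with std::ifstream.
// Returns an empty vector if the file cannot be opened.
inline std::vector<double> read_values(const std::string& path) {
  std::vector<double> vals;
  std::ifstream in(path);
  double v;
  while (in >> v) vals.push_back(v);
  return vals;
}

}  // namespace sketch
```

A test that used to go through the EKAT file helpers would call something like `sketch::write_values` and `sketch::read_values` directly and compare the two vectors.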

The biggest challenge was how to break ekat into N packages, since some utilities were "generic enough", but required knowledge about kokkos (e.g., printing the current arch configuration seems generic enough to be in ekat::Core, but the kokkos backend info requires kokkos to be compiled).

The current result seems to be the best compromise between flexibility (for new customers) and robustness (for existing customers).

I have NOT updated mam4xx/haero to use the new ekat, so all their tests may fail to build (or even configure). I am opening the PR anyway, in the hope of getting a build of everything else until mam4xx is updated too.

IMPORTANT NOTE FOR REVIEWERS: I strongly recommend reviewing the commits individually. I tried to group changes by topic, so reviewing one commit at a time may be simpler; a couple of commits are quite large, but seeing the same pattern over and over may make them easier to review.

bartgol avatar May 16 '25 22:05 bartgol

@tcclevenger @jeff-cohere @jgfouca The deprecation of ekat::any and ekat::enable_shared_from_this is causing a vast number of warnings. I suggest we switch to their std counterparts asap. I see three options:

  1. do it in this PR
  2. do it in a follow-up PR
  3. update ekat now (with current ekat master), and do the std change, THEN merge this PR.

Maybe option 3 is best, even though it ends up updating ekat twice in a short time?
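
For reference, a minimal sketch of the std counterparts in use (C++17). The `Node` class and the `unwrap_int` helper are hypothetical and only illustrate the pattern; whether ekat::any call sites map one-to-one onto std::any is an assumption that would need checking during the switch.

```cpp
#include <any>
#include <cassert>
#include <memory>
#include <string>
#include <utility>

namespace sketch {  // hypothetical namespace, NOT part of EKAT or EAMxx

// Hypothetical class that can hand out a shared_ptr to itself
// through the std base class.
class Node : public std::enable_shared_from_this<Node> {
public:
  explicit Node(std::string name) : m_name(std::move(name)) {}

  // Only valid if this object is already owned by a std::shared_ptr.
  std::shared_ptr<Node> self() { return shared_from_this(); }

  const std::string& name() const { return m_name; }

private:
  std::string m_name;
};

// Retrieve an int stored in a std::any.
// std::any_cast throws std::bad_any_cast if the stored type is not int.
inline int unwrap_int(const std::any& a) {
  assert(a.has_value());
  return std::any_cast<int>(a);
}

}  // namespace sketch
```

With std::any, a type mismatch surfaces as a std::bad_any_cast exception at the cast site, so any code that relied on a different failure mode would need a second look.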

bartgol avatar May 22 '25 20:05 bartgol

1 seems easiest.

jgfouca avatar May 22 '25 20:05 jgfouca

I agree with Jim. 1 would really be "asap". :-)

jeff-cohere avatar May 22 '25 20:05 jeff-cohere

I agree, 1.

tcclevenger avatar May 22 '25 20:05 tcclevenger

PR Preview Action v1.6.2

View preview at https://E3SM-Project.github.io/E3SM/pr-preview/pr-7362/

Built to branch gh-pages at 2025-08-01 19:06 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

github-actions[bot] avatar May 27 '25 22:05 github-actions[bot]

The physics baselines tests are diffing. Before deciding whether or not to accept the diffs, I want to understand why there are diffs. Quite a lot has changed in this PR, so I need to dig a bit to find out where things drifted (could be cmake flags, implicit promotions, different checks ...).
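
To illustrate just one of those suspects, here is a generic sketch of how an implicit promotion can move a single-precision result. It is not the diagnosed cause of these diffs, and the function names are hypothetical. The two functions differ only in the type of the accumulator.

```cpp
#include <cassert>

namespace sketch {  // hypothetical namespace, for illustration only

// Accumulate entirely in single precision.
inline float sum_single(int n, float x) {
  assert(n >= 0);
  float acc = 0.0f;
  for (int i = 0; i < n; ++i) acc += x;
  return acc;
}

// Same loop, but the accumulator is a double, so every addition is
// promoted to double precision before the final narrowing cast.
inline float sum_promoted(int n, float x) {
  assert(n >= 0);
  double acc = 0.0;
  for (int i = 0; i < n; ++i) acc += x;
  return static_cast<float>(acc);
}

}  // namespace sketch
```

For inputs whose partial sums are exactly representable (e.g. four additions of 0.25f) the two agree. For a long sum of 0.1f they generally do not, and a difference like that is enough to trip a bit-for-bit baseline comparison.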

bartgol avatar May 30 '25 03:05 bartgol

The cuda tests seem to hang. I have to investigate.

Edit: the test that hangs is always mam4_aero_microphys_standalone. I am trying to debug via cuda-gdb, but all I was able to deduce so far is that it hangs in the 1st run call. I'm trying to bisect the exact location.

Update: the issue was in mam4xx. A fix is in the pipeline.

bartgol avatar Jul 22 '25 21:07 bartgol