Update ekat to a version that has Kokkos 4.2 as submodule

Open bartgol opened this issue 1 year ago • 21 comments

This PR will take time to integrate, I'm opening it so I can keep track of what I check.

  • [x] e3sm_integration:
    • [x] chrysalis (intel): all PASS
    • [x] pm-cpu (intel) 127 PASS and 1 DIFF: SMS_Ld2.ne30pg2_r05_IcoswISC30E3r5.BGCEXP_CNTL_CNPECACNT_1850.pm-cpu_intel.elm-bgcexp (which is currently consistently failing in the e3sm_integration_next_intel nightly build)
  • [x] e3sm_developer:
    • [x] pm-cpu (gnu): 75 PASS and 1 DIFF: ERP_Ld3.ne4pg2_oQU480.F2010.pm-cpu_gnu
  • [x] homme_integration
    • [x] chrysalis (intel): PASS
    • [x] pm-cpu (gnu): builds PASS, but the run was stuck in the queue, so I cancelled it. pm-cpu is not tested in nightlies anyway
  • [x] eamxx testing (from eamxx repo, with a few additional commits for eamxx)
    • [x] v1 (CIME)
      • [x] chrysalis (intel): 10 PASS (all scream v1) 5 DIFF (all scream v0)
      • [x] pm-cpu (gnu): 3 PASS, 1 DIFF
      • [x] frontier PEND
      • [x] ~ascent~: no longer part of eamxx nightlies
      • [x] pm-gpu (gnugpu): 7 PASS, 5 DIFF. All DIFF are in debug mode, while all non-debug builds pass. I'm trying to understand what's the catch.
    • [x] standalone
      • [x] mappy (gnu): all PASS
      • [x] weaver (gnu+cuda): all PASS

@ambrad @oksanaguba can you think of any more machine/testsuite I should run?

bartgol avatar Dec 05 '23 23:12 bartgol

PR Preview Action v1.4.7: :rocket: Deployed preview to https://E3SM-Project.github.io/E3SM/pr-preview/pr-6101/ on branch gh-pages at 2024-04-25 18:35 UTC

github-actions[bot] avatar Dec 05 '23 23:12 github-actions[bot]

Your list is quite comprehensive. Crusher seems to be in bad shape, so you might need to run the EAMxx tests on Frontier. The key EAMxx v1 tests are the ERS/P and PEM ones; there aren't baselines, so the only testing is for restart/PE-layout BFBness.

There is one additional set of tests you might consider running, to ensure C++/F90 BFBness of the dycore: the Homme standalone tests on Summit or Ascent. I recently added code to make it easy to run these. You can use homme/cmake/machineFiles/summit-bfb.cmake on both Summit and Ascent; that file does the configuration needed for easy BFB ctest runs. I usually get an interactive node (bsub -Is -W 0:60 -nnodes 1 -P cli115 /bin/bash), start with ctest -R _ut just to make sure the unit tests don't catch anything obvious, then proceed with the full test suite.
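As a rough illustration of what "BFB" (bit-for-bit) means as a pass criterion, here is a minimal Python sketch. The helper name `bfb_equal` is hypothetical and is not part of Homme, CIME, or ctest; the real tests compare model output rather than Python lists. The point is only that BFB equality is stricter than a tolerance check and even stricter than `==` on floats.

```python
import math
import struct


def bfb_equal(a, b):
    """Return True if two sequences of floats match bit for bit.

    Hypothetical helper for illustration only: values are packed as
    IEEE-754 doubles and the raw bytes are compared.
    """
    if len(a) != len(b):
        return False
    return struct.pack(f"{len(a)}d", *a) == struct.pack(f"{len(b)}d", *b)


# Close within round-off, but not bit-for-bit identical.
x = [0.1 + 0.2]
y = [0.3]
close_but_not_bfb = math.isclose(x[0], y[0]) and not bfb_equal(x, y)

# Signed zeros compare equal with ==, yet their bit patterns differ.
zeros_differ = (0.0 == -0.0) and not bfb_equal([0.0], [-0.0])
```

Under this criterion, a restart or PE-layout test only passes if the two runs produce identical bits, which is why such tests stay useful even on machines without stored baselines.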

ambrad avatar Dec 06 '23 03:12 ambrad

Great, thanks! I thought about Frontier, but I was held back by the fact that we don't have baselines there. However, ERS/P tests can still be useful, even without baselines, so I will def run those (if crusher is still sick).

bartgol avatar Dec 06 '23 17:12 bartgol

> Great, thanks! I thought about Frontier, but I was held back by the fact that we don't have baselines there. However, ERS/P tests can still be useful, even without baselines, so I will def run those (if crusher is still sick).

Keep in mind the baselines issue is true for every platform except Chrysalis for SCREAMv1 tests.

ambrad avatar Dec 06 '23 17:12 ambrad

An update on this: I am hitting NaNs on Chrysalis, and I tracked them down to some packed scan operations. The core issue is that, when initializing the result variable of a scan op, Kokkos uses the default constructor of "ValueType". For ekat::Pack, that ctor initializes every entry to NaN (to make uninitialized data easy to track). I'm discussing with the Kokkos folks why they don't use something like Kokkos::reduction_identity<ValueType>::sum(), which seems appropriate. Once I hear back from them, I'll know how best to tackle the issue (which may be "wait for Kokkos 4.3.00 or 4.2.01").
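To make the failure mode concrete, here is a small pure-Python analogy. It is not Kokkos or ekat code: the `Pack` class, `sum_identity`, and `inclusive_scan` below are hypothetical stand-ins that only mimic the two behaviors described above (a default constructor that fills with NaN, and a scan whose accumulator starts from a default-constructed value).

```python
import math


class Pack:
    """Toy stand-in for a SIMD-style pack; NOT ekat::Pack itself."""

    def __init__(self, values=None, n=4):
        # Mimic a default ctor that fills with NaN so that
        # uninitialized data is easy to spot.
        self.values = list(values) if values is not None else [math.nan] * n

    def __add__(self, other):
        return Pack([a + b for a, b in zip(self.values, other.values)])


def sum_identity(n=4):
    """Additive identity for Pack (all zeros), analogous in spirit to a
    reduction identity for sums."""
    return Pack([0.0] * n)


def inclusive_scan(items, init):
    """Serial inclusive prefix sum that starts accumulating from `init`."""
    acc = init
    out = []
    for item in items:
        acc = acc + item
        out.append(acc)
    return out


data = [Pack([1.0, 2.0, 3.0, 4.0]), Pack([10.0, 20.0, 30.0, 40.0])]

# Accumulator default-constructed: NaN + x is NaN, so every result is NaN.
bad = inclusive_scan(data, Pack())

# Accumulator started from the sum identity: results are the real prefix sums.
good = inclusive_scan(data, sum_identity())
```

In this sketch, one NaN-filled initial value poisons every partial sum, which matches the symptom of NaNs appearing only in packed scans; starting from the additive identity avoids it.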

bartgol avatar Jan 25 '24 16:01 bartgol

For SCREAMv1 compset testing on Chrysalis, the fails all look the same. The stack trace (see below) seems to point to some sort of error during MPI initialization, which, besides being completely out of our control, is also completely independent of Kokkos.

25: forrtl: error (65): floating invalid
25: Image              PC                Routine            Line        Source    
25: libpnetcdf.so.3.0  000015555171E68C  for__signal_handl     Unknown  Unknown
25: libpthread-2.28.s  00001555453EFCF0  Unknown               Unknown  Unknown
25: libucp.so.0.0.0    00001555402AA57F  ucp_proto_perf_en     Unknown  Unknown
25: libucp.so.0.0.0    00001555402AAA50  ucp_proto_init_pa     Unknown  Unknown
25: libucp.so.0.0.0    00001555402AB8EC  ucp_proto_common_     Unknown  Unknown
25: libucp.so.0.0.0    00001555402B127C  ucp_proto_multi_i     Unknown  Unknown
25: libucp.so.0.0.0    00001555402E013A  Unknown               Unknown  Unknown
25: libucp.so.0.0.0    00001555402B1CDB  Unknown               Unknown  Unknown
25: libucp.so.0.0.0    00001555402B2BB2  Unknown               Unknown  Unknown
25: libucp.so.0.0.0    00001555402B2DC4  ucp_proto_select_     Unknown  Unknown
25: libucp.so.0.0.0    00001555402B39A7  ucp_proto_select_     Unknown  Unknown
25: libucp.so.0.0.0    00001555402A30D8  Unknown               Unknown  Unknown
25: libucp.so.0.0.0    00001555402A330E  ucp_worker_get_ep     Unknown  Unknown
25: libucp.so.0.0.0    0000155540309ADD  ucp_wireup_init_l     Unknown  Unknown
25: libucp.so.0.0.0    000015554028CF75  ucp_ep_create_to_     Unknown  Unknown
25: libucp.so.0.0.0    000015554028D714  Unknown               Unknown  Unknown
25: libucp.so.0.0.0    000015554028DB8E  ucp_ep_create         Unknown  Unknown
25: libmpi.so.40.30.3  0000155545AA7607  mca_pml_ucx_add_p     Unknown  Unknown
25: libmpi.so.40.30.3  0000155545B0D723  ompi_mpi_init         Unknown  Unknown
25: libmpi.so.40.30.3  00001555458D004D  MPI_Init              Unknown  Unknown
25: libmpi_mpifh.so.4  0000155545E729D7  PMPI_Init_f08         Unknown  Unknown
25: e3sm.exe           0000000000437E05  cime_comp_mod_mp_         708  cime_comp_mod.F90
25: e3sm.exe           0000000000499955  MAIN__                     63  cime_driver.F90
25: e3sm.exe           0000000000437D22  Unknown               Unknown  Unknown
25: libc-2.28.so       0000155545052D85  __libc_start_main     Unknown  Unknown
25: e3sm.exe           0000000000437C2E  Unknown               Unknown  Unknown

But scream nightlies are also getting that error, and Rob mentioned an upgrade to the chrys drivers that is causing issues, with a fix being worked on by ANL folks. No need to sweat over the chrys fails (yet).

bartgol avatar Mar 14 '24 22:03 bartgol

@bartgol Chrysalis had some updates last week that may have caused the MPI fails. Please try your tests again.

rljacob avatar Mar 21 '24 16:03 rljacob

@bartgol how is this going?

rljacob avatar Apr 12 '24 04:04 rljacob

@rljacob I was out almost 2 weeks due to knee surgery. I am back now, and this is a priority on my todo list. I think I just need to check EAMxx testing on frontier, and then we can integrate. It's a pain to test so many testsuites manually, since by the time I figure out the fix for one DIFF/FAIL, some other build will fail due to master baselines being updated (forcing a rebase). So as soon as I confirm that eamxx on frontier is ok, I would like to merge to next, to start integration.

bartgol avatar Apr 16 '24 00:04 bartgol

@rljacob I think this branch is ready for integration. Can we pipeline it? I think there were 2 diffs in total, but keeping up with rebases was a pain, so I'd like to give it a shot with next testing...

bartgol avatar Apr 25 '24 18:04 bartgol

Pipeline it? GitHub says there are no conflicts.

rljacob avatar Apr 25 '24 18:04 rljacob

I mean, I don't know if next is open, and/or if other PRs were already scheduled for integration. I just want this to be put in line.

Pinging @jgfouca as well, since he's the assignee.

bartgol avatar Apr 25 '24 18:04 bartgol

Btw, @rljacob this PR includes the mod that is pipelined in eamxx via E3SM-Project/scream#2799. Would you like to do a similar PR in E3SM first, and then integrate this PR?

bartgol avatar Apr 25 '24 18:04 bartgol

No, it's OK to be in this PR.

rljacob avatar Apr 26 '24 04:04 rljacob

Is this ready to merge to next?

jgfouca avatar Apr 29 '24 17:04 jgfouca

Jim, I think we can merge to next, yes.

bartgol avatar Apr 29 '24 17:04 bartgol

Merged to next.

jgfouca avatar Apr 29 '24 17:04 jgfouca

Update: we reverted the merge to next, since it will likely conflict with #6226 . We will resume integration of this PR once that one is merged.

bartgol avatar Apr 29 '24 21:04 bartgol

Reverted off of next.

jgfouca avatar Apr 29 '24 21:04 jgfouca

Merged to next

jgfouca avatar May 02 '24 17:05 jgfouca

There are quite a few fails on CDash for next as of May 9th. Excluding the I and G cases, which should not depend on ekat/kokkos, we have the builds listed below. As I go through the builds, I'll add an explanation of each fail next to it, and if it is not this PR's fault, I'll check it off.

pm-cpu, e3sm_integration_next_intel:

  • [x] SMS_Ln3.ne4pg2_ne4pg2.F2010-MMF2.pm-cpu_intel: build FAIL, but across builds and also in master.
  • [x] SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_intel.allactive-wcprod: DIFF fail with File 'xyz' had no original counterpart in '<CASE>/run' with suffix ''. next is not generating eam.h5 output stream. Not this PR's fault.
  • [x] SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.pm-cpu_intel.allactive-wcprodssp: next is not generating eam.h5 and eam.h6 output streams. Not this PR's fault.

chrysalis, e3sm_integration_next_intel:

  • [x] SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.chrysalis_intel.allactive-wcprodssp: fails in master as well
  • [x] SMS_Ln3.ne4pg2_ne4pg2.F2010-MMF2.chrysalis_intel: fails in master as well

pm-cpu, e3sm_prod_next_intel:

  • [x] SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.pm-cpu_intel.allactive-wcprodssp: next is not generating eam.h5 and eam.h6 output streams. Not this PR's fault.
  • [x] SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP585.pm-cpu_intel.allactive-wcprodssp: FAIL due to problem retrieving input data. Not this PR's fault.

compy, e3sm_prod_next_intel:

  • [x] SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.compy_intel.allactive-wcprodssp: FAIL due to problem retrieving input data. Not this PR's fault.
  • [x] SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP585.compy_intel.allactive-wcprodssp: next is not generating eam.h5 and eam.h6 output streams. Not this PR's fault.

mappy, e3sm_developer_next_gnu:

  • [x] SMS_D_Ln5.ne4pg2_oQU480.F2010.mappy_gnu: I get a segfault in both next and master
  • [x] SMS_R_Ld5.ne4_ne4.FSCM-ARM97.mappy_gnu.eam-scm: I get same DIFF in next and master

anvil, e3sm_prod_next_intel: all these jobs seem to hit some batch scheduler issue. They either get canceled while running, or they are submitted but never produce any log in RUNDIR. It has been like this for a few days. I'm thinking it has nothing to do with this PR.

  • [ ] SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.F20TR.anvil_intel.eam-wcprod_F20TR
  • [ ] SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850-1pctCO2.anvil_intel.allactive-wcprod_1850_1pctCO2
  • [ ] SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850-4xCO2.anvil_intel.allactive-wcprod_1850_4xCO2
  • [ ] SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.anvil_intel.allactive-wcprod_1850
  • [ ] SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.anvil_intel.allactive-wcprodssp
  • [ ] SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP585.anvil_intel.allactive-wcprodssp
  • [ ] SMS_Ld1_PS.northamericax4v1pg2_WC14to60E2r3.WCYCL1850.anvil_intel.allactive-wcprodrrm_1850
  • [ ] SMS_Ln5.ne30pg2_r05_IcoswISC30E3r5.F2010.anvil_intel.eam-wcprod_F2010

bebop, e3sm_extra_coverage_next_intel:

  • [ ] ERP_Ld3.ne30pg2_r05_IcoswISC30E3r5.F2010.bebop_intel.allactive-pioroot1
  • [ ] ERP_Ld3.ne4pg2_oQU480.F2010.bebop_intel.eam-condidiag_dcape
  • [ ] ERP_Ld3.ne4pg2_oQU480.F2010.bebop_intel.eam-condidiag_rhi
  • [ ] ERP_Lm3.ne4pg2_oQU480.F2010.bebop_intel
  • [ ] ERS_Ld31.ne4pg2_oQU480.F2010.bebop_intel
  • [ ] ERS_Ld5.ne30pg2_r05_IcoswISC30E3r5.F2010.bebop_intel.eam-implicit_stress
  • [ ] SMS_D_Ln5.ne30pg2_r05_IcoswISC30E3r5.F2010.bebop_intel
  • [ ] SMS_D_Ln5.ne45pg2_ne45pg2.FAQP.bebop_intel
  • [ ] SMS_D_Ln5.ne4pg2_oQU480.F2010.bebop_intel.eam-implicit_stress
  • [ ] SMS_Lm1.ne4pg2_oQU480.F2010.bebop_intel
  • [ ] SMS_Ly1.ne4pg2_oQU480.F2010.bebop_intel

pm-cpu, e3sm_superbfb_next_intel:

  • [x] PET_Ld3_D.ne30pg2_EC30to60E2r2.WCYCL1850.pm-cpu_intel.pemod-omp2: now PASSes
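The triage above is driven by the test names themselves, so a small helper can make the "exclude I and G cases" step mechanical. The sketch below assumes the usual dot-separated layout of the names in these lists (test type with modifiers, grid, compset, then optional machine_compiler and testmods fields). The names `CimeTest`, `parse_test_name`, and `may_depend_on_kokkos` are hypothetical, and the compset-prefix rule is only a naming heuristic based on this thread, not something CIME provides.

```python
from typing import NamedTuple, Optional


class CimeTest(NamedTuple):
    test_type: str                   # e.g. "SMS_Ld1" (type plus modifiers)
    grid: str                        # e.g. "ne4pg2_oQU480"
    compset: str                     # e.g. "F2010"
    machine_compiler: Optional[str]  # e.g. "pm-cpu_intel"
    testmods: Optional[str]          # e.g. "eam-implicit_stress"


def parse_test_name(name: str) -> CimeTest:
    """Split a dot-separated test name into its fields."""
    parts = name.split(".")
    if not 3 <= len(parts) <= 5:
        raise ValueError(f"unexpected test name layout: {name!r}")
    # Pad the optional trailing fields with None.
    padded = parts + [None] * (5 - len(parts))
    return CimeTest(*padded)


# Assumption for illustration: compsets with these prefixes do not run an
# active atmosphere, so they are skipped during Kokkos-related triage.
NO_ACTIVE_ATM_PREFIXES = ("I", "G", "CRYO")


def may_depend_on_kokkos(name: str) -> bool:
    """Heuristic filter: keep tests whose compset is not an I/G/CRYO case."""
    compset = parse_test_name(name).compset
    return not compset.startswith(NO_ACTIVE_ATM_PREFIXES)


example = parse_test_name(
    "SMS_D_Ln5.ne4pg2_oQU480.F2010.bebop_intel.eam-implicit_stress"
)
```

Applied to a CDash failure list, `may_depend_on_kokkos` would narrow the names down to the ones worth inspecting by hand; the per-test explanations still have to come from the logs.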

bartgol avatar May 10 '24 02:05 bartgol

@jgfouca @rljacob I went through the yellow boxes of the MustPass and MustPass_wBaseline builds on cdash. I only checked F cases, since from what I understand CRYO/G/I cases are not using active atm, so they are not building kokkos.

For all failures I found a reason that seems to be unrelated to this PR. The only builds I can't deem "ok" (at least from the point of view of merging this PR) are the bebop builds, since we need the new modules PR to go in for kokkos 4.2 to be happy.

I am thinking that we could merge this PR as is, since the passes with Intel on other platforms make me confident we won't have many surprises once the bebop modules PR goes in (but I will of course keep an eye out, and jump in if F cases still fail due to kokkos shenanigans once that PR goes in).

What are your thoughts?

bartgol avatar May 15 '24 00:05 bartgol

Yes, it's fine to merge this without waiting for the bebop fixes.

rljacob avatar May 15 '24 01:05 rljacob