E3SM
Update ekat to a version that has Kokkos 4.2 as submodule
This PR will take time to integrate, I'm opening it so I can keep track of what I check.
- [x] e3sm_integration:
- [x] chrysalis (intel): all PASS
- [x] pm-cpu (intel): 127 PASS and 1 DIFF: SMS_Ld2.ne30pg2_r05_IcoswISC30E3r5.BGCEXP_CNTL_CNPECACNT_1850.pm-cpu_intel.elm-bgcexp (which is currently consistently failing in the `e3sm_integration_next_intel` nightly build)
- [x] e3sm_developer:
- [x] pm-cpu (gnu): 75 PASS and 1 DIFF: ERP_Ld3.ne4pg2_oQU480.F2010.pm-cpu_gnu
- [x] homme_integration
- [x] chrysalis (intel): PASS
- [x] pm-cpu (gnu): builds PASS, stuck in Q for run, so I cancelled it. pm-cpu is not tested in nightlies anyways
- [x] eamxx testing (from eamxx repo, with a few additional commits for eamxx)
- [x] v1 (CIME)
- [x] chrysalis (intel): 10 PASS (all scream v1) 5 DIFF (all scream v0)
- [x] pm-cpu (gnu): 3 PASS, 1 DIFF
- [x] frontier PEND
- [x] ~ascent~: no longer part of eamxx nightlies
- [x] pm-gpu (gnugpu): 7 PASS, 5 DIFF. All DIFF are in debug mode, while all non-debug builds pass. I'm trying to understand what's the catch.
- [x] standalone
- [x] mappy (gnu): all PASS
- [x] weaver (gnu+cuda): all PASS
- [x] v1 (CIME)
@ambrad @oksanaguba can you think of any more machine/testsuite I should run?
PR Preview Action v1.4.7: :rocket: Deployed preview to https://E3SM-Project.github.io/E3SM/pr-preview/pr-6101/ on branch gh-pages at 2024-04-25 18:35 UTC
Your list is quite comprehensive. Crusher seems to be in bad shape, so you might need to run the EAMxx tests on Frontier. The key EAMxx v1 tests are the ERS/P and PEM ones; there aren't baselines, so the only testing is for restart/PE-layout BFBness.
There is one additional set of tests you might consider running, to assure C++/F90 BFBness of the dycore: the Homme standalone tests on Summit or Ascent. I recently added code to make it easy to run these. You can use homme/cmake/machineFiles/summit-bfb.cmake on both Summit and Ascent, and that file does the config necessary to get easy BFB ctest'ing. I usually get an interactive node (`bsub -Is -W 0:60 -nnodes 1 -P cli115 /bin/bash`), start with `ctest -R _ut` just to make sure there's nothing obvious the unit tests see, then proceed with the full test suite.
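For readers less familiar with the term, BFB (bit-for-bit) means two runs produce identical bits, not merely values that agree to some tolerance. The following is a minimal sketch of that idea only; the field values and the helper names `bits` and `is_bfb` are made up for illustration and are not part of CIME or the Homme test harness.

```python
import struct

def bits(x: float) -> bytes:
    """Return the raw IEEE-754 double-precision bytes of x."""
    return struct.pack("<d", x)

def is_bfb(a, b) -> bool:
    """True only if both sequences hold identical bit patterns."""
    return len(a) == len(b) and all(bits(x) == bits(y) for x, y in zip(a, b))

# Hypothetical field values from a continuous run and from restarted runs.
continuous = [0.1 + 0.2, 1.0, 2.5]
restarted_same = [0.1 + 0.2, 1.0, 2.5]
restarted_rounded = [0.3, 1.0, 2.5]  # 0.3 differs from 0.1 + 0.2 in the last bit

print(is_bfb(continuous, restarted_same))     # True
print(is_bfb(continuous, restarted_rounded))  # False
```

The second comparison fails even though the values differ only by rounding, which is why restart and PE-layout tests can be informative without any stored baselines.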
Great, thanks! I thought about Frontier, but I was held back by the fact that we don't have baselines there. However, ERS/P tests can still be useful, even without baselines, so I will def run those (if crusher is still sick).
Keep in mind the baselines issue is true for every platform except Chrysalis for SCREAMv1 tests.
An update on this. I am hitting NaNs on Chrysalis, and I tracked them down to some packed scan operations. The core issue is that, when initializing the result var of a scan op, Kokkos uses the default constructor of "ValueType". For ekat::Pack, that ctor inits everything to NaN (to easily track uninitialized data). I'm discussing with the Kokkos folks why they don't use something like `Kokkos::reduction_identity<ValueType>::sum()`, which seems appropriate. Once I hear back from them, I'll know better how to tackle the issue (which may be "wait for Kokkos 4.3.00 or 4.2.01").
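To make the failure mode concrete, here is a toy sketch in plain Python. It does not use Kokkos or ekat; the `Pack` class, `sum_identity`, and `inclusive_scan` below are hypothetical stand-ins that only mimic the behavior described above (a default constructor that fills every lane with NaN, and a scan whose accumulator starts from a caller-supplied initial value).

```python
import math

class Pack:
    """Toy stand-in for a packed value type; NOT the real ekat::Pack."""

    def __init__(self, values=None, n=4):
        # Default construction fills every lane with NaN, mimicking the
        # "make uninitialized data easy to spot" behavior described above.
        self.values = list(values) if values is not None else [math.nan] * n

    @classmethod
    def sum_identity(cls, n=4):
        # Additive identity: what a sum-scan accumulator should start from.
        return cls([0.0] * n)

    def __add__(self, other):
        return Pack([a + b for a, b in zip(self.values, other.values)])

def inclusive_scan(items, init):
    """Running sum of items, starting the accumulator at init."""
    acc, out = init, []
    for item in items:
        acc = acc + item
        out.append(acc)
    return out

data = [Pack([1.0, 2.0, 3.0, 4.0]), Pack([10.0, 20.0, 30.0, 40.0])]

bad = inclusive_scan(data, Pack())                # default-constructed accumulator
good = inclusive_scan(data, Pack.sum_identity())  # identity accumulator

print(bad[-1].values)   # every lane is NaN
print(good[-1].values)  # [11.0, 22.0, 33.0, 44.0]
```

Because NaN propagates through addition, a NaN-initialized accumulator poisons every partial sum, while starting from the additive identity leaves the data untouched.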
For SCREAMv1 compset testing on Chrysalis, the fails all look the same. The stacktrace (see below) seems to point to some sort of error during MPI initialization, which, besides being completely out of our control, is also completely independent of Kokkos.
25: forrtl: error (65): floating invalid
25: Image PC Routine Line Source
25: libpnetcdf.so.3.0 000015555171E68C for__signal_handl Unknown Unknown
25: libpthread-2.28.s 00001555453EFCF0 Unknown Unknown Unknown
25: libucp.so.0.0.0 00001555402AA57F ucp_proto_perf_en Unknown Unknown
25: libucp.so.0.0.0 00001555402AAA50 ucp_proto_init_pa Unknown Unknown
25: libucp.so.0.0.0 00001555402AB8EC ucp_proto_common_ Unknown Unknown
25: libucp.so.0.0.0 00001555402B127C ucp_proto_multi_i Unknown Unknown
25: libucp.so.0.0.0 00001555402E013A Unknown Unknown Unknown
25: libucp.so.0.0.0 00001555402B1CDB Unknown Unknown Unknown
25: libucp.so.0.0.0 00001555402B2BB2 Unknown Unknown Unknown
25: libucp.so.0.0.0 00001555402B2DC4 ucp_proto_select_ Unknown Unknown
25: libucp.so.0.0.0 00001555402B39A7 ucp_proto_select_ Unknown Unknown
25: libucp.so.0.0.0 00001555402A30D8 Unknown Unknown Unknown
25: libucp.so.0.0.0 00001555402A330E ucp_worker_get_ep Unknown Unknown
25: libucp.so.0.0.0 0000155540309ADD ucp_wireup_init_l Unknown Unknown
25: libucp.so.0.0.0 000015554028CF75 ucp_ep_create_to_ Unknown Unknown
25: libucp.so.0.0.0 000015554028D714 Unknown Unknown Unknown
25: libucp.so.0.0.0 000015554028DB8E ucp_ep_create Unknown Unknown
25: libmpi.so.40.30.3 0000155545AA7607 mca_pml_ucx_add_p Unknown Unknown
25: libmpi.so.40.30.3 0000155545B0D723 ompi_mpi_init Unknown Unknown
25: libmpi.so.40.30.3 00001555458D004D MPI_Init Unknown Unknown
25: libmpi_mpifh.so.4 0000155545E729D7 PMPI_Init_f08 Unknown Unknown
25: e3sm.exe 0000000000437E05 cime_comp_mod_mp_ 708 cime_comp_mod.F90
25: e3sm.exe 0000000000499955 MAIN__ 63 cime_driver.F90
25: e3sm.exe 0000000000437D22 Unknown Unknown Unknown
25: libc-2.28.so 0000155545052D85 __libc_start_main Unknown Unknown
25: e3sm.exe 0000000000437C2E Unknown Unknown Unknown
But scream nightlies are also getting that error, and Rob mentioned an upgrade to chrys drivers that is causing issues, with a fix worked on by ANL folks. No need to sweat on chrys fails (yet).
@bartgol Chrysalis had some updates last week that may have caused the MPI fails. Please try your tests again.
@bartgol how is this going?
@rljacob I was out almost 2 weeks due to knee surgery. I am back now, and this is a priority on my todo list. I think I just need to check EAMxx testing on frontier, and then we can integrate. It's a pain to test so many testsuites manually, since by the time I figure out the fix for one DIFF/FAIL, some other build will fail due to master baselines being updated (forcing a rebase). So as soon as I confirm that eamxx on frontier is ok, I would like to merge to next, to start integration.
@rljacob I think this branch is ready for integration. Can we pipeline it? I think there were 2 diffs in total, but keeping up with rebases was a pain, so I'd like to give it a shot with next testing...
pipeline it? github says there's no conflicts.
I mean, I don't know if next is open, and/or if other PRs were already scheduled for integration. I just want this to be put in line.
Pinging @jgfouca as well, since he's the assignee.
Btw, @rljacob this PR includes the mod that is pipelined in eamxx via E3SM-Project/scream#2799. Would you like to do a similar PR in E3SM first, and then integrate this PR?
No, it's ok to be in this PR.
Is this ready to merge to next?
Jim, I think we can merge to next, yes.
Merged to next.
Update: we reverted the merge to next, since it will likely conflict with #6226. We will resume integration of this PR once that one is merged.
Reverted off of next.
Merged to next
There are a bunch of fails on CDash for next, as of May 9th. Excluding the I and G cases, which should not depend on ekat/kokkos, we have the builds listed below. As I go through the builds, I'll add an explanation of the fails next to them, and if they are not this PR's fault, I'll check them off.
pm-cpu, e3sm_integration_next_intel:
- [x] SMS_Ln3.ne4pg2_ne4pg2.F2010-MMF2.pm-cpu_intel: build FAIL, but across builds and also in master.
- [x] SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_intel.allactive-wcprod: DIFF fail with `File 'xyz' had no original counterpart in '<CASE>/run' with suffix ''`. next is not generating the `eam.h5` output stream. Not this PR's fault.
- [x] SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.pm-cpu_intel.allactive-wcprodssp: next is not generating the `eam.h5` and `eam.h6` output streams. Not this PR's fault.
chrysalis, e3sm_integration_next_intel:
- [x] SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.chrysalis_intel.allactive-wcprodssp: fails in master as well
- [x] SMS_Ln3.ne4pg2_ne4pg2.F2010-MMF2.chrysalis_intel: fails in master as well
pm-cpu, e3sm_prod_next_intel:
- [x] SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.pm-cpu_intel.allactive-wcprodssp: next is not generating the `eam.h5` and `eam.h6` output streams. Not this PR's fault.
- [x] SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP585.pm-cpu_intel.allactive-wcprodssp: FAIL due to problem retrieving input data. Not this PR's fault.
compy, e3sm_prod_next_intel:
- [x] SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.compy_intel.allactive-wcprodssp: FAIL due to problem retrieving input data. Not this PR's fault.
- [x] SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP585.compy_intel.allactive-wcprodssp: next is not generating the `eam.h5` and `eam.h6` output streams. Not this PR's fault.
mappy, e3sm_developer_next_gnu:
- [x] SMS_D_Ln5.ne4pg2_oQU480.F2010.mappy_gnu: I get a segfault in both next and master
- [x] SMS_R_Ld5.ne4_ne4.FSCM-ARM97.mappy_gnu.eam-scm: I get same DIFF in next and master
anvil, e3sm_prod_next_intel: all these jobs seem to hit some batch scheduler issue. They either get canceled while running, or they are submitted but never produce any log in RUNDIR. It has been like this for a few days. I'm thinking it's nothing to do with this PR.
- [ ] SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.F20TR.anvil_intel.eam-wcprod_F20TR
- [ ] SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850-1pctCO2.anvil_intel.allactive-wcprod_1850_1pctCO2
- [ ] SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850-4xCO2.anvil_intel.allactive-wcprod_1850_4xCO2
- [ ] SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.anvil_intel.allactive-wcprod_1850
- [ ] SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP370.anvil_intel.allactive-wcprodssp
- [ ] SMS_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCLSSP585.anvil_intel.allactive-wcprodssp
- [ ] SMS_Ld1_PS.northamericax4v1pg2_WC14to60E2r3.WCYCL1850.anvil_intel.allactive-wcprodrrm_1850
- [ ] SMS_Ln5.ne30pg2_r05_IcoswISC30E3r5.F2010.anvil_intel.eam-wcprod_F2010
bebop, e3sm_extra_coverage_next_intel:
- [ ] ERP_Ld3.ne30pg2_r05_IcoswISC30E3r5.F2010.bebop_intel.allactive-pioroot1
- [ ] ERP_Ld3.ne4pg2_oQU480.F2010.bebop_intel.eam-condidiag_dcape
- [ ] ERP_Ld3.ne4pg2_oQU480.F2010.bebop_intel.eam-condidiag_rhi
- [ ] ERP_Lm3.ne4pg2_oQU480.F2010.bebop_intel
- [ ] ERS_Ld31.ne4pg2_oQU480.F2010.bebop_intel
- [ ] ERS_Ld5.ne30pg2_r05_IcoswISC30E3r5.F2010.bebop_intel.eam-implicit_stress
- [ ] SMS_D_Ln5.ne30pg2_r05_IcoswISC30E3r5.F2010.bebop_intel
- [ ] SMS_D_Ln5.ne45pg2_ne45pg2.FAQP.bebop_intel
- [ ] SMS_D_Ln5.ne4pg2_oQU480.F2010.bebop_intel.eam-implicit_stress
- [ ] SMS_Lm1.ne4pg2_oQU480.F2010.bebop_intel
- [ ] SMS_Ly1.ne4pg2_oQU480.F2010.bebop_intel
pm-cpu, e3sm_superbfb_next_intel:
- [x] PET_Ld3_D.ne30pg2_EC30to60E2r2.WCYCL1850.pm-cpu_intel.pemod-omp2: now PASSes
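The triage above groups failing tests by machine and skips the I and G cases. A rough sketch of how such a filter could be scripted is shown below. It assumes the test names follow the dot-separated layout seen in this thread (test type, grid, compset, then an optional machine_compiler and an optional test mod) and that no field contains a dot. `ParsedTest`, `parse_test_name`, and `is_i_or_g_case` are hypothetical helpers, not CIME APIs, and the prefix check on the compset alias is a crude assumption that real aliases do not always follow.

```python
from typing import NamedTuple, Optional

class ParsedTest(NamedTuple):
    test_type: str
    grid: str
    compset: str
    machine: Optional[str]
    compiler: Optional[str]
    testmod: Optional[str]

def parse_test_name(name: str) -> ParsedTest:
    """Split a dot-separated test name into its assumed fields."""
    parts = name.split(".")
    if len(parts) < 3:
        raise ValueError(f"expected at least type.grid.compset, got {name!r}")
    test_type, grid, compset = parts[:3]
    machine = compiler = testmod = None
    if len(parts) > 3:
        # Assumption: the compiler is whatever follows the last underscore.
        machine, _, compiler = parts[3].rpartition("_")
    if len(parts) > 4:
        testmod = ".".join(parts[4:])
    return ParsedTest(test_type, grid, compset, machine, compiler, testmod)

def is_i_or_g_case(test: ParsedTest) -> bool:
    """Crude guess: compset aliases starting with I or G are skipped."""
    return test.compset[:1] in ("I", "G")

names = [
    "SMS_Ln3.ne4pg2_ne4pg2.F2010-MMF2.pm-cpu_intel",
    "SMS_D_Ld1.ne30pg2_r05_IcoswISC30E3r5.WCYCL1850.pm-cpu_intel.allactive-wcprod",
    "SMS.hypothetical_grid.IEXAMPLE.chrysalis_intel",  # made-up I case
]

to_check = [n for n in names if not is_i_or_g_case(parse_test_name(n))]
for n in to_check:
    print(n)
```

With the three sample names, only the first two survive the filter; the made-up I case is dropped.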
@jgfouca @rljacob I went through the yellow boxes of the MustPass and MustPass_wBaseline builds on cdash. I only checked F cases, since from what I understand CRYO/G/I cases are not using active atm, so they are not building kokkos.
For all failures I found a reason that seems to be unrelated to this PR. The only builds I can't deem as "ok" (at least from the point of view of merging this PR) are the bebop builds, since we need the new modules PR to go in for kokkos 4.2 to be happy.
I am thinking that we could merge this PR as is, since the passes with Intel on other platforms make me confident we won't have many surprises once the bebop modules PR goes in (but I will of course keep an eye out, and jump in if F cases still fail due to kokkos shenanigans once that PR goes in).
What are your thoughts?
Yes, it's fine to merge this without waiting for the bebop fixes.