Add SCREAMv1 test to e3sm_gpucxx suite

Open brhillman opened this issue 1 year ago • 42 comments

Add F2010-SCREAMv1 test to e3sm_gpucxx suite to get test coverage on GPU for EAMxx codebase in main E3SM repo.

[BFB]

brhillman avatar Mar 02 '23 15:03 brhillman

./cime/scripts/create_test ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1 --compiler gnugpu

with latest master on Summit runs into build errors:

components/eamxx/src/dynamics/homme/atmosphere_dynamics.cpp(817): error: class "Homme::SimulationParams" has no member "rearth"
/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/include/c++/9.1.0/ext/new_allocator.h(147): error: no instance of constructor "Homme::Diagnostics::Diagnostics" matches the argument list
            argument types are: (int, __nv_bool)

On Crusher with --compiler crayclanggpu:

externals/ekat/extern/kokkos/core/src/../../tpls/desul/include/desul/atomics/Generic.hpp:618:9: error: call to 'atomic_fetch_add' is ambiguous
  (void)atomic_fetch_add(dest, val, order, scope);
        ^~~~~~~~~~~~~~~~

Does this look familiar in runs of this case with SCREAM repo?

amametjanov avatar Mar 04 '23 00:03 amametjanov

Thanks @amametjanov. This actually passes for me on the SCREAM repo, and I confirmed that it fails with a local master merge into E3SM on Summit. I'll try to figure out what's going on with the master merge.

brhillman avatar Mar 07 '23 14:03 brhillman

Okay, I see what happened. It looks like #5481 changed some things in hommexx that haven't been fixed upstream in E3SM yet. The fixes are on the SCREAM repo, so I will need to open another PR to pull those into E3SM I think, or pull those into this PR (looks like the diffs only affect sources in the eamxx directory, plus shoc).

brhillman avatar Mar 07 '23 15:03 brhillman

@rljacob I had to pull in a scream->e3sm merge because there were conflicts between the two repos since this PR was opened that prevented SCREAMv1 from building on the E3SM repo. I can move this to a separate PR, or rename this one to reflect this. Was waiting on my tests to update this PR (which just passed).

brhillman avatar Apr 05 '23 22:04 brhillman

Please make a separate PR.

rljacob avatar Apr 05 '23 22:04 rljacob

If you need to resync SCREAM with E3SM and its 300+ commits, that should be its own PR.

rljacob avatar Apr 05 '23 23:04 rljacob

This requires #5582 to fix SCREAMv1 build errors.

brhillman avatar Apr 05 '23 23:04 brhillman

> If you need to resync SCREAM with E3SM and its 300+ commits, that should be its own PR.

Done. My reasoning here was to consider this PR the upstream merge, and as a side effect add the test to make sure it doesn’t break again, rather than considering the upstream merge the effect. In any event, #5582 is opened to bring EAMxx up to date, then this PR will just add the test.

brhillman avatar Apr 05 '23 23:04 brhillman

Do we need to add machine-specific cmake in components/eamxx/cmake/machine-files/? From a JLSE run:

-- Found scream component
No macro file found: /gpfs/jlse-fs0/projects/climate/azamat/scratch/J/ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.jlse_oneapi-ifxgpu.JNextGpucxx20230525_060459/cmake_macros/LINUX.cmake
No macro file found: /gpfs/jlse-fs0/projects/climate/azamat/scratch/J/ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.jlse_oneapi-ifxgpu.JNextGpucxx20230525_060459/cmake_macros/jlse.cmake
No macro file found: /gpfs/jlse-fs0/projects/climate/azamat/scratch/J/ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.jlse_oneapi-ifxgpu.JNextGpucxx20230525_060459/cmake_macros/oneapi-ifxgpu_LINUX.cmake
No macro file found: /gpfs/jlse-fs0/projects/climate/azamat/scratch/J/ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.jlse_oneapi-ifxgpu.JNextGpucxx20230525_060459/cmake_macros/oneapi-ifxgpu_jlse.cmake
CMake Error at cmake/build_eamxx.cmake:39 (include):
  include could not find requested file:

    /gpfs/jlse-fs0/projects/climate/testing/E3SM/components/eamxx/cmake/machine-files/jlse.cmake
Call Stack (most recent call first):
  CMakeLists.txt:120 (build_eamxx)

Similar error on Crusher: https://my.cdash.org/viewTest.php?buildid=2341352

Works on Ascent: https://my.cdash.org/viewTest.php?buildid=2341401
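
For reference, the file names in the log above follow a regular pattern built from the OS, machine, and compiler names, and the fatal error is the missing EAMxx machine file. The following is a minimal sketch, inferred only from this log and not from the CIME or EAMxx sources; `candidate_macro_files` and `eamxx_machine_file` are hypothetical helpers, not existing functions:

```python
from pathlib import Path


def candidate_macro_files(os_name: str, machine: str, compiler: str) -> list[str]:
    """Macro file names, in the order the log reports them as not found.

    Hypothetical helper: the naming pattern is inferred from the
    "No macro file found" messages, not taken from CIME itself.
    """
    return [
        f"{os_name}.cmake",
        f"{machine}.cmake",
        f"{compiler}_{os_name}.cmake",
        f"{compiler}_{machine}.cmake",
    ]


def eamxx_machine_file(srcroot: Path, machine: str) -> Path:
    """Path of the EAMxx machine file that the failing include() asked for.

    The directory layout is copied from the error message in the log.
    """
    return srcroot / "components" / "eamxx" / "cmake" / "machine-files" / f"{machine}.cmake"


def missing(paths: list[Path]) -> list[Path]:
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not p.is_file()]


if __name__ == "__main__":
    print(candidate_macro_files("LINUX", "jlse", "oneapi-ifxgpu"))
    print(eamxx_machine_file(Path("E3SM"), "jlse"))
```

Running `missing([eamxx_machine_file(srcroot, machine)])` before a build would flag a machine that has no EAMxx machine file yet, assuming the layout shown in the error message still holds.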

amametjanov avatar May 26 '23 02:05 amametjanov

notes: needs some syncing and retesting. @brhillman will do.

rljacob avatar Jul 06 '23 17:07 rljacob

@brhillman is this ready now?

rljacob avatar Jul 12 '23 05:07 rljacob

If I copy crusher-scream-gpu.cmake to components/eamxx/cmake/machine-files/crusher.cmake, the build succeeds, but the run fails during atm-init with

$ tail /lustre/orion/cli115/proj-shared/azamat/e3sm_scratch/crusher/ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.crusher_crayclanggpu.G.20230714-chk-new-scream-gputest4/run/e3sm.log.350998.230714-173930 
0: (seq_comm_printcomms)    31     0     1     1  ALLESPID:
0: (seq_comm_printcomms)    32     0     8     1  CPLALLESPID:
0: (seq_comm_printcomms)    33     0     1     1  ESP:
0: (seq_comm_printcomms)    34     0     8     1  CPLESP:
0: (seq_comm_printcomms)    35     0     1     1  ALLIACID:
0: (seq_comm_printcomms)    36     0     8     1  CPLALLIACID:
0: (seq_comm_printcomms)    37     0     1     1  IAC:
0: (seq_comm_printcomms)    38     0     8     1  CPLIAC:
srun: error: crusher188: tasks 0-7: Bus error
srun: Terminating StepId=350998.0

amametjanov avatar Jul 14 '23 22:07 amametjanov

Just FYI: testing with PR #5745, which removes the CNL.cmake file, results in errors (https://my.cdash.org/test/85523812).

Essentially, the build is looking for non-existent macro files (Linux.cmake, crusher.cmake and crayclanggpu_Linux.cmake):

No macro file found: /lustre/orion/cli115/proj-shared/sarat/e3sm_scratch/crusher/J/ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.crusher_crayclanggpu.C.JNextGpucxx20230714_154451/cmake_macros/Linux.cmake
No macro file found: /lustre/orion/cli115/proj-shared/sarat/e3sm_scratch/crusher/J/ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.crusher_crayclanggpu.C.JNextGpucxx20230714_154451/cmake_macros/crusher.cmake
No macro file found: /lustre/orion/cli115/proj-shared/sarat/e3sm_scratch/crusher/J/ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.crusher_crayclanggpu.C.JNextGpucxx20230714_154451/cmake_macros/crayclanggpu_Linux.cmake

sarats avatar Jul 16 '23 16:07 sarats

note: build fail is being looked at.

rljacob avatar Aug 10 '23 17:08 rljacob

So, the test passes with crusher-scream-gpu_crayclang-scream for machine_compiler. I guess one option would be to separate the scream and mmf tests into two different suites and use the scream-specific machine/compiler combination for this particular test.

brhillman avatar Aug 10 '23 21:08 brhillman

That's good to know but we want to move away from the scream-specific compiler settings.

rljacob avatar Aug 10 '23 22:08 rljacob

@jgfouca this is the PR about the scream and E3SM build.

rljacob avatar Aug 24 '23 16:08 rljacob

waiting on cmake refactor and then unification of scream and e3sm crusher configs.

rljacob avatar Oct 12 '23 17:10 rljacob

@jgfouca any progress on getting scream and e3sm machine/compiler descriptions unified?

rljacob avatar Dec 04 '23 19:12 rljacob

@rljacob , currently waiting on a couple CIME PRs and then I will do a CIME update. Once that is done, I will do an upstream merge to SCREAM, which will force me to sort out the remaining build system issues.

jgfouca avatar Dec 04 '23 20:12 jgfouca

@brhillman I believe this modified suite will now build and run after merging to next. Can you confirm?

rljacob avatar Jan 29 '24 17:01 rljacob

@amametjanov can you verify that the new version of this suite will build on a system running gpucxx?

rljacob avatar Feb 01 '24 20:02 rljacob

It probably needs to be rebased to get Jim's recent build system changes.

rljacob avatar Feb 01 '24 20:02 rljacob

I'm trying on pm-gpu on latest E3SM master with

./cime/scripts/create_test --machine pm-gpu --compiler gnugpu ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1

but getting

CMake Error at eamxx/src/physics/rrtmgp/CMakeLists.txt:39 (target_link_libraries):
  INTERFACE library can only be used with the INTERFACE keyword of
  target_link_libraries


-- Disabling all warnings for target yakl
CMake Error at /global/u2/a/azamat/saul/E3SM/externals/ekat/cmake/EkatUtils.cmake:102 (target_compile_options):
  target_compile_options may only set INTERFACE properties on INTERFACE
  targets
Call Stack (most recent call first):
  eamxx/src/physics/rrtmgp/CMakeLists.txt:40 (EkatDisableAllWarning)


-- Found CUDA: /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/cuda/11.7 (found version "11.7")
CMake Error at /global/u2/a/azamat/saul/E3SM/externals/ekat/cmake/EkatSetCompilerFlags.cmake:379 (target_compile_options):
  target_compile_options may only set INTERFACE properties on INTERFACE
  targets
Call Stack (most recent call first):
  eamxx/src/physics/rrtmgp/CMakeLists.txt:53 (SetCudaFlags)

amametjanov avatar Mar 07 '24 04:03 amametjanov

@amametjanov , I've fixed that on the EAMXX fork. I will do a downstream merge soon.

jgfouca avatar Mar 07 '24 17:03 jgfouca

The downstream merge is done, merged to next.

jgfouca avatar May 02 '24 17:05 jgfouca

The only place we run this test is pm-gpu, and it looks like SCREAM has a runtime error: https://my.cdash.org/viewTest.php?onlyfailed&buildid=2558361

At least it's not build time!

rljacob avatar May 08 '24 16:05 rljacob

Rob, that job looks like it ran out of walltime.

ndkeen avatar May 08 '24 17:05 ndkeen

@jgfouca how do we increase the walltime for the scream test?

rljacob avatar May 15 '24 21:05 rljacob

@rljacob, there are a few ways. This new test is in the suite e3sm_gpucxx, which does not have a "time" field. We could add this field and set it to something long enough to give the test a chance to finish. As an alternative, if this test is only too slow on one platform, we can go to that machine and run the test by hand: ./create_test $test --walltime=4:00:00. If it passes, it will store the runtime under the $baseline/walltimes area, and our "smart" walltime system will give that time the highest precedence when choosing a default time.
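
To make the options above concrete, here is a rough sketch of a suite entry with a "time" field and of the precedence described. It is an illustration under stated assumptions, not the actual CIME logic: the `SUITES` layout, the `choose_walltime` helper, the "04:00:00" value, and the fallback are all hypothetical, and placing an explicit `--walltime` ahead of the stored baseline time is an assumption.

```python
from typing import Mapping, Optional

TEST = "ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1"

# Assumed layout: a suite entry with an added "time" field (illustrative value).
SUITES = {
    "e3sm_gpucxx": {
        "time": "04:00:00",
        "tests": (TEST,),
    },
}


def choose_walltime(
    test: str,
    suite: str,
    cli_walltime: Optional[str] = None,
    baseline_walltimes: Optional[Mapping[str, str]] = None,
    fallback: str = "01:00:00",  # hypothetical default, not a CIME value
) -> str:
    """Pick a walltime for a test (hypothetical helper).

    Order used in this sketch:
      1. an explicit --walltime value, if one was given (assumption);
      2. a runtime stored from an earlier passing run (baseline walltimes);
      3. the suite's "time" field;
      4. a fallback default.
    """
    if cli_walltime:
        return cli_walltime
    if baseline_walltimes and test in baseline_walltimes:
        return baseline_walltimes[test]
    return SUITES.get(suite, {}).get("time") or fallback


if __name__ == "__main__":
    print(choose_walltime(TEST, "e3sm_gpucxx"))
    print(choose_walltime(TEST, "e3sm_gpucxx", baseline_walltimes={TEST: "02:30:00"}))
```

With only the suite field set, every machine gets the same limit; a stored baseline time lets a single slow platform get its own default without changing the suite.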

jgfouca avatar May 15 '24 22:05 jgfouca