Add SCREAMv1 test to e3sm_gpucxx suite
Add the F2010-SCREAMv1 test to the e3sm_gpucxx suite to get GPU test coverage for the EAMxx codebase in the main E3SM repo.
[BFB]
./cime/scripts/create_test ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1 --compiler gnugpu
with the latest master on Summit runs into build errors:
components/eamxx/src/dynamics/homme/atmosphere_dynamics.cpp(817): error: class "Homme::SimulationParams" has no member "rearth"
/autofs/nccs-svm1_sw/summit/gcc/9.1.0-alpha+20190716/include/c++/9.1.0/ext/new_allocator.h(147): error: no instance of constructor "Homme::Diagnostics::Diagnostics" matches the argument list
argument types are: (int, __nv_bool)
On Crusher with --compiler crayclanggpu:
externals/ekat/extern/kokkos/core/src/../../tpls/desul/include/desul/atomics/Generic.hpp:618:9: error: call to 'atomic_fetch_add' is ambiguous
(void)atomic_fetch_add(dest, val, order, scope);
^~~~~~~~~~~~~~~~
Does this look familiar in runs of this case with SCREAM repo?
Thanks @amametjanov, this actually passes for me on the SCREAM repo, and I confirmed it fails with a local master merge into E3SM on Summit. I'll try to figure out what's going on with the master merge.
Okay, I see what happened. It looks like #5481 changed some things in hommexx that haven't been fixed upstream in E3SM yet. The fixes are on the SCREAM repo, so I think I will need to open another PR to pull those into E3SM, or pull them into this PR (it looks like the diffs only affect sources in the eamxx directory, plus shoc).
@rljacob I had to pull in a scream->e3sm merge because there were conflicts between the two repos since this PR was opened that prevented SCREAMv1 from building on the E3SM repo. I can move this to a separate PR, or rename this one to reflect this. Was waiting on my tests to update this PR (which just passed).
Please make a separate PR.
If you need to resync SCREAM with E3SM and its 300+ commits, that should be its own PR.
This requires #5582 to fix SCREAMv1 build errors.
> If you need to resync SCREAM with E3SM and its 300+ commits, that should be its own PR.
Done. My reasoning here was to treat this PR as the upstream merge, with the test added as a side effect to make sure it doesn't break again, rather than treating the upstream merge as the side effect. In any event, #5582 is open to bring EAMxx up to date; then this PR will just add the test.
Do we need to add machine-specific cmake in components/eamxx/cmake/machine-files/?
From a JLSE run:
-- Found scream component
No macro file found: /gpfs/jlse-fs0/projects/climate/azamat/scratch/J/ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.jlse_oneapi-ifxgpu.JNextGpucxx20230525_060459/cmake_macros/LINUX.cmake
No macro file found: /gpfs/jlse-fs0/projects/climate/azamat/scratch/J/ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.jlse_oneapi-ifxgpu.JNextGpucxx20230525_060459/cmake_macros/jlse.cmake
No macro file found: /gpfs/jlse-fs0/projects/climate/azamat/scratch/J/ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.jlse_oneapi-ifxgpu.JNextGpucxx20230525_060459/cmake_macros/oneapi-ifxgpu_LINUX.cmake
No macro file found: /gpfs/jlse-fs0/projects/climate/azamat/scratch/J/ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.jlse_oneapi-ifxgpu.JNextGpucxx20230525_060459/cmake_macros/oneapi-ifxgpu_jlse.cmake
CMake Error at cmake/build_eamxx.cmake:39 (include):
include could not find requested file:
/gpfs/jlse-fs0/projects/climate/testing/E3SM/components/eamxx/cmake/machine-files/jlse.cmake
Call Stack (most recent call first):
CMakeLists.txt:120 (build_eamxx)
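For reference, the machine files that build_eamxx.cmake is looking for are typically thin wrappers. A minimal sketch of what a missing jlse.cmake might contain is below; the common.cmake include and the Kokkos backend variable are assumptions for illustration, not taken from the actual repo:

```cmake
# components/eamxx/cmake/machine-files/jlse.cmake (hypothetical sketch)
# Pull in settings shared across machines; the common file name is assumed.
include(${CMAKE_CURRENT_LIST_DIR}/common.cmake)

# Machine-specific Kokkos backend selection (illustrative; JLSE has Intel GPUs).
set(Kokkos_ENABLE_SYCL TRUE CACHE BOOL "" FORCE)
```

Whether a per-machine file like this is needed, or the generic fallbacks should suffice, is exactly the question this comment raises.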
Similar error on Crusher: https://my.cdash.org/viewTest.php?buildid=2341352
Works on Ascent: https://my.cdash.org/viewTest.php?buildid=2341401
notes: needs some syncing and retesting. @brhillman will do.
@brhillman is this ready now?
If I copy crusher-scream-gpu.cmake to components/eamxx/cmake/machine-files/crusher.cmake, the build succeeds, but the run fails during atm-init with
$ tail /lustre/orion/cli115/proj-shared/azamat/e3sm_scratch/crusher/ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.crusher_crayclanggpu.G.20230714-chk-new-scream-gputest4/run/e3sm.log.350998.230714-173930
0: (seq_comm_printcomms) 31 0 1 1 ALLESPID:
0: (seq_comm_printcomms) 32 0 8 1 CPLALLESPID:
0: (seq_comm_printcomms) 33 0 1 1 ESP:
0: (seq_comm_printcomms) 34 0 8 1 CPLESP:
0: (seq_comm_printcomms) 35 0 1 1 ALLIACID:
0: (seq_comm_printcomms) 36 0 8 1 CPLALLIACID:
0: (seq_comm_printcomms) 37 0 1 1 IAC:
0: (seq_comm_printcomms) 38 0 8 1 CPLIAC:
srun: error: crusher188: tasks 0-7: Bus error
srun: Terminating StepId=350998.0
Just FYI: testing with PR #5745, which removes the CNL.cmake file (https://my.cdash.org/test/85523812), results in errors. Essentially, it looks for non-existent macro files (Linux.cmake, crusher.cmake, and crayclanggpu_Linux.cmake):
No macro file found: /lustre/orion/cli115/proj-shared/sarat/e3sm_scratch/crusher/J/ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.crusher_crayclanggpu.C.JNextGpucxx20230714_154451/cmake_macros/Linux.cmake
No macro file found: /lustre/orion/cli115/proj-shared/sarat/e3sm_scratch/crusher/J/ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.crusher_crayclanggpu.C.JNextGpucxx20230714_154451/cmake_macros/crusher.cmake
No macro file found: /lustre/orion/cli115/proj-shared/sarat/e3sm_scratch/crusher/J/ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.crusher_crayclanggpu.C.JNextGpucxx20230714_154451/cmake_macros/crayclanggpu_Linux.cmake
note: the build failure is being looked at.
So, the test passes with crusher-scream-gpu_crayclang-scream for machine_compiler. I guess one option would be to separate the scream and mmf tests into two different suites and use the scream-specific machine/compiler combination for this particular test.
That's good to know, but we want to move away from the scream-specific compiler settings.
@jgfouca this is the PR about the scream and E3SM build.
waiting on cmake refactor and then unification of scream and e3sm crusher configs.
@jgfouca any progress on getting scream and e3sm machine/compiler descriptions unified?
@rljacob, currently waiting on a couple of CIME PRs, and then I will do a CIME update. Once that is done, I will do an upstream merge to SCREAM, which will force me to sort out the remaining build system issues.
@brhillman I believe this modified suite will now build and run after merging to next. Can you confirm?
@amametjanov can you verify the new version of this suite will build on a system running gpucxx?
It probably needs to be rebased to get Jim's recent build system changes.
I'm trying on pm-gpu with the latest E3SM master using
./cime/scripts/create_test --machine pm-gpu --compiler gnugpu ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1
but getting:
CMake Error at eamxx/src/physics/rrtmgp/CMakeLists.txt:39 (target_link_libraries):
INTERFACE library can only be used with the INTERFACE keyword of
target_link_libraries
-- Disabling all warnings for target yakl
CMake Error at /global/u2/a/azamat/saul/E3SM/externals/ekat/cmake/EkatUtils.cmake:102 (target_compile_options):
target_compile_options may only set INTERFACE properties on INTERFACE
targets
Call Stack (most recent call first):
eamxx/src/physics/rrtmgp/CMakeLists.txt:40 (EkatDisableAllWarning)
-- Found CUDA: /opt/nvidia/hpc_sdk/Linux_x86_64/22.7/cuda/11.7 (found version "11.7")
CMake Error at /global/u2/a/azamat/saul/E3SM/externals/ekat/cmake/EkatSetCompilerFlags.cmake:379 (target_compile_options):
target_compile_options may only set INTERFACE properties on INTERFACE
targets
Call Stack (most recent call first):
eamxx/src/physics/rrtmgp/CMakeLists.txt:53 (SetCudaFlags)
@amametjanov , I've fixed that on the EAMXX fork. I will do a downstream merge soon.
The downstream merge is done, merged to next.
The only place we run this test is pm-gpu and looks like SCREAM has a runtime error: https://my.cdash.org/viewTest.php?onlyfailed&buildid=2558361
At least it's not build time!
Rob, that job looks like it ran out of walltime.
@jgfouca how do we increase the walltime for the scream test?
@rljacob, there are a few ways. This new test is in the suite e3sm_gpucxx, which does not have a "time" field. We could add this field and set it to something long enough to give the test a chance to finish. As an alternative, if this test is only too slow on one platform, we can go to that machine and run the test by hand: ./create_test $test --walltime=4:00:00. If it passes, the runtime will be stored under the $baseline/walltimes area, and our "smart" walltime system will give that time the highest precedence when choosing a default time.
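To make the first option concrete, here is a hedged sketch of what a suite entry with a "time" field might look like. Suite definitions live in E3SM's cime_config/tests.py as Python dictionaries; the exact test tuple and the 4-hour value below are illustrative, not the actual suite contents:

```python
# Sketch of a test-suite entry in the style of E3SM's cime_config/tests.py.
# The "time" field sets the default walltime for every test in the suite;
# the test list shown here is illustrative only.
e3sm_gpucxx = {
    "time": "04:00:00",  # assumed value: long enough for SCREAMv1 on GPU nodes
    "tests": (
        "ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1",
    ),
}
```

With a "time" field present, all tests in the suite would pick up that walltime by default, instead of relying on per-machine defaults.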