E3SM
MMF Gordon Bell test fails to build on Frontier
While trying to test the planned Gordon Bell configuration on Frontier I initially ran into various build errors, but many of them were fixed by merging in the simulations/cess-production branch from the SCREAM fork. But I'm still left with a linking error that seems similar to what @xyuan and @jgfouca have reported on Slack. @abbotts @sarats branch => whannah/Gordon-Bell-2024-test
Here is the error I'm getting on my current branch: ld.lld: error: undefined symbol: _gfortran_erfc_scaled_r8
> grep "Error" /lustre/orion/cli115/proj-shared/hannah6/e3sm_scratch/frontier/tests/SMS_Ln3.ne4pg2_ne4pg2.F2010-MMF1.frontier_crayclanggpu.20240105_120034_mkfnqo/bld/e3sm.bldlog.240105-120429 -B10
ld.lld: error: undefined symbol: _gfortran_erfc_scaled_r8
>>> referenced by The Cpu Module
>>> Tracer1beckBGCReactionsType.F90.o:(calc_bgc_reaction$tracer1beckbgcreactionstype_) in archive ../lnd/liblnd.a
>>> referenced by The Cpu Module
>>> Tracer1beckBGCReactionsType.F90.o:(calc_bgc_reaction$tracer1beckbgcreactionstype_) in archive ../lnd/liblnd.a
>>> referenced by The Cpu Module
>>> Tracer1beckBGCReactionsType.F90.o:(calc_bgc_reaction$tracer1beckbgcreactionstype_) in archive ../lnd/liblnd.a
>>> referenced 3 more times
clang-15: error: linker command failed with exit code 1 (use -v to see invocation)
Target /lustre/orion/cli115/proj-shared/hannah6/e3sm_scratch/frontier/tests/SMS_Ln3.ne4pg2_ne4pg2.F2010-MMF1.frontier_crayclanggpu.20240105_120034_mkfnqo/bld/e3sm.exe built in 23.205038 seconds
gmake[2]: *** [cmake/cpl/CMakeFiles/e3sm.exe.dir/build.make:569: /lustre/orion/cli115/proj-shared/hannah6/e3sm_scratch/frontier/tests/SMS_Ln3.ne4pg2_ne4pg2.F2010-MMF1.frontier_crayclanggpu.20240105_120034_mkfnqo/bld/e3sm.exe] Error 1
gmake[2]: Leaving directory '/lustre/orion/cli115/proj-shared/hannah6/e3sm_scratch/frontier/tests/SMS_Ln3.ne4pg2_ne4pg2.F2010-MMF1.frontier_crayclanggpu.20240105_120034_mkfnqo/bld/cmake-bld'
gmake[1]: *** [CMakeFiles/Makefile2:717: cmake/cpl/CMakeFiles/e3sm.exe.dir/all] Error 2
gmake[1]: Leaving directory '/lustre/orion/cli115/proj-shared/hannah6/e3sm_scratch/frontier/tests/SMS_Ln3.ne4pg2_ne4pg2.F2010-MMF1.frontier_crayclanggpu.20240105_120034_mkfnqo/bld/cmake-bld'
gmake: *** [Makefile:94: all] Error 2
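For anyone triaging a similar log, here is a minimal Python sketch that pulls the unique undefined symbols out of linker output like the above. The function name `undefined_symbols` and the sample text are illustrative only, and the regex assumes the `undefined symbol: NAME` wording that ld.lld prints in this log.

```python
import re
from collections import Counter

# Assumes the "undefined symbol: NAME" wording seen in the ld.lld output above.
UNDEFINED_RE = re.compile(r"undefined symbol:\s*(\S+)")


def undefined_symbols(log_text):
    """Count how often each undefined symbol appears in linker output."""
    return Counter(m.group(1) for m in UNDEFINED_RE.finditer(log_text))


# Illustrative sample, shortened from the build log in this thread.
sample = """ld.lld: error: undefined symbol: _gfortran_erfc_scaled_r8
>>> referenced by The Cpu Module
ld.lld: error: undefined symbol: _gfortran_erfc_scaled_r8
clang-15: error: linker command failed with exit code 1 (use -v to see invocation)
"""

for name, count in undefined_symbols(sample).items():
    print(f"{name}: reported {count} time(s)")
```

Running it over the full e3sm.bldlog file instead of `sample` would give a short list of symbols to chase rather than pages of `referenced by` lines.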
Stating the obvious: this is trying to link in a GNU symbol while building with Cray compilers?
I'm not sure why building with OpenMP would trigger a link error with this function: https://fortranwiki.org/fortran/show/erfc_scaled
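For context on what the missing symbol computes: erfc_scaled is the Fortran 2008 intrinsic for the exponentially scaled complementary error function, exp(x**2) * erfc(x). The sketch below is a naive Python rendering of that formula for illustration, not the libgfortran implementation; the reading of `_gfortran_erfc_scaled_r8` as the real(8) runtime entry point is an assumption based on the symbol name.

```python
import math


def erfc_scaled(x):
    """Naive exp(x**2) * erfc(x); illustrative only.

    A production implementation avoids forming exp(x**2) directly,
    because this form overflows for large x even though the true
    result stays finite.
    """
    return math.exp(x * x) * math.erfc(x)


print(erfc_scaled(0.0))
print(erfc_scaled(5.0))
```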
@jgfouca had linking errors that turned out to be from the ulimit in his environment being too low.
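A quick way to rule the environment in or out is to print the current limits before building. This is a sketch using Python's standard `resource` module (Unix only); the thread does not say which limit was the problem, so the set of limits listed here is a guess.

```python
import resource

# Which limit mattered is not stated in the thread; these are assumptions.
CANDIDATES = ("RLIMIT_NOFILE", "RLIMIT_STACK", "RLIMIT_AS", "RLIMIT_NPROC")


def current_limits():
    """Return {limit name: (soft, hard)} for the limits this platform defines."""
    return {
        name: resource.getrlimit(getattr(resource, name))
        for name in CANDIDATES
        if hasattr(resource, name)
    }


for name, (soft, hard) in current_limits().items():
    # RLIM_INFINITY means "unlimited".
    show = lambda v: "unlimited" if v == resource.RLIM_INFINITY else v
    print(f"{name}: soft={show(soft)} hard={show(hard)}")
```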
Btw, the SCREAM branches were created to mimic the Frontier configuration used in E3SM/main but perhaps now involve a few new things as software configurations changed. To summarize, there shouldn't be any radical changes but you need to carefully review the edits you are bringing in by merging the scream/cess branch.
Merging the scream branch into your branch is probably bringing in a whole lot of commits which are not needed. Just cherry-pick machine changes that are needed.
I think the symbol and gfortran dependency is coming from libcraymath, which requires libgfortran. We forward some math functions to it. I'm surprised it's not already getting linked in, though.
Perhaps adding the -craype-verbose flag to the link line would show us if it's missing
@whannah1 , I was able to get past that particular symbol by adding -lgfortran to the link line. I also had to add -lmpifort. I am still struggling with a couple more missing symbols (_dgemm_ and _idmax_).
dgemm and idmax are from BLAS. Do you have either libsci (preferred) or reference BLAS (netlib) linked?
@sarats , I was trying to use openblas, but the cess_production branch did need the openblas module, so something is wrong. I'm looking at it.
Why would we need OpenBLAS instead of libsci on a machine like Frontier?
@sarats , thanks for the tip, -lsci_cray did resolve those missing symbols. I will try to redo the build without openblas.
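To summarize the fixes reported so far, here is a small Python sketch that maps an undefined symbol to the link flag that resolved it in this thread. The helper `suggest_flag` and the rule table are hypothetical, and the rules only encode what was reported above (a `_gfortran_` prefix resolved by -lgfortran, the BLAS names resolved by -lsci_cray). The thread does not say which symbols needed -lmpifort, so that flag is left out.

```python
# Rules taken from this thread only; not a general symbol-to-library database.
BLAS_NAMES = {"dgemm", "idmax"}  # spelled as reported in the thread

RULES = [
    (lambda sym: sym.startswith("_gfortran_"), "-lgfortran"),
    (lambda sym: sym.strip("_").lower() in BLAS_NAMES, "-lsci_cray"),
]


def suggest_flag(symbol):
    """Return the link flag reported to resolve `symbol`, or None if unknown."""
    for matches, flag in RULES:
        if matches(symbol):
            return flag
    return None


for sym in ("_gfortran_erfc_scaled_r8", "_dgemm_", "_idmax_", "some_other_symbol"):
    print(sym, "->", suggest_flag(sym))
```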
@jgfouca Can you point me to your branch so I can see your changes?
@sarats good point about the unnecessary changes. I considered doing the merge "by hand" one file at a time when I was hit with a bunch of merge conflicts, but I figured it was quicker to merge in those unnecessary changes instead.
For future reference, if you can't use libsci for whatever reason, use AMD's libraries for AMD CPUs, which are based on BLIS, a decent open source alternative.
https://www.amd.com/en/developer/aocl/dense.html
@whannah1 You probably need a few module updates in config_machines and maybe some minor things in cmake macros. At least, it would make it easy for me to review. Don't bring in other stuff unless you are really sure it's needed.
ok, I'm gonna go back to my basic branch before any machine changes and make some incremental changes based on @jgfouca 's branch
@whannah1 , I think I am very close to having a working build if you can wait like 30m.
@jgfouca is this fixed on E3SM master?
@rljacob , no, it's fixed in SCREAM.