
MMF Gordon Bell test fails to build on Frontier

Open whannah1 opened this issue 1 year ago • 18 comments

While trying to test the planned Gordon Bell configuration on Frontier I initially ran into various build errors, but many of them were fixed by merging in the simulations/cess-production branch from the SCREAM fork. I'm still left with a linking error that seems similar to what @xyuan and @jgfouca have reported on Slack. @abbotts @sarats branch => whannah/Gordon-Bell-2024-test

Here is the error I'm getting on my current branch: ld.lld: error: undefined symbol: _gfortran_erfc_scaled_r8

> grep "Error" /lustre/orion/cli115/proj-shared/hannah6/e3sm_scratch/frontier/tests/SMS_Ln3.ne4pg2_ne4pg2.F2010-MMF1.frontier_crayclanggpu.20240105_120034_mkfnqo/bld/e3sm.bldlog.240105-120429 -B10
ld.lld: error: undefined symbol: _gfortran_erfc_scaled_r8
>>> referenced by The Cpu Module
>>>               Tracer1beckBGCReactionsType.F90.o:(calc_bgc_reaction$tracer1beckbgcreactionstype_) in archive ../lnd/liblnd.a
>>> referenced by The Cpu Module
>>>               Tracer1beckBGCReactionsType.F90.o:(calc_bgc_reaction$tracer1beckbgcreactionstype_) in archive ../lnd/liblnd.a
>>> referenced by The Cpu Module
>>>               Tracer1beckBGCReactionsType.F90.o:(calc_bgc_reaction$tracer1beckbgcreactionstype_) in archive ../lnd/liblnd.a
>>> referenced 3 more times
clang-15: error: linker command failed with exit code 1 (use -v to see invocation)
Target /lustre/orion/cli115/proj-shared/hannah6/e3sm_scratch/frontier/tests/SMS_Ln3.ne4pg2_ne4pg2.F2010-MMF1.frontier_crayclanggpu.20240105_120034_mkfnqo/bld/e3sm.exe built in 23.205038 seconds
gmake[2]: *** [cmake/cpl/CMakeFiles/e3sm.exe.dir/build.make:569: /lustre/orion/cli115/proj-shared/hannah6/e3sm_scratch/frontier/tests/SMS_Ln3.ne4pg2_ne4pg2.F2010-MMF1.frontier_crayclanggpu.20240105_120034_mkfnqo/bld/e3sm.exe] Error 1
gmake[2]: Leaving directory '/lustre/orion/cli115/proj-shared/hannah6/e3sm_scratch/frontier/tests/SMS_Ln3.ne4pg2_ne4pg2.F2010-MMF1.frontier_crayclanggpu.20240105_120034_mkfnqo/bld/cmake-bld'
gmake[1]: *** [CMakeFiles/Makefile2:717: cmake/cpl/CMakeFiles/e3sm.exe.dir/all] Error 2
gmake[1]: Leaving directory '/lustre/orion/cli115/proj-shared/hannah6/e3sm_scratch/frontier/tests/SMS_Ln3.ne4pg2_ne4pg2.F2010-MMF1.frontier_crayclanggpu.20240105_120034_mkfnqo/bld/cmake-bld'
gmake: *** [Makefile:94: all] Error 2
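
For long build logs, a small script can pull out just the undefined symbols and the archives that reference them. This is a minimal sketch, assuming the log uses the same `ld.lld` message layout as the excerpt above; the function name `undefined_symbols` and the regular expressions are illustrative and not part of E3SM or CIME.

```python
import re

# Matches lines such as: "ld.lld: error: undefined symbol: _gfortran_erfc_scaled_r8"
UNDEFINED_RE = re.compile(r"error: undefined symbol: (\S+)")
# Matches the archive named in lines such as: "... in archive ../lnd/liblnd.a"
ARCHIVE_RE = re.compile(r"in archive (\S+)")


def undefined_symbols(log_text):
    """Map each undefined symbol to the set of archives that reference it."""
    result = {}
    current = None
    for line in log_text.splitlines():
        sym = UNDEFINED_RE.search(line)
        if sym:
            current = sym.group(1)
            result.setdefault(current, set())
            continue
        arc = ARCHIVE_RE.search(line)
        if arc and current is not None:
            result[current].add(arc.group(1))
    return result


# Shortened copy of the linker output quoted in this issue.
sample = """\
ld.lld: error: undefined symbol: _gfortran_erfc_scaled_r8
>>> referenced by The Cpu Module
>>>               Tracer1beckBGCReactionsType.F90.o:(calc_bgc_reaction$tracer1beckbgcreactionstype_) in archive ../lnd/liblnd.a
>>> referenced 3 more times
"""
print(undefined_symbols(sample))
```

In practice the text would come from reading the `e3sm.bldlog.*` file rather than from an inline string.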

whannah1 avatar Jan 05 '24 17:01 whannah1

Stating the obvious: this is trying to link in a GNU symbol while building with Cray compilers?

sarats avatar Jan 05 '24 17:01 sarats

I'm not sure why building with OpenMP would trigger a link error with this function: https://fortranwiki.org/fortran/show/erfc_scaled

sarats avatar Jan 05 '24 17:01 sarats

@jgfouca had linking errors that turned out to be from the ulimit in his environment being too low.
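
A quick way to see the limits a build process would inherit is to query them from Python. This is only a sketch: the thread does not say which limit was too low, so the three limits listed here are guesses, and the helper names `fmt` and `current_limits` are made up for illustration.

```python
import resource

# Which limit was too low is not stated in the thread; these are guesses.
LIMITS = {
    "stack size": resource.RLIMIT_STACK,
    "open files": resource.RLIMIT_NOFILE,
    "address space": resource.RLIMIT_AS,
}


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {name: (soft, hard)} for the limits listed above."""
    return {name: resource.getrlimit(res) for name, res in LIMITS.items()}


for name, (soft, hard) in current_limits().items():
    print(f"{name}: soft={fmt(soft)} hard={fmt(hard)}")
```

Running this in the same shell environment used for the build gives values comparable to `ulimit -a`.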

rljacob avatar Jan 05 '24 17:01 rljacob

Btw, the SCREAM branches were created to mimic the Frontier configuration used in E3SM/main but perhaps now involve a few new things as software configurations changed. To summarize, there shouldn't be any radical changes but you need to carefully review the edits you are bringing in by merging the scream/cess branch.

Merging the scream branch into your branch is probably bringing in a whole lot of commits which are not needed. Just cherry-pick machine changes that are needed.

sarats avatar Jan 05 '24 17:01 sarats

I think the symbol and gfortran dependency is coming from libcraymath, which requires libgfortran. We forward some math functions to it. I'm surprised it's not already getting linked in, though.

abbotts avatar Jan 05 '24 17:01 abbotts

Perhaps adding the -craype-verbose flag to the link line would show us if it's missing
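
Once the expanded link command is visible, checking it for a given library can be scripted. The sketch below assumes the verbose output contains the underlying compiler command as one shell-style line; the example command is hypothetical and not taken from the Frontier build, and `linked_libraries` is an illustrative helper name.

```python
import shlex


def linked_libraries(command_line):
    """Return the library names passed as -lNAME or '-l NAME' in a command line."""
    tokens = shlex.split(command_line)
    libs = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "-l" and i + 1 < len(tokens):
            # Separated form: "-l NAME"
            libs.append(tokens[i + 1])
            i += 2
            continue
        if tok.startswith("-l") and len(tok) > 2:
            # Joined form: "-lNAME"
            libs.append(tok[2:])
        i += 1
    return libs


# Hypothetical command line, not taken from the actual Frontier build.
example = "clang -o e3sm.exe main.o -L../lnd -llnd -lsci_cray -lmpifort"
libs = linked_libraries(example)
print("gfortran" in libs)
```

If `gfortran` is absent from the list, that would be consistent with the undefined `_gfortran_erfc_scaled_r8` symbol.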

sarats avatar Jan 05 '24 17:01 sarats

@whannah1 , I was able to get past that particular symbol by adding -lgfortran to the link line. I also had to add -lmpifort. I am still struggling with a couple more missing symbols (_dgemm_ and _idmax_).

jgfouca avatar Jan 05 '24 18:01 jgfouca

dgemm and idmax are from BLAS, do you have either libsci (preferred) or reference blas (netlib) linked?

sarats avatar Jan 05 '24 18:01 sarats

@sarats , I was trying to use openblas, but the cess_production branch did need the openblas module, so something is wrong. I'm looking at it.

jgfouca avatar Jan 05 '24 18:01 jgfouca

I don't understand why we need Openblas instead of libsci on a machine like Frontier?

sarats avatar Jan 05 '24 18:01 sarats

@sarats , thanks for the tip, -lsci_cray did resolve those missing symbols. I will try to redo the build without openblas.
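
To keep track of what has worked so far, the fixes reported in this thread can be written down as a small lookup. This is a sketch for bookkeeping only: the table records what was reported here for this Frontier build and is not a general rule, and `suggest_flags` is a made-up helper name. `-lmpifort` is left out because the thread does not say which symbol it resolved.

```python
# Fixes reported in this thread; not a general rule for other machines.
REPORTED_FIXES = {
    "_gfortran_erfc_scaled_r8": "-lgfortran",
    "_dgemm_": "-lsci_cray",
    "_idmax_": "-lsci_cray",
}


def suggest_flags(symbols):
    """Return (flags, unknown): de-duplicated flags for known symbols, and the rest."""
    flags, unknown = [], []
    for sym in symbols:
        flag = REPORTED_FIXES.get(sym)
        if flag is None:
            unknown.append(sym)
        elif flag not in flags:
            flags.append(flag)
    return flags, unknown


flags, unknown = suggest_flags(["_dgemm_", "_idmax_", "_gfortran_erfc_scaled_r8"])
print(flags, unknown)
```

Symbols that come back in `unknown` still need to be traced to a library by hand.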

jgfouca avatar Jan 05 '24 18:01 jgfouca

@jgfouca Can you point me to your branch so I can see your changes?

@sarats good point about the unnecessary changes, I considered doing the merge "by hand" one file at a time when I was hit with a bunch of merge conflicts, but I figured it was quicker to merge in those unnecessary changes instead.

whannah1 avatar Jan 05 '24 18:01 whannah1

For future reference, if you can't use libsci for whatever reason, use AMD's libraries for AMD CPUs, which are based on BLIS, a decent open source alternative.

https://www.amd.com/en/developer/aocl/dense.html

sarats avatar Jan 05 '24 18:01 sarats

@whannah1 You probably need a few module updates in config_machines and maybe some minor things in cmake macros. At least, it would make it easy for me to review. Don't bring in other stuff unless you are really sure it's needed.

sarats avatar Jan 05 '24 18:01 sarats

ok, I'm gonna go back to my basic branch before any machine changes and make some incremental changes based on @jgfouca 's branch

whannah1 avatar Jan 05 '24 18:01 whannah1

@whannah1 , I think I am very close to having a working build if you can wait like 30m.

jgfouca avatar Jan 05 '24 18:01 jgfouca

@jgfouca is this fixed on E3SM master?

rljacob avatar Jan 09 '24 18:01 rljacob

@rljacob , no, it's fixed in SCREAM.

jgfouca avatar Jan 09 '24 19:01 jgfouca