ufs-weather-model

Compile errors on orion and hera with develop

Open JessicaMeixner-NOAA opened this issue 2 years ago • 21 comments

Description

When running the ufs-weather-model develop branch (hash e6da626e086ecd063621278062d8e909c34a6a00), most of the compile jobs fail on orion (@pjpegion and others have gotten similar errors), and compile 011 fails on hera (@MatthewMasarik-NOAA gets the same error).

To Reproduce:

Check out the develop branch, run ./rt.sh -e (from ecflow server on hera).

Additional context

I know that the orion develop worked for me last week. I have not tried to back-track versions yet as I'm curious if this is a larger issue.

Output

Orion:
Code: /work2/noaa/marine/jmeixner/ufs-develop/tests
rt dir: /work2/noaa/marine/jmeixner/stmp/jmeixner/FV3_RT/rt_445868

Main error is not being able to find crtm:

CMake Error at FV3/upp/CMakeLists.txt:48 (find_package):
  By not providing "Findcrtm.cmake" in CMAKE_MODULE_PATH this project has
  asked CMake to find a package configuration file provided by "crtm", but
  CMake did not find one.
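
One way to check whether a crtm package configuration file is visible at all is to search each prefix in CMAKE_PREFIX_PATH for the file names that CMake's config mode looks for (crtm-config.cmake or crtmConfig.cmake). The sketch below is only a rough approximation built on find, not CMake's exact search procedure. The helper name find_cmake_config is hypothetical, and it assumes the loaded modules export CMAKE_PREFIX_PATH, which may not hold for every machine's software stack.

```shell
#!/usr/bin/env bash
# Hypothetical helper: look for <pkg>-config.cmake or <pkg>Config.cmake under
# every directory listed in CMAKE_PREFIX_PATH (colon-separated).
# Assumption: the module environment exports CMAKE_PREFIX_PATH.
find_cmake_config() {
  local pkg="$1" prefix found=1
  local IFS=':'
  for prefix in ${CMAKE_PREFIX_PATH:-}; do
    [ -d "$prefix" ] || continue
    # -iname matches case-insensitively, so crtmConfig.cmake and
    # CRTMConfig.cmake are both reported.
    if find "$prefix" -maxdepth 5 \
         \( -iname "${pkg}-config.cmake" -o -iname "${pkg}Config.cmake" \) \
         -print | grep .; then
      found=0
    fi
  done
  # Exit status 0 if at least one config file was printed, 1 otherwise.
  return "$found"
}
```

For example, after loading the machine's modules, find_cmake_config crtm || echo "no crtm config file under CMAKE_PREFIX_PATH" gives a quick hint as to whether the crtm module was actually loaded in the build environment.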

Hera:
Code: /scratch1/NCEPDEV/climate/Jessica.Meixner/ufs-weather-model/tests
rt dir: /scratch1/NCEPDEV/stmp2/Jessica.Meixner/FV3_RT/rt_5239

Main error:

Found Python: /apps/spack/linux-centos7-x86_64/gcc-9.2.0/python-3.7.6-gi3efxgcxqilpjehkqnxrriedsuedoqu/bin/python3.7
Calling CCPP code generator (ccpp_prebuild.py) for all available suites ...
+ OMP_NUM_THREADS=1
+ make -j 8 VERBOSE=1
/scratch1/NCEPDEV/stmp2/Jessica.Meixner/FV3_RT/rt_5239/compile_011/build_fv3_011/FV3/ccpp/physics/ccpp_static_api.F90(5012): error #6405: The same named entity from different modules and/or program units cannot be referenced.   [CDATA]
               ierr = FV3_GFS_v16_coupled_p8_sfcocn_time_vary_tsfinal_cap(cdata=cdata)
--------------------------------------------------------------------------------^
compilation aborted for /scratch1/NCEPDEV/stmp2/Jessica.Meixner/FV3_RT/rt_5239/compile_011/build_fv3_011/FV3/ccpp/physics/ccpp_static_api.F90 (code 1)
make[2]: *** [FV3/ccpp/CMakeFiles/fv3ccpp.dir/physics/ccpp_static_api.F90.o] Error 1
make[1]: *** [FV3/ccpp/CMakeFiles/fv3ccpp.dir/all] Error 2
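
When the build log is long, it can help to list only the compiler diagnostics. The following is a minimal sketch, assuming the log has been saved to a file and that errors follow the file(line): error #NNNN: layout seen in the output above. The function name list_compile_errors is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<error number> <file>:<line>" for each
# compiler error line of the form "file(line): error #NNNN: message"
# found in a saved build log. All other log lines are ignored.
list_compile_errors() {
  sed -n -E 's/^(.*)\(([0-9]+)\): error #([0-9]+):.*$/\3 \1:\2/p' "$1"
}
```

Running list_compile_errors on the saved log for compile 011 (wherever that log is stored on your system) should reduce the output above to a single line naming error 6405 and ccpp_static_api.F90:5012.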

JessicaMeixner-NOAA, Oct 05 '22 17:10

@jkbk2004 I have retested on both hera and orion and am still having this same issue.

JessicaMeixner-NOAA, Oct 06 '22 14:10

I can confirm the same error on Orion when I call tests/compile.sh directly to attempt a build for global-workflow.

WalterKolczynski-NOAA, Oct 08 '22 07:10

@JessicaMeixner-NOAA I reset the permissions on the whole hpc-stack directory on orion. Can you give it a try?

jkbk2004, Oct 10 '22 12:10

Hi @jkbk2004, @JessicaMeixner-NOAA is away until Wednesday. I can test on orion and let you know the outcome.

MatthewMasarik-NOAA, Oct 11 '22 02:10

@jkbk2004 I just tested on orion and found I get the same error.

MatthewMasarik-NOAA, Oct 11 '22 12:10

@MatthewMasarik-NOAA @BrianCurtis-NOAA @ChunxiZhang-NOAA @zach1221 can you take a look at /work/noaa/epic-ps/jongkim/4debug? As err.log shows, I am able to load the modules, including crtm, without problems. Can you try running the job_card I put there, so that we can find out whether module loading works for everyone?

jkbk2004, Oct 12 '22 12:10

@jkbk2004 I copied that directory and submitted the job_card. Here is the output of err.log (out.log is empty):

[matma@Orion-login-1 4debug]$ cat err.log 
++ date +%s
+ echo -n ' 1665579328,'
+ set +x
Lmod has detected the following error: The following module(s) are unknown:
"ufs_common"

Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
  $ module --ignore-cache load "ufs_common"

Also make sure that all modulefiles written in TCL start with the string
#%Module

P.S. I don't have access to account=nems, so I set account=marine-cpu.

MatthewMasarik-NOAA, Oct 12 '22 13:10

make sure you module use modulefiles && module load ufs_<machine>.<compiler> (i.e. ufs_hera.intel)

BrianCurtis-NOAA, Oct 12 '22 13:10

> make sure you module use modulefiles && module load ufs_<machine>.<compiler> (i.e. ufs_hera.intel)

Is this message for me, @BrianCurtis-NOAA?

If I try that (module use modulefiles && module load ufs_orion.intel) in the directory I copied, I get an Lmod error saying "ufs_orion.intel" is unknown.
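
A quick way to see why Lmod reports a module as unknown is to check whether any directory on MODULEPATH actually contains a matching modulefile. The following is a minimal sketch that only approximates Lmod's lookup by checking for name.lua, a plain file called name, or a directory called name. The helper name find_modulefile is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first MODULEPATH directory that provides
# the named module, or return 1 if none does.
find_modulefile() {
  local name="$1" dir
  local IFS=':'
  for dir in ${MODULEPATH:-}; do
    # A modulefile may be a Lua file (name.lua), a TCL file (name),
    # or a directory of versions (name/). This is an approximation.
    if [ -e "$dir/$name.lua" ] || [ -e "$dir/$name" ]; then
      printf '%s\n' "$dir"
      return 0
    fi
  done
  return 1
}
```

If find_modulefile ufs_orion.intel prints nothing after module use modulefiles, the directory that was added probably does not contain the ufs modulefiles, for example because the command was run outside the top level of a ufs-weather-model clone.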

MatthewMasarik-NOAA, Oct 12 '22 13:10

@BrianCurtis-NOAA @jkbk2004 @MatthewMasarik-NOAA I am back from leave and can try this again today. Brian, I had a question about the module load you said we should do, because I don't have to do this on other machines and I've never had to do this on orion before.

JessicaMeixner-NOAA, Oct 12 '22 13:10

@JessicaMeixner-NOAA @MatthewMasarik-NOAA @jkbk2004 Sorry for the confusion. While testing a build of the develop branch on Orion, I found that I had to load git/2.28.0 before git would pull everything cleanly without error.

For more context on the module use and module load, here's how I set up my environment for running the RT.

git clone git@github.com:ufs-community/ufs-weather-model --recursive
cd ufs-weather-model
module use modulefiles
module load ufs_<machine>.<compiler>
cd tests
./rt.sh -e > rt.out 2>&1 &

On Orion, at least, a module load git/2.28.0 helped git pull successfully, in case that was an issue you saw as well.

BrianCurtis-NOAA, Oct 12 '22 13:10

I've tried loading the git module last week and that did not solve my issue either. I've never had to load the ufs modules for any other machine...

JessicaMeixner-NOAA, Oct 12 '22 14:10

> I've tried loading the git module last week and that did not solve my issue either. I've never had to load the ufs modules for any other machine...

rt.sh should automatically do it, yes.

BrianCurtis-NOAA, Oct 12 '22 14:10

I was able to run on orion this morning with the latest ufs version. I'll try hera now.

JessicaMeixner-NOAA, Oct 12 '22 15:10

Compile 11 on hera is still failing for me

JessicaMeixner-NOAA, Oct 12 '22 18:10

Same thing for me: compiling with the DEBUG option fails on Hera.

Found Python: /apps/spack/linux-centos7-x86_64/gcc-9.2.0/python-3.7.6-gi3efxgcxqilpjehkqnxrriedsuedoqu/bin/python3.7
Calling CCPP code generator (ccpp_prebuild.py) for all available suites ...
+ OMP_NUM_THREADS=1
+ make -j 8 VERBOSE=1
/scratch1/NCEPDEV/stmp2/Ratko.Vasic/FV3_RT/rt_260589/compile_001/build_fv3_001/FV3/ccpp/physics/ccpp_static_api.F90(5012): error #6405: The same named entity from different modules and/or program units cannot be referenced.   [CDATA]
               ierr = FV3_GFS_v16_coupled_p8_sfcocn_time_vary_tsfinal_cap(cdata=cdata)
--------------------------------------------------------------------------------^
compilation aborted for /scratch1/NCEPDEV/stmp2/Ratko.Vasic/FV3_RT/rt_260589/compile_001/build_fv3_001/FV3/ccpp/physics/ccpp_static_api.F90 (code 1)
make[2]: *** [FV3/ccpp/CMakeFiles/fv3ccpp.dir/physics/ccpp_static_api.F90.o] Error 1
make[1]: *** [FV3/ccpp/CMakeFiles/fv3ccpp.dir/all] Error 2
make: *** [all] Error 2
I cloned a fresh copy of ufs-weather-model.

RatkoVasic-NOAA, Oct 19 '22 16:10

I tried again today and compile 011 is still failing for me on hera. Orion has been okay for me, and I was going to test it again, but there are /work issues.

JessicaMeixner-NOAA, Oct 19 '22 16:10

@JessicaMeixner-NOAA One option is to reduce the number of CCPP SDFs in the FV3/ccpp/suites directory to only those actually used by the regression tests. The failing test does not explicitly list suites, so it tries to build them all. There are currently more than 90 suite definitions there, and the regression tests use only about a third of them. Can you run this script, /scratch2/NCEPDEV/fv3-cam/Dusan.Jovic/suites_run.sh, in the FV3/ccpp/suites directory of your working copy, and then rerun that test?
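
The contents of suites_run.sh are not shown in this thread, so the following is only a hypothetical sketch of the same idea, not that script. It assumes SDFs are named suite_<name>.xml and that the names of the suites to keep are listed in a plain-text file, one per line. The function name prune_suites and all three arguments are illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: keep only the suites named in a list file and move
# every other suite_*.xml into a backup directory instead of deleting it.
# Usage: prune_suites <suites_dir> <keep_list_file> <backup_dir>
prune_suites() {
  local suites_dir="$1" keep_list="$2" backup_dir="$3" f name
  mkdir -p "$backup_dir"
  for f in "$suites_dir"/suite_*.xml; do
    [ -e "$f" ] || continue        # glob matched nothing
    name=$(basename "$f" .xml)     # strip directory and .xml suffix
    name=${name#suite_}            # strip the suite_ prefix
    # -x: whole-line match, -F: fixed string, -q: quiet
    grep -qxF "$name" "$keep_list" || mv "$f" "$backup_dir"/
  done
}
```

Moving the files rather than deleting them makes it easy to restore a suite if a test turns out to need it. In a git working copy, git status will show the moved SDFs as deleted until they are restored.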

DusanJovic-NOAA, Oct 19 '22 22:10

I just ran with 90 of the 91 SDFs and it worked. I excluded only the first one on the list (suite_FV3_CPT_v0.xml). I still have no idea why this is happening, and only to a few of us.

RatkoVasic-NOAA, Oct 19 '22 23:10

suite_FV3_CPT_v0.xml is a deprecated SDF. Reducing the number of SDFs in the suites directory is a good option, and we could make it happen soon.

ChunxiZhang-NOAA, Oct 20 '22 14:10

@DusanJovic-NOAA - after running your script first, the regression tests succeeded. Like @RatkoVasic-NOAA, I'm still wondering why this is happening to only a few of us.

JessicaMeixner-NOAA, Oct 20 '22 19:10

The git/2.28.0 module requirement (on orion) is case-by-case. If there is a git clone issue, loading the newer git version clearly resolves it. I am closing this issue. If the problem persists, we can re-open it.

jkbk2004, Mar 09 '23 12:03