ufs-weather-model
Compile errors on orion and hera with develop
Description
When running the ufs-weather-model develop branch (hash e6da626e086ecd063621278062d8e909c34a6a00), I get a failure for most of the compile jobs on orion (@pjpegion and others have gotten similar errors) and for compile 011 on hera (@MatthewMasarik-NOAA gets the same error).
To Reproduce:
Check out the develop branch, run ./rt.sh -e (from ecflow server on hera).
Additional context
I know that the orion develop worked for me last week. I have not tried to back-track versions yet as I'm curious if this is a larger issue.
Output
Orion:
Code: /work2/noaa/marine/jmeixner/ufs-develop/tests
rt dir: /work2/noaa/marine/jmeixner/stmp/jmeixner/FV3_RT/rt_445868
Main error is not being able to find crtm:
CMake Error at FV3/upp/CMakeLists.txt:48 (find_package):
By not providing "Findcrtm.cmake" in CMAKE_MODULE_PATH this project has
asked CMake to find a package configuration file provided by "crtm", but
CMake did not find one.
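One way to narrow down a find_package failure like this is to check whether a crtm package configuration file is visible under the prefixes CMake searches. The sketch below is only a diagnostic aid: `find_crtm_config` is a hypothetical helper, not part of ufs-weather-model, and it assumes the loaded modules export `crtm_ROOT` or `CMAKE_PREFIX_PATH` (whether the hpc-stack modules do so is an assumption).

```shell
#!/usr/bin/env bash
# Hypothetical helper: search each prefix for a crtm CMake config file.
# With no argument it falls back to crtm_ROOT and CMAKE_PREFIX_PATH
# (assumed to be exported by the loaded modules).
find_crtm_config() {
  local prefixes="${1:-${crtm_ROOT:-}:${CMAKE_PREFIX_PATH:-}}"
  local prefix hits status=1
  local -a dirs
  # Split the list on ':' (environment style) or ';' (CMake list style).
  IFS=':;' read -r -a dirs <<< "$prefixes"
  for prefix in "${dirs[@]}"; do
    [ -d "$prefix" ] || continue
    # Config-mode find_package looks for <name>-config.cmake or <Name>Config.cmake.
    hits=$(find "$prefix" -maxdepth 6 -type f \
      \( -name 'crtm-config.cmake' -o -name 'crtmConfig.cmake' \) 2>/dev/null)
    if [ -n "$hits" ]; then
      printf '%s\n' "$hits"
      status=0
    fi
  done
  return "$status"
}
```

Running `find_crtm_config || echo "no crtm config found"` after loading the build modules would show whether the file is reachable at all. An empty result would point at the module environment or at permissions on the install tree rather than at the model's CMake files.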
Hera:
Code: /scratch1/NCEPDEV/climate/Jessica.Meixner/ufs-weather-model/tests
rt dir: /scratch1/NCEPDEV/stmp2/Jessica.Meixner/FV3_RT/rt_5239
Main error:
Found Python: /apps/spack/linux-centos7-x86_64/gcc-9.2.0/python-3.7.6-gi3efxgcxqilpjehkqnxrriedsuedoqu/bin/python3.7
Calling CCPP code generator (ccpp_prebuild.py) for all available suites ...
+ OMP_NUM_THREADS=1
+ make -j 8 VERBOSE=1
/scratch1/NCEPDEV/stmp2/Jessica.Meixner/FV3_RT/rt_5239/compile_011/build_fv3_011/FV3/ccpp/physics/ccpp_static_api.F90(5012): error #6405: The same named entity from different modules and/or program units cannot be referenced. [CDATA]
ierr = FV3_GFS_v16_coupled_p8_sfcocn_time_vary_tsfinal_cap(cdata=cdata)
--------------------------------------------------------------------------------^
compilation aborted for /scratch1/NCEPDEV/stmp2/Jessica.Meixner/FV3_RT/rt_5239/compile_011/build_fv3_011/FV3/ccpp/physics/ccpp_static_api.F90 (code 1)
make[2]: *** [FV3/ccpp/CMakeFiles/fv3ccpp.dir/physics/ccpp_static_api.F90.o] Error 1
make[1]: *** [FV3/ccpp/CMakeFiles/fv3ccpp.dir/all] Error 2
@jkbk2004 I have retested on both hera and orion and am still having this same issue.
Can confirm the same error on Orion when I call tests/compile.sh directly to attempt build for global-workflow.
@JessicaMeixner-NOAA I reset the permissions of the whole hpc-stack directory on orion. Can you give it a try?
Hi @jkbk2004, @JessicaMeixner-NOAA is away until Wednesday. I can test on orion and let you know the outcome.
@jkbk2004 I just tested on orion and found I get the same error.
@MatthewMasarik-NOAA @BrianCurtis-NOAA @ChunxiZhang-NOAA @zach1221 can you take a look at /work/noaa/epic-ps/jongkim/4debug? As err.log shows, I am able to load the modules, including crtm, without problems. Can you try running the jobs_card I put there, so that we can check whether module loading works for everyone?
@jkbk2004 I copied that directory and submitted the job_card. Here is the output of err.log (out.log is empty):
[matma@Orion-login-1 4debug]$ cat err.log
++ date +%s
+ echo -n ' 1665579328,'
+ set +x
Lmod has detected the following error: The following module(s) are unknown:
"ufs_common"
Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
$ module --ignore-cache load "ufs_common"
Also make sure that all modulefiles written in TCL start with the string
#%Module
P.S. I don't have account=nems, so I set account=marine-cpu.
Make sure you run module use modulefiles && module load ufs_<machine>.<compiler> (e.g. ufs_hera.intel).
Is this message for me, @BrianCurtis-NOAA?
If I try that (module use modulefiles && module load ufs_orion.intel) in the directory I copied, I get an Lmod error saying "ufs_orion.intel" is unknown.
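Because `module use modulefiles` takes a path relative to the current directory, one possible cause of an "unknown module" error is that the directory has no `modulefiles` subdirectory, or that it lacks a file for the requested machine and compiler. A minimal sketch of a pre-check follows. `has_modulefile` and `load_ufs_module` are hypothetical helpers, and the assumption that modulefiles are named `<module>` or `<module>.lua` directly inside that directory is mine, not something confirmed in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if DIR contains a modulefile for NAME.
# Assumes files are named NAME or NAME.lua directly inside DIR.
has_modulefile() {
  local dir="$1" name="$2"
  [ -f "$dir/$name" ] || [ -f "$dir/$name.lua" ]
}

# Hypothetical wrapper: load NAME only if its modulefile exists in DIR,
# otherwise report the problem on stderr and return non-zero.
load_ufs_module() {
  local dir="$1" name="$2"
  if ! has_modulefile "$dir" "$name"; then
    echo "no modulefile for $name in $dir" >&2
    return 1
  fi
  # Pass an absolute path so the result does not depend on the current directory.
  module use "$(cd "$dir" && pwd)" && module load "$name"
}
```

For example, `load_ufs_module modulefiles ufs_orion.intel` run from the top of a clone would either load the module or report that the file is missing, which separates a missing file from an Lmod cache problem.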
@BrianCurtis-NOAA @jkbk2004 @MatthewMasarik-NOAA I am back from leave and can try this again today. Brian, I had a question about the module load you said we should do, because I don't have to do this on other machines and I've never had to do it on orion before.
@JessicaMeixner-NOAA @MatthewMasarik-NOAA @jkbk2004 Sorry for the confusion. I ran into an issue while testing a build of the develop branch on Orion: I had to load git/2.28.0 before git would pull everything cleanly without error.
For more context on the module use and module load, here's how I set up my environment for running RT:
git clone git@github.com:ufs-community/ufs-weather-model --recursive
cd ufs-weather-model
module use modulefiles
module load ufs_<machine>.<compiler>
cd tests
./rt.sh -e > rt.out 2>&1 &
On Orion, at least, a module load git/2.28.0 helped git pull successfully, in case that was an issue you saw as well.
I've tried loading the git module last week and that did not solve my issue either. I've never had to load the ufs modules for any other machine...
rt.sh should automatically do it, yes.
I was able to run on orion this morning with the latest ufs version. I'll try hera now.
Compile 11 on hera is still failing for me
Same thing for me, it fails in compiling with DEBUG option on Hera.
Found Python: /apps/spack/linux-centos7-x86_64/gcc-9.2.0/python-3.7.6-gi3efxgcxqilpjehkqnxrriedsuedoqu/bin/python3.7
Calling CCPP code generator (ccpp_prebuild.py) for all available suites ...
+ OMP_NUM_THREADS=1
+ make -j 8 VERBOSE=1
/scratch1/NCEPDEV/stmp2/Ratko.Vasic/FV3_RT/rt_260589/compile_001/build_fv3_001/FV3/ccpp/physics/ccpp_static_api.F90(5012): error #6405: The same named entity from different modules and/or program units cannot be referenced. [CDATA]
ierr = FV3_GFS_v16_coupled_p8_sfcocn_time_vary_tsfinal_cap(cdata=cdata)
--------------------------------------------------------------------------------^
compilation aborted for /scratch1/NCEPDEV/stmp2/Ratko.Vasic/FV3_RT/rt_260589/compile_001/build_fv3_001/FV3/ccpp/physics/ccpp_static_api.F90 (code 1)
make[2]: *** [FV3/ccpp/CMakeFiles/fv3ccpp.dir/physics/ccpp_static_api.F90.o] Error 1
make[1]: *** [FV3/ccpp/CMakeFiles/fv3ccpp.dir/all] Error 2
make: *** [all] Error 2
I cloned a fresh copy of ufs-weather-model.
I tried again today and compile 011 is still failing for me on hera. Orion has been working for me and I was going to test there again, but there are /work issues.
@JessicaMeixner-NOAA One option is to reduce the number of CCPP SDFs in the FV3/ccpp/suites directory to only those actually used by the regression tests. The failing test does not explicitly list suites, so it tries to build them all. Currently there are more than 90 suite definitions there. Not all of them are used; the regression tests use only about a third. Can you run this script, /scratch2/NCEPDEV/fv3-cam/Dusan.Jovic/suites_run.sh, in the FV3/ccpp/suites directory of your working copy and then rerun that test?
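For anyone without access to that path, the idea can be sketched as follows. This is not the suites_run.sh script mentioned above, whose contents are not shown in this thread. `prune_suites` is a hypothetical helper that assumes SDFs are named `suite_<name>.xml` and that the suites to keep are passed explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only the listed suites in SUITES_DIR and move
# every other suite_*.xml into BACKUP_DIR (outside the suites directory).
# Usage: prune_suites SUITES_DIR BACKUP_DIR SUITE_NAME...
prune_suites() {
  local suites_dir="$1" backup_dir="$2"
  shift 2
  local file name wanted keep
  mkdir -p "$backup_dir"
  for file in "$suites_dir"/suite_*.xml; do
    [ -e "$file" ] || continue          # glob matched nothing
    name=$(basename "$file" .xml)
    name=${name#suite_}                 # suite_FOO.xml -> FOO
    keep=no
    for wanted in "$@"; do
      if [ "$name" = "$wanted" ]; then
        keep=yes
        break
      fi
    done
    if [ "$keep" = no ]; then
      mv "$file" "$backup_dir/"
    fi
  done
}
```

A call such as `prune_suites FV3/ccpp/suites /tmp/unused_suites FV3_GFS_v16` would leave only the named suite in place; the suite name here is a placeholder for whatever the regression tests actually need. Files are moved rather than deleted, so they can be copied back or restored from git in a working copy.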
I just ran with 90 of the 91 SDFs and it worked. I excluded only the first one on the list (suite_FV3_CPT_v0.xml). I still have no idea why this is happening, and why only to a few of us.
suite_FV3_CPT_v0.xml is a deprecated SDF. Reducing the number of SDFs in the suites directory is a good option, and it could happen soon.
@DusanJovic-NOAA - after running your script first, the regression tests succeeded. Like @RatkoVasic-NOAA, I'm wondering why this is happening to only a few of us.
The git/2.28.0 module requirement (on orion) is case-by-case. If there is a git clone issue, loading the newer version clearly resolves it. I am closing this issue; if the problem persists, we can re-open it.