ufs-weather-model
Updating hpc-stack modules and miniconda locations for Hera, Gaea, Cheyenne, Orion, Jet
Description
Update the locations of the hpc-stack modules and miniconda3 for compiling and running the UFS-weather-model on NOAA HPC systems such as Hera, Gaea, Cheyenne, Orion, and Jet. The modules are installed under the role.epic account and placed in a common EPIC-managed space on each system. Gaea also uses an Lmod installed locally in the same common location (ufs-srweather-app/PR-352, ufs-weather-app/PR-353), and a script needs to be run to initialize Lmod before loading the modulefile ufs_gaea.intel.lua. While the ufs-weather-model uses python to a lesser extent, the UFS-srweather-app relies heavily on a conda environment.
For ease of maintenance of the libraries on the NOAA HPC systems, a transition to the new location of the modules built for both the ufs-weather-model and the ufs-srweather-app is needed.
Solution
The ufs-weather-model repository is to be updated with the new versions of miniconda and the hpc libraries.
Updated installation locations have been used to load the modules listed in /ufs-weather-model/modulefiles/ufs_common
and to build the ufs model binaries.
Hera GNU compilers are included as well (see the updates below).
UPD. 10/20/2022: Modules for Hera and Jet have been built for the already-tested compiler intel/2022.1.2. Modules for the compiler/impi intel/2022.2.0 also remain and can be used when an upgrade is needed.
UPD. 10/24/2022: Modules for the Hera GNU compilers (9.2.0, 10.2.0) with different mpich/openmpi combinations, as well as an updated netcdf/4.9.0, have been prepared.
Cheyenne's Lmod was upgraded to v8.7.13 system-wide after system maintenance on 10/21/2022.
Alternatives
Alternative solutions could be to have the hpc libraries and modules built in separate locations for the ufs-weather-model and the ufs-srweather-app. The request from EPIC management, however, was to use a common location for all the libraries.
Related to
PR-419 in the ufs-srweather-app already exists, and a new PR will be made to the current repo.
- needed by https://github.com/ufs-community/ufs-srweather-app/pull/419 - marked as top priority
- needed by https://github.com/ufs-community/ufs-weather-model/pull/???
Updated locations of the conda/python and hpc modules, and how to load them on each system:
Hera python/miniconda:
module use /scratch1/NCEPDEV/nems/role.epic/miniconda3/modulefiles
module load miniconda3/4.12.0
Hera intel/2022.1.2 + impi/2022.1.2:
module load intel/2022.1.2
module load impi/2022.1.2
module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/intel-2022.1.2/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1.2
module load hpc-impi/2022.1.2
Hera intel/2022.1.2 + impi/2022.1.2 + netcdf-c/4.9.0:
module load intel/2022.1.2
module load impi/2022.1.2
module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/intel-2022.1.2_ncdf49/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1.2
module load hpc-impi/2022.1.2
Hera gnu/9.2 + mpich/3.3.2:
module load gnu/9.2
module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2/modulefiles/stack
module load hpc/1.2.0
module load hpc-gnu/9.2
module load mpich/3.3.2
module load hpc-mpich/3.3.2
Hera gnu/10.2 + mpich/3.3.2:
module use /scratch1/NCEPDEV/nems/role.epic/gnu/modulefiles
module load gnu/10.2.0
module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-10.2/modulefiles/stack
module load hpc/1.2.0
module load hpc-gnu/10.2
module load mpich/3.3.2
module load hpc-mpich/3.3.2
Hera gnu/10.2 + openmpi/4.1.2:
module use /scratch1/NCEPDEV/nems/role.epic/gnu/modulefiles
module load gnu/10.2.0
module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-10.2_openmpi/modulefiles/stack
module load hpc/1.2.0
module load hpc-gnu/10.2
module load openmpi/4.1.2
module load hpc-openmpi/4.1.2
Hera gnu/9.2 + mpich/3.3.2 + netcdf-c/4.9.0:
module load gnu/9.2
module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2_ncdf49/modulefiles/stack
module load hpc/1.2.0
module load hpc-gnu/9.2
module load mpich/3.3.2
module load hpc-mpich/3.3.2
Hera gnu/10.2 + mpich/3.3.2 + netcdf-c/4.9.0:
module use /scratch1/NCEPDEV/nems/role.epic/gnu/modulefiles
module load gnu/10.2.0
module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-10.2_ncdf49/modulefiles/stack
module load hpc/1.2.0
module load hpc-gnu/10.2
module load mpich/3.3.2
module load hpc-mpich/3.3.2
Gaea miniconda:
module use /lustre/f2/dev/role.epic/contrib/modulefiles
module load miniconda3/4.12.0
Gaea intel:
Lmod initialization on Gaea needs to be done first by sourcing the following script:
source /lustre/f2/dev/role.epic/contrib/Lmod_init.sh
module use /lustre/f2/dev/role.epic/contrib/modulefiles
module load miniconda3/4.12.0
module use /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2021.3.0/modulefiles/stack
module load hpc/1.2.0
module load intel/2021.3.0
module load hpc-intel/2021.3.0
module load hpc-cray-mpich/7.7.11
Cheyenne miniconda:
module use /glade/work/epicufsrt/contrib/miniconda3/modulefiles
module load miniconda3/4.12.0
Cheyenne intel:
module use /glade/work/epicufsrt/contrib/miniconda3/modulefiles
module load miniconda3/4.12.0
module use /glade/work/epicufsrt/contrib/hpc-stack/intel2022.1/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1
module load hpc-mpt/2.25
Cheyenne gnu:
module use /glade/work/epicufsrt/contrib/miniconda3/modulefiles
module load miniconda3/4.12.0
module use /glade/work/epicufsrt/contrib/hpc-stack/gnu11.2.0/modulefiles/stack
module load hpc/1.2.0
module load hpc-gnu/11.2.0
module load hpc-mpt/2.25
Orion miniconda:
module use /work/noaa/epic-ps/role-epic-ps/miniconda3/modulefiles
module load miniconda3/4.12.0
Orion intel:
module use /work/noaa/epic-ps/role-epic-ps/miniconda3/modulefiles
module load miniconda3/4.12.0
module use /work/noaa/epic-ps/role-epic-ps/hpc-stack/libs/intel-2022.1.2/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1.2
module load hpc-impi/2022.1.2
Jet miniconda:
module use /mnt/lfs4/HFIP/hfv3gfs/role.epic/miniconda3/modulefiles
module load miniconda3/4.12.0
Jet intel:
module use /mnt/lfs4/HFIP/hfv3gfs/role.epic/miniconda3/modulefiles
module load miniconda3/4.12.0
module use /mnt/lfs4/HFIP/hfv3gfs/role.epic/hpc-stack/libs/intel-2022.1.2/modulefiles/stack
module load hpc/1.2.0
module load hpc-intel/2022.1.2
module load hpc-impi/2022.1.2
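All of the recipes above follow the same pattern: `module use` a stack path, then `module load` the compiler, MPI, and hpc meta-modules. A common failure mode is a typo in one of the long /scratch or /lustre paths, which `module use` accepts silently. The sketch below is a hypothetical helper (not part of any of these stacks) that makes such a typo fail loudly; the `module use` line itself is commented out since it only works where Lmod is available.

```shell
#!/bin/sh
# Hypothetical guard: verify a modulefile tree exists before extending
# MODULEPATH, so a mistyped stack path fails immediately instead of
# producing "module unknown" errors later.
safe_module_use() {
  dir="$1"
  if [ -d "$dir" ]; then
    echo "adding to MODULEPATH: $dir"
    # module use "$dir"   # uncomment on a system with Lmod available
    return 0
  fi
  echo "no such modulefile tree: $dir" >&2
  return 1
}

safe_module_use /tmp   # prints: adding to MODULEPATH: /tmp
```

On an HPC system this would be called with the stack path from the relevant recipe, e.g. `safe_module_use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/intel-2022.1.2/modulefiles/stack`.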
NB:
There were comments in ufs-srweather-app/PR-419 suggesting rolling back to lower compiler versions for Cheyenne gnu (use 11.2.0 instead of 12.1.0), Hera intel (use intel/2021.1.2 instead of 2022.2.0), and Jet intel (use intel/2021.1.2 instead of intel/2022.2.0).
Either way could be OK for the SRW, and the libraries would be built for the lower-version compilers as suggested.
@natalie-perlin Can you make sure all compiler and library versions are confirmed against https://github.com/ufs-community/ufs-weather-model/tree/develop/modulefiles ?
@ulmononian can we coordinate about intel/gnu/openmpi to hera on this issue?
@jkbk2004 The PRs to address the modulefile changes for the ufs-weather-model have not been made yet; only for the ufs-srweather-app.
The modulefiles for Hera and Jet have been built to use intel/2022.1.2, not the latest 2022.2.0. Updating the info in the top comment of this issue.
Can somebody please build the gnu hpc-stack on Hera and Cheyenne using openmpi? Thanks.
@DusanJovic-NOAA @jkbk2004 here is a build i did in the past w/ gnu-9.2.0 & openmpi-3.1.4 on hera: module use /scratch1/NCEPDEV/stmp2/Cameron.Book/hpcs_work/libs/gnu/stack_noaa/modulefiles/stack
Thanks @ulmononian. I also have the gnu/openmpi stack built in my own space. What I was asking is the installation in officially supported location so that we can update modulefiles in develop branch.
@ulmononian would you please also create an issue hpc-stack on upp repo (https://github.com/noaa-emc/upp). Also other workflow (global workflow, HAFS workflow) may also be impacted by this change. @WenMeng-NOAA @aerorahul @WalterKolczynski-NOAA @KateFriedman-NOAA @BinLiu-NOAA FYI.
@junwang-noaa @ulmononian @WenMeng-NOAA @aerorahul @WalterKolczynski-NOAA @KateFriedman-NOAA @BinLiu-NOAA @natalie-perlin I noticed that Kyle's old stack installations are still used in other applications and some machines. I started a coordination on EPIC side. It may take a week or two to finish the full transition. I want to combine this issue with the other library update follow-ups on-going: netcdf/esmf, etc.
@jkbk2004 Can you install g2tmpl/1.10.2 for the UPP? Thanks!
@WenMeng-NOAA g2tmpl/1.10.2 is available (current ufs-wm modulefiles), but a backward compatibility issue was captured in issue #1441.
@DusanJovic-NOAA - hpc-stack with gnu/9.2.0+mpich/3.3.2 and gnu/10.2.0+mpich/3.3.2 have been installed on Hera under role.epic account (EPIC-managed space). Testing them with ufs-weather-model-RTs, and plan to include these Hera-gnu into the module updates.
The stack installation locations are:
/scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2/
/scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-10.2/
Exact modifications to the modulefiles (paths needed for finding all the modules) will be listed in a subsequent PR(s).
@natalie-perlin Is anyone going to provide gnu/openmpi stack?
@ulmononian can you install gnu/openmpi parallel to the location above?
@jkbk2004 - do we need all four possible combinations of compilers (gnu/9.2.0, gnu/10.2.0) with mpich/3.3.2 and openmpi/4.1.2?
@natalie-perlin I think @ulmononian has installed gnu10.1/openmpi. That should be good enough as a starting point for the openmpi option. But it makes sense to make the openmpi installation available under the role account path as well.
@jkbk2004, @ulmononian - HPC modules using different versions of gnu, mpich, and openmpi have been installed, plus new versions of netcdf 4.9.0 (netcdf-c/4.9.0, netcdf-fortran/4.6.0, netcdf-cxx/4.3.1), for the following combinations:
gnu/9.2.0 + mpich/3.3.2 + netcdf/4.7.4
gnu/9.2.0 + mpich/3.3.2 + netcdf/4.9.0
gnu/10.2.0 + mpich/3.3.2 + netcdf/4.7.4
gnu/10.2.0 + mpich/3.3.2 + netcdf/4.9.0
gnu/10.2.0 + openmpi/4.1.2 + netcdf/4.7.4
The updated stack locations are listed in the top comment of this issue (#1465).
Added a stack build with the intel compiler and netcdf-4.9 on Hera (see the list of locations in the top comment).
@DusanJovic-NOAA @jkbk2004 @natalie-perlin i will install the stack w/ gnu-9.2 and openmpi-3.1.4 here /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs shortly, as well as w/ gnu-10.1 & openmpi-3.1.4 in the official location.
@DusanJovic-NOAA @jkbk2004 @natalie-perlin hpc-stack built w/ gnu-9.2 and openmpi-3.1.4 was installed successfully here: /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2_openmpi-3.1.4.
I tried running the regression test using gnu-9.2_openmpi-3.1.4 stack but it failed because the debug version of esmf library is missing:
$ module load ufs_hera.gnu_debug
Lmod has detected the following error: The following module(s) are
unknown: "esmf/8.3.0b09-debug"
Please check the spelling or version number. Also try "module spider ..."
It is also possible your cache file is out-of-date; it may help to try:
$ module --ignore_cache load "esmf/8.3.0b09-debug"
$ ls -l /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2_openmpi-3.1.4/modulefiles/mpi/gnu/9.2.0/openmpi/3.1.4/esmf/
total 4
-rw-r--r-- 1 role.epic nems 1365 Oct 28 23:20 8.3.0b09.lua
lrwxrwxrwx 1 role.epic nems 12 Oct 28 23:20 default -> 8.3.0b09.lua
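Before hitting this kind of Lmod error, it can help to list which variants of a library a stack tree actually provides. The sketch below is a generic helper (the `list_variants` name is made up, and the `find -path` pattern assumes the hpc-stack layout shown in the `ls` output above):

```shell
#!/bin/sh
# Sketch: list the installed variants of one library under a modulefile
# tree, e.g. to spot a missing "-debug" esmf build before "module load".
list_variants() {
  tree="$1"; lib="$2"
  find "$tree" -type f -name '*.lua' -path "*/${lib}/*" 2>/dev/null \
    | sed 's|.*/||; s|\.lua$||' | sort
}
```

For example, `list_variants /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2_openmpi-3.1.4/modulefiles esmf` would have shown only `8.3.0b09`, with no `-debug` variant.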
I also tried the 'gnu-10.2_openmpi' stack, but when I load it, it does not actually load the gnu/10.2 module. I see:
$ module list
Currently Loaded Modules:
1) miniconda3/3.7.3 10) libpng/1.6.37 19) g2tmpl/1.10.0
2) sutils/default 11) hdf5/1.10.6 20) ip/3.3.3
3) cmake/3.20.1 12) netcdf/4.7.4 21) sp/2.3.3
4) hpc/1.2.0 13) pio/2.5.7 22) w3emc/2.9.2
5) hpc-gnu/10.2 14) esmf/8.3.0b09 23) gftl-shared/v1.5.0
6) openmpi/4.1.2 15) fms/2022.01 24) mapl/2.22.0-esmf-8.3.0b09
7) hpc-openmpi/4.1.2 16) bacio/2.4.1 25) ufs_common
8) jasper/2.0.25 17) crtm/2.4.0 26) ufs_hera.gnu
9) zlib/1.2.11 18) g2/3.4.5
Note that there is no gnu/10.2 module loaded. When I run gcc, I see the compiler is version 4.8.5:
$ gcc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
I think this is because, in gnu-10.2_openmpi/modulefiles/core/hpc-gnu/10.2.lua, two lines:
load(compiler)
prereq(compiler)
are missing:
$ cat gnu-10.2_openmpi/modulefiles/core/hpc-gnu/10.2.lua
...
local compiler = pathJoin("gnu",pkgVersion)
local opt = os.getenv("HPC_OPT") or os.getenv("OPT") or "/opt/modules"
local mpath = pathJoin(opt,"modulefiles/compiler","gnu",pkgVersion)
prepend_path("MODULEPATH", mpath)
...
which are present in:
$ cat gnu-9.2_openmpi-3.1.4/modulefiles/core/hpc-gnu/9.2.0.lua
...
local compiler = pathJoin("gnu",pkgVersion)
load(compiler)
prereq(compiler)
local opt = os.getenv("HPC_OPT") or os.getenv("OPT") or "/opt/modules"
local mpath = pathJoin(opt,"modulefiles/compiler","gnu",pkgVersion)
prepend_path("MODULEPATH", mpath)
...
There is also an unnecessary inconsistency in the naming of the hpc-gnu module between the two versions:
$ ll gnu-9.2_openmpi-3.1.4/modulefiles/core/hpc-gnu/
total 4
-rw-r--r-- 1 role.epic nems 749 Oct 28 22:07 9.2.0.lua
$ ll gnu-10.2_openmpi/modulefiles/core/hpc-gnu/
total 4
-rw-r--r-- 1 role.epic nems 717 Oct 24 12:59 10.2.lua
Why '10.2' and not '10.2.0'? Also, the 9.2 stack directory name includes the openmpi version, while the directory for the 10.2 stack does not.
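The missing load(compiler)/prereq(compiler) lines can be checked for mechanically. A minimal sketch (the `check_loads_compiler` helper name is made up; it just greps the modulefile text):

```shell
#!/bin/sh
# Sketch: verify that an hpc-<compiler> modulefile actually loads its
# compiler. The gnu-10.2_openmpi 10.2.lua above was missing both lines.
check_loads_compiler() {
  grep -q 'load(compiler)' "$1" && grep -q 'prereq(compiler)' "$1"
}
```

Run against the two files above, this would succeed for gnu-9.2_openmpi-3.1.4's 9.2.0.lua and fail for gnu-10.2_openmpi's 10.2.lua.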
my apologies, @DusanJovic-NOAA i will install esmf/8.3.0b09-debug in /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2_openmpi-3.1.4 now and update you when it is finished. we will also address the inconsistency in naming convention and look into the gnu-10.2 modulefile. thank you for testing w/ these stacks.
@DusanJovic-NOAA the stack at /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-9.2_openmpi-3.1.4 has been updated to include esmf/8.3.0b09-debug. i was able to load ufs_common_debug.lua, so hopefully it works for you now!
@DusanJovic-NOAA, @ulmononian - please note that GNU 10.2.0 is not installed system-wide on Hera; it is only installed locally in EPIC space. It could be built under the current hpc-stack for a particular compiler/MPI/netcdf installation location, but because the compiler is shared between several such combinations, it was moved to a common location outside any given hpc-stack installation.
Please note that the directions for loading the compilers and stack given in the first comment address the way the compiler is loaded. For example,
Hera gnu/10.2 + mpich/3.3.2:
module use /scratch1/NCEPDEV/nems/role.epic/gnu/modulefiles
module load gnu/10.2.0
module use /scratch1/NCEPDEV/nems/role.epic/hpc-stack/libs/gnu-10.2/modulefiles/stack
module load hpc/1.2.0
module load hpc-gnu/10.2
module load mpich/3.3.2
module load hpc-mpich/3.3.2
The modulefiles for GNU 10.2.0 had to be manually adjusted to allow a customized location of the gnu/10.2.0 compiler, a path that is only listed when the hpc-stack is requested to load. The stack would not find the compiler by default, because the modulepath is not known: it is neither the system-wide installation path nor under the given hpc-stack combination.
I hope this resolves the questions about the use of the GNU 10.2.0 compiler!
@DusanJovic-NOAA - as to the questions about the use of 9.2 vs. 9.2.0 or 10.2 vs. 10.2.0 - it is purely for legacy reasons. I did see that previous hpc-stack installations used XX.X abbreviations. However, you do need to give the full version of the compiler as it is installed system-wide, which is 9.2.0 in this case. GNU 10.2.0 was installed in EPIC space to match the gnu/9.2.0 convention, using XX.X.X. If there is a strong preference for the XX.X.X form (as in the system-wide gnu/9.2.0 install), it could relatively easily be reinstalled in a new location.
@ulmononian Thanks for adding the debug build of esmf. I ran the control and control_debug regression tests; both finished successfully. The control test outputs are not bit-identical to the baseline, while the control_debug outputs are identical. I guess this is expected due to the different MPI library.
@natalie-perlin I tried to run the control and control_debug tests after loading the gnu module from the location above (thanks for explaining this, I missed that in the description). The control test compiled successfully, but failed at run time:
+ sleep 1
+ srun --label -n 160 ./fv3.exe
1: [h12c01:06674] OPAL ERROR: Unreachable in file ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c at line 112
90: [h20c56:12037] OPAL ERROR: Unreachable in file ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c at line 112
55: [h12c04:153910] OPAL ERROR: Unreachable in file ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c at line 112
144: [h21c53:84991] OPAL ERROR: Unreachable in file ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c at line 112
....
38: [h12c01:06711] OPAL ERROR: Unreachable in file ../../../../../opal/mca/pmix/pmix3x/pmix3x_client.c at line 112
43: --------------------------------------------------------------------------
43: The application appears to have been direct launched using "srun",
43: but OMPI was not built with SLURM's PMI support and therefore cannot
43: execute. There are several options for building PMI support under
43: SLURM, depending upon the SLURM version you are using:
43:
43: version 16.05 or later: you can use SLURM's PMIx support. This
43: requires that you configure and build SLURM --with-pmix.
43:
43: Versions earlier than 16.05: you must use either SLURM's PMI-1 or
43: PMI-2 support. SLURM builds PMI-1 by default, or you can manually
43: install PMI-2. You must then build Open MPI using --with-pmi pointing
43: to the SLURM PMI library location.
43:
43: Please configure as appropriate and try again.
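The error above means this OpenMPI build lacks SLURM PMI/PMIx support, so direct `srun` launch cannot work; a common workaround is to launch with OpenMPI's own `mpirun` instead. A minimal sketch of that decision (the `pick_launcher` helper is hypothetical; on a real system its argument would come from something like `ompi_info --parsable | grep -i pmi`):

```shell
#!/bin/sh
# Sketch: fall back to mpirun when the OpenMPI build reports no PMI support.
# $1: PMI-related lines from ompi_info output (empty if no PMI support).
pick_launcher() {
  if [ -n "$1" ]; then
    echo "srun"      # PMI/PMIx support present: direct srun launch works
  else
    echo "mpirun"    # no PMI support: direct srun launch fails as above
  fi
}

pick_launcher ""   # prints: mpirun
```

The cleaner long-term fix, as the error message says, is to rebuild OpenMPI with PMIx (or SLURM PMI) support so the existing `srun --label -n 160 ./fv3.exe` job scripts keep working unchanged.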