
Has anyone run ufs-weather-model on TACC Stampede2 lately?

Open climbfuji opened this issue 2 years ago • 8 comments

Description

I am having trouble running the current ufs-weather-model code on Stampede2. It crashes immediately on startup, doesn't even get to the point where it writes the PET*.ESMF_LogFile files:

c455-083[knl](1020)$ ibrun -n 8 ./fv3.exe
TACC:  Starting up job 9719041
TACC:  Starting parallel tasks...
[0] MPI startup(): I_MPI_OFA_ADAPTER_NAME variable has been removed from the product, its value is ignored

[0] MPI startup(): I_MPI_EXTRA_FILESYSTEM_LIST environment variable is not supported.
[0] MPI startup(): Similar variables:
	 I_MPI_EXTRA_FILESYSTEM
	 I_MPI_EXTRA_FILESYSTEM_FORCE
[0] MPI startup(): To check the list of supported variables, use the impi_info utility or refer to https://software.intel.com/en-us/mpi-library/documentation/get-started.

forrtl: severe (168): Program Exception - illegal instruction
Image              PC                Routine            Line        Source
fv3.exe            000000003770B0AB  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B1B1C435630  Unknown               Unknown  Unknown
fv3.exe            0000000000E601AA  _ZN5ESMCI3VMK4ini         556  ESMCI_VMKernel.C
fv3.exe            0000000001D49D0F  _ZN5ESMCI2VM10ini        3113  ESMCI_VM.C
fv3.exe            0000000000E07DD2  c_esmc_vminitiali        1151  ESMCI_VM_F.C
fv3.exe            00000000015F2398  esmf_vmmod_mp_esm        9265  ESMF_VM.F90
fv3.exe            00000000004C5EF3  esmf_initmod_mp_e         602  ESMF_Init.F90
fv3.exe            00000000004C4DD6  esmf_initmod_mp_e         321  ESMF_Init.F90
fv3.exe            000000000042833B  MAIN__                     93  UFS.F90
fv3.exe            0000000000427F92  Unknown               Unknown  Unknown
libc-2.17.so       00002B1B1CD80555  __libc_start_main     Unknown  Unknown
fv3.exe            0000000000427EA9  Unknown               Unknown  Unknown

This is with esmf/8.3.0b09.

climbfuji • May 25 '22 22:05

Since it crashes so early in the program execution, I would try to run a simple 'Hello ESMF World' type of program with just ESMF_Initialize followed by ESMF_Finalize. Or simply #ifdef out everything between these two calls in UFS.F90. If you are using UFS.F90, I would also remove/comment out USE module_EARTH_GRID_COMP.

DusanJovic-NOAA • May 25 '22 23:05

Does Stampede support AVX2?

DusanJovic-NOAA • May 25 '22 23:05

The coupled prototype p7c was run on Stampede2 during the second half of 2021. It was built without AVX2, although I don't see why it couldn't be built with it, given that the CPUs are very similar to Hera's.

MinsukJi-NOAA • May 25 '22 23:05

> Since it crashes so early in the program execution, I would try to run a simple 'Hello ESMF World' type of program with just ESMF_Initialize followed by ESMF_Finalize. Or simply #ifdef out everything between these two calls in UFS.F90. If you are using UFS.F90, I would also remove/comment out USE module_EARTH_GRID_COMP.

Thanks for that suggestion, Dusan. I did that, same error. Next I'll try to go back to ESMF 8.2.0 and a corresponding earlier version of ESMF before trying other flags.

climbfuji • May 26 '22 03:05

Update (@kgerheiser FYI). ESMF 8.2.0 didn't help either. I compiled ufs in debug mode, so no AVX2.

But I found the reason: the login nodes have Skylake processors, whereas the default development queue on Stampede2 runs on KNL nodes. If I use a Skylake compute node, the code runs :-) That's the good news. The bad news is that something in the pre-compiled libraries (likely ESMF itself) uses nasty flags like -xHOST.

We don't have the problem with hpc-stack or jedi-stack, so I'll need to look into the build options of the ESMF package in spack. I guess that one was contributed by EMC or someone else in the first place? Not me. I took ownership in the meantime ;-)

First thing I noted is that jedi-/hpc-stack use ESMF_BOPT=O and ESMF_OPTLEVEL=2, whereas the spack package just sets ESMF_BOPT=O and uses whatever the default is for ESMF_OPTLEVEL (documentation or log doesn't say what the default is). And of course, ESMF compilation doesn't care about standard verbose flags and therefore doesn't say how it is compiling ...

climbfuji • May 26 '22 20:05
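
The login-node versus compute-node mismatch described above can be checked directly by comparing the instruction-set flags the kernel reports on each node type. The following is a minimal sketch, not part of the original thread: the function name `isa_flags` is hypothetical, and it assumes a Linux system that exposes `/proc/cpuinfo`. KNL processors report `avx512f` and `avx512er` but not `avx512bw` or `avx512vl`, while Skylake server processors report `avx512bw` and `avx512vl`, so a binary or library compiled for the Skylake AVX-512 subset can hit an illegal instruction on KNL.

```shell
#!/bin/sh
# Report whether selected instruction-set flags appear in a cpuinfo file.
# Usage: isa_flags [cpuinfo_file]   (defaults to /proc/cpuinfo)
# Hypothetical helper for illustration; run it on a login node and on a
# compute node and compare the two outputs.
isa_flags() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # Take the first "flags" line and split it into one flag per line.
    flags=$(grep -m1 '^flags' "$cpuinfo" | cut -d: -f2 | tr ' ' '\n')
    for f in avx2 avx512f avx512bw avx512vl avx512er; do
        # -x forces a whole-line match so "avx" does not match "avx2".
        if printf '%s\n' "$flags" | grep -qx "$f"; then
            echo "$f=yes"
        else
            echo "$f=no"
        fi
    done
}
```

If the two node types disagree on `avx512bw` or `avx512vl`, anything built on the login node with host-specific optimization needs to be rebuilt for the common subset or on the target node type.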

This is with Spack libraries? Maybe something with target=skylake? And then trying to run it on a different architecture without the same instructions? You could try setting target=x86_64.

You can set it in packages.yaml

https://spack.readthedocs.io/en/latest/build_settings.html#package-preferences

Also this option in concretizer.yaml: https://spack.readthedocs.io/en/latest/build_settings.html#concretizer-options

You can set it to target a generic architecture (x86_64) rather than a specific one (skylake, haswell). Though, our fork does not have this yet.

kgerheiser • May 26 '22 23:05
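
The packages.yaml suggestion above could be tried in a throwaway configuration scope before touching a real site configuration. This is only a sketch under assumptions: the helper name `write_target_pref` and the scope directory are hypothetical, and the `target: [...]` preference under `packages: all:` is the form documented on the Spack package-preferences page linked above, which may differ in other Spack versions.

```shell
#!/bin/sh
# Write a minimal packages.yaml that prefers a given target for all packages.
# Usage: write_target_pref <scope_dir> [target]   (target defaults to x86_64)
# Hypothetical helper for illustration; it refuses to overwrite an existing file.
write_target_pref() {
    scope_dir="$1"
    target="${2:-x86_64}"
    out="$scope_dir/packages.yaml"
    if [ -e "$out" ]; then
        echo "refusing to overwrite $out" >&2
        return 1
    fi
    mkdir -p "$scope_dir" || return 1
    # Assumed layout: a target preference list under packages -> all.
    cat > "$out" <<EOF
packages:
  all:
    target: [$target]
EOF
}

# Example (not executed here): create the scope, then ask Spack to
# concretize ESMF with that scope and inspect the arch= it reports.
#   write_target_pref "$HOME/spack-generic-scope" x86_64
#   spack -C "$HOME/spack-generic-scope" spec esmf
```

Using a separate scope directory keeps the experiment reversible: deleting the directory restores the previous behavior.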

> This is with Spack libraries? Maybe something with target=skylake? And then trying to run it on a different architecture without the same instructions? You could try setting target=x86_64.

That's a very good point. I'll have to try this again. Right now the target is set as any and the specs have

arch=linux-centos7-skylake_avx512

Weird though, because I did compile the fv3-jedi-bundle-env before and I thought that all ran fine. I'll check again next week!

climbfuji • May 27 '22 03:05

Quick update. If I change the target to x86_64, ESMF is built incorrectly because it decides to use mpicxx instead of mpiicpc, i.e. it doesn't use the correct MPI wrappers. Next I'll try an Intel-specific target that is older than skylake_avx512.

climbfuji • Jun 01 '22 20:06

Closing this since it is not a priority.

climbfuji • Jun 27 '23 02:06