ufs-weather-model
Has anyone run ufs-weather-model on TACC Stampede2 lately?
Description
I am having trouble running the current ufs-weather-model code on Stampede2. It crashes immediately on startup and doesn't even get to the point where it writes the PET*.ESMF_LogFile files:
c455-083[knl](1020)$ ibrun -n 8 ./fv3.exe
TACC: Starting up job 9719041
TACC: Starting parallel tasks...
[0] MPI startup(): I_MPI_OFA_ADAPTER_NAME variable has been removed from the product, its value is ignored
[0] MPI startup(): I_MPI_EXTRA_FILESYSTEM_LIST environment variable is not supported.
[0] MPI startup(): Similar variables:
I_MPI_EXTRA_FILESYSTEM
I_MPI_EXTRA_FILESYSTEM_FORCE
[0] MPI startup(): To check the list of supported variables, use the impi_info utility or refer to https://software.intel.com/en-us/mpi-library/documentation/get-started.
forrtl: severe (168): Program Exception - illegal instruction
Image PC Routine Line Source
fv3.exe 000000003770B0AB Unknown Unknown Unknown
libpthread-2.17.s 00002B1B1C435630 Unknown Unknown Unknown
fv3.exe 0000000000E601AA _ZN5ESMCI3VMK4ini 556 ESMCI_VMKernel.C
fv3.exe 0000000001D49D0F _ZN5ESMCI2VM10ini 3113 ESMCI_VM.C
fv3.exe 0000000000E07DD2 c_esmc_vminitiali 1151 ESMCI_VM_F.C
fv3.exe 00000000015F2398 esmf_vmmod_mp_esm 9265 ESMF_VM.F90
fv3.exe 00000000004C5EF3 esmf_initmod_mp_e 602 ESMF_Init.F90
fv3.exe 00000000004C4DD6 esmf_initmod_mp_e 321 ESMF_Init.F90
fv3.exe 000000000042833B MAIN__ 93 UFS.F90
fv3.exe 0000000000427F92 Unknown Unknown Unknown
libc-2.17.so 00002B1B1CD80555 __libc_start_main Unknown Unknown
fv3.exe 0000000000427EA9 Unknown Unknown Unknown
This is with esmf/8.3.0b09.
Since it crashes so early in the program execution, I would try to run simple 'Hello ESMF World' type of program with just ESMF_Initialize followed by ESMF_Finalize. Or simply #ifdef everything between these two calls in UFS.F90. If you are using UFS.F90 I would also remove/comment out USE module_EARTH_GRID_COMP.
Does Stampede support AVX2?
The coupled prototype p7c was run during the second half of 2021. It was built without avx2, although I don't see why it cannot be, given that CPUs are very similar to Hera's.
Thanks for that suggestion, Dusan. I did that, same error. Next I'll try to go back to ESMF 8.2.0 and a corresponding earlier version of ESMF before trying other flags.
Update (@kgerheiser FYI). ESMF 8.2.0 didn't help either. I compiled ufs in debug mode, so no AVX2.
But I found the reason: the login nodes have Skylake processors, while the default develop queue on Stampede2 is KNL. If I use a Skylake compute node, the code runs :-) That's the good news. The bad news is that something in the pre-compiled libraries (likely ESMF itself) uses nasty flags like -xHOST.
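A quick way to see this kind of mismatch is to compare the CPU flags on the login node with those on a compute node. The sketch below is only an illustration: the helper name has_cpu_flag is made up here, and it assumes a Linux node that exposes /proc/cpuinfo. KNL supports AVX-512 F/CD/ER/PF but not the BW/DQ/VL subsets that Skylake adds, so code built with -xHOST on a Skylake node can hit an illegal instruction on KNL.

```shell
#!/bin/sh
# has_cpu_flag FLAG [CPUINFO_FILE]   (hypothetical helper, not part of any stack)
# Succeeds if FLAG appears in the first "flags" line of a cpuinfo-style file.
has_cpu_flag() {
    flag=$1
    cpuinfo=${2:-/proc/cpuinfo}
    # Split the flags line into one token per line, then match the flag exactly.
    grep -m1 '^flags' "$cpuinfo" | tr ' \t' '\n\n' | grep -qx -- "$flag"
}

# Report a few instruction sets on whichever node this runs on.
if [ -r /proc/cpuinfo ]; then
    for f in avx2 avx512f avx512bw avx512er; do
        if has_cpu_flag "$f"; then
            echo "$f: yes"
        else
            echo "$f: no"
        fi
    done
fi
```

Running this once on a login node and once inside a job on the target queue shows which instruction sets the two node types have in common.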
We don't have the problem with hpc-stack or jedi-stack, so I'll need to look into the build options of the ESMF package in spack. I guess that one was contributed by EMC or someone else in the first place? Not me. I took ownership in the meantime ;-)
First thing I noted is that jedi-/hpc-stack use ESMF_BOPT=O and ESMF_OPTLEVEL=2, whereas the spack package just sets ESMF_BOPT=O and uses whatever the default is for ESMF_OPTLEVEL (documentation or log doesn't say what the default is). And of course, ESMF compilation doesn't care about standard verbose flags and therefore doesn't say how it is compiling ...
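For reference, a minimal sketch of setting these two variables explicitly before an ESMF build. The values are the ones quoted above for jedi-/hpc-stack; the function name esmf_build_env is hypothetical, and whether the spack package should export them the same way is an assumption, not something confirmed in this thread.

```shell
#!/bin/sh
# Export the ESMF build options mentioned above (values taken from the thread).
# esmf_build_env is a hypothetical helper name used only for this sketch.
esmf_build_env() {
    export ESMF_BOPT=O        # optimized (non-debug) build
    export ESMF_OPTLEVEL=2    # set the optimization level explicitly
}

esmf_build_env
echo "ESMF_BOPT=$ESMF_BOPT ESMF_OPTLEVEL=$ESMF_OPTLEVEL"

# Next step (not run here): from the ESMF source directory, "make info"
# prints the build configuration, including the compiler flags ESMF will use.
```

Checking the "make info" output for an -xHOST style flag would be one way to confirm or rule out ESMF as the culprit.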
This is with Spack libraries? Maybe something with target=skylake? And then trying to run it on a different architecture without the same instructions? You could try setting target=x86_64.
You can set it in packages.yaml
https://spack.readthedocs.io/en/latest/build_settings.html#package-preferences
Also this option in concretizer.yaml: https://spack.readthedocs.io/en/latest/build_settings.html#concretizer-options
You can set it to target a generic architecture (x86_64) rather than a specific one (skylake, haswell). Our fork does not have this yet, though.
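As a rough sketch of the packages.yaml route, the snippet below writes a target preference to a scratch file so it can be reviewed and merged by hand. The function name write_target_pref and the scratch-file approach are made up for illustration. The YAML layout assumes the packages: all: target: preference form from the package-preferences page linked above, so check it against the Spack version in use.

```shell
#!/bin/sh
# write_target_pref TARGET OUTFILE   (hypothetical helper for this sketch)
# Writes a packages.yaml fragment that prefers TARGET for all packages.
write_target_pref() {
    target=$1
    out=$2
    cat > "$out" <<EOF
packages:
  all:
    target: [$target]
EOF
}

# Write to a scratch file rather than touching a real Spack config.
scratch=$(mktemp)
write_target_pref x86_64 "$scratch"
cat "$scratch"
```

The generated fragment would then be merged into the site or environment packages.yaml before re-concretizing.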
That's a very good point. I'll have to try this again. Right now the target is set as any and the specs have arch=linux-centos7-skylake_avx512.
Weird though, because I did compile the fv3-jedi-bundle-env before and I thought that all ran fine. Check again next week!
Quick update. If I change the target to x86_64, ESMF is built incorrectly because it decides to use mpicxx instead of mpiicpc, i.e. it doesn't use the correct MPI wrappers. Trying an Intel-specific target that is older than skylake_avx512 next.
Closing this since it is not a priority.