SHR_REPROSUM_CALC errors on Crusher

Open grnydawn opened this issue 2 years ago • 1 comments

A case with F2010 and ne30pg2_r05_oECv3 on Crusher failed with the following error message:

SHR_REPROSUM_CALC: Input contains 0.30000E+01 NaNs and 0.00000E+00 INFs on process 3

This error occurs in E3SM builds with each of the Cray, AMD, and GNU compilers available on Crusher.

Interestingly, the F2010/ne30pg2_r05_oECv3 case runs successfully on the Summit system, which shares its input data with Crusher.

The error message seems to come from "E3SM/share/util/shr_reprosum_mod.F90", but it was hard to locate the source line from which this routine is called.
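
For context, the message reports how many NaN and Inf values a process passed into the reproducible sum. The following is a minimal Python sketch of that kind of check, useful for testing a suspect field before it reaches the sum. It is an illustration only, not the Fortran code in shr_reprosum_mod.F90; the names `count_nonfinite`, `check_input`, and `local_field` are hypothetical.

```python
import math


def count_nonfinite(values):
    """Return (nan_count, inf_count) for a list of floats."""
    nan_count = sum(1 for v in values if math.isnan(v))
    inf_count = sum(1 for v in values if math.isinf(v))
    return nan_count, inf_count


def check_input(values, rank):
    """Raise if the local input holds any NaN or Inf (hypothetical helper)."""
    nan_count, inf_count = count_nonfinite(values)
    if nan_count or inf_count:
        raise ValueError(
            f"Input contains {nan_count} NaNs and {inf_count} INFs "
            f"on process {rank}"
        )


# Hypothetical local data resembling the failing case: three NaNs, no Infs.
local_field = [1.0, float("nan"), 2.5, float("nan"), float("nan")]
print(count_nonfinite(local_field))
```

Running a check like this on each field just before it is summed would narrow down which caller supplies the bad values.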

Has anyone seen this error?

The branch that I am working on is ykim/crusher/craydebug, a debug branch created from a recent master.

grnydawn avatar Jun 28 '22 21:06 grnydawn

Just noting that we should try this case with GNU on an x86 machine and see whether we encounter this issue.

sarats avatar Jun 29 '22 22:06 sarats

This error no longer appears with PrgEnv-cray/8.3.3, PrgEnv-amd/8.3.3, and PrgEnv-gnu/8.3.3. It is not clear whether these modules fixed the error; it is also possible that an E3SM compiler configuration caused it. Because the issue no longer occurs, I will close it and re-open if needed.

grnydawn avatar Oct 04 '22 17:10 grnydawn

Just to document:

  • for runs with GNU, reprosum NaNs went away after adding -fno-inline-arg-packing to eam/src/dynamics/se/inidat.F90 in Depends.[gnu,gnugpu].cmake in PR E3SM-Project/E3SM#5132
  • for runs with Cray, the issue was fixed by adding -hipa0 -hzero to FFLAGS in crayclang_crusher.cmake in E3SM-Project/E3SM#5208

amametjanov avatar Oct 04 '22 20:10 amametjanov