
`nvhpc` compiler tests are failing on cheyenne/derecho

Open ekluzek opened this issue 3 years ago • 7 comments

Brief summary of bug

MPI tests with DEBUG on are failing at runtime with the nvhpc compiler on cheyenne. This continues on Derecho in ctsm5.1.dev155-38-g5c8f17b1a (the derecho1 branch).

General bug information

CTSM version you are using: ctsm5.1.dev082 in cesm2_3_alpha08d

Does this bug cause significantly incorrect results in the model's science? No

Configurations affected: tests with nvhpc and DEBUG on

Details of bug

These tests fail:

SMS_D.f19_g17.IHistClm50Bgc.cheyenne_nvhpc.clm-decStart
SMS_D.f45_f45_mg37.I2000Clm50FatesRs.cheyenne_nvhpc.clm-FatesColdDef
SMS_D_Ld1.f10_f10_mg37.I1850Clm50Sp.cheyenne_nvhpc.clm-default
SMS_D_Ld1_P25x1.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default

While these DEBUG-off tests PASS:

SMS.f19_g17.IHistClm50Bgc.cheyenne_nvhpc.clm-decStart
SMS_Ld1.f10_f10_mg37.I1850Clm50Sp.cheyenne_nvhpc.clm-default

As do these mpi-serial tests:

SMS_D_Ld1_Mmpi-serial.1x1_brazil.I2000Clm50SpRs.cheyenne_nvhpc.clm-default
SMS_D_Ld1_Mmpi-serial.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default
SMS_D_Mmpi-serial.1x1_brazil.I2000Clm50FatesRs.cheyenne_nvhpc.clm-FatesColdDef
SMS_D_Mmpi-serial.1x1_brazil.IHistClm50BgcQianRs.cheyenne_nvhpc.clm-default
SMS_Mmpi-serial.1x1_brazil.IHistClm50BgcQianRs.cheyenne_nvhpc.clm-default

Important details of your setup / configuration so we can reproduce the bug

Important output or errors that show the problem

For the smallest case: SMS_D_Ld1_P25x1.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default

The only log file available is the cesm.log file, shown below.

cesm.log file:

 (t_initf)       profile_single_file=       F
 (t_initf)       profile_global_stats=      T
 (t_initf)       profile_ovhd_measurement=  F
 (t_initf)       profile_add_detail=        F
 (t_initf)       profile_papi_enable=       F
[r12i4n4:35002:0:35002] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35003:0:35003] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35004:0:35004] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35006:0:35006] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35007:0:35007] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35008:0:35008] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35010:0:35010] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35011:0:35011] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35012:0:35012] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35013:0:35013] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35014:0:35014] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35015:0:35015] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35017:0:35017] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35018:0:35018] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35019:0:35019] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35020:0:35020] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35022:0:35022] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35000:0:35000] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35001:0:35001] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
[r12i4n4:35016:0:35016] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x4f1b591)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 21 in communicator MPI COMMUNICATOR 3 CREATE FROM 0
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
==== backtrace (tid:  35022) ====
 0  /glade/u/apps/ch/opt/ucx/1.11.0/lib/libucs.so.0(ucs_handle_error+0xe4) [0x2ba9d97301a4]
 1  /glade/u/apps/ch/opt/ucx/1.11.0/lib/libucs.so.0(+0x2a4cc) [0x2ba9d97304cc]
 2  /glade/u/apps/ch/opt/ucx/1.11.0/lib/libucs.so.0(+0x2a73b) [0x2ba9d973073b]
 3  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI6LogErr13MsgFoundErrorEiPKciS2_S2_Pi+0x34) [0x2ba9b78f4c74]
 4  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI7MeshCap22meshcreatenodedistgridEPi+0x7f) [0x2ba9b7b15ebf]
 5  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(c_esmc_meshcreatenodedistgrid_+0xc1) [0x2ba9b7b61141]
 6  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_meshmod_esmf_meshaddelements_+0xbc0) [0x2ba9b881c880]
 7  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_meshmod_esmf_meshcreatefromunstruct_+0x4d0f) [0x2ba9b88246cf]
 8  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_meshmod_esmf_meshcreatefromfile_+0x270) [0x2ba9b881f270]
 9  /glade/scratch/erik/SMS_D_Ld1_P25x1.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default.GC.cesm2_3_alpha8achlist/bld/cesm.exe() [0x15d8fd0]
10  /glade/scratch/erik/SMS_D_Ld1_P25x1.5x5_amazon.I2000Clm50SpRs.cheyenne_nvhpc.clm-default.GC.cesm2_3_alpha8achlist/bld/cesm.exe() [0x632341]
11  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI6FTable12callVFuncPtrEPKcPNS_2VMEPi+0xc30) [0x2ba9b77436b0]
12  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ESMCI_FTableCallEntryPointVMHop+0x293) [0x2ba9b773e913]
13  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI3VMK5enterEPNS_7VMKPlanEPvS3_+0xbb) [0x2ba9b7f7b9fb]
14  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI2VM5enterEPNS_6VMPlanEPvS3_+0xbe) [0x2ba9b7fa3bbe]
15  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(c_esmc_ftablecallentrypointvm_+0x393) [0x2ba9b773edd3]
16  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_compmod_esmf_compexecute_+0xa26) [0x2ba9b82d2c66]
17  /glade/p/cesmdata/cseg/PROGS/esmf/8.3.0b05/openmpi/4.1.1/nvhpc/21.11/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_gridcompmod_esmf_gridcompinitialize_+0x1de) [0x2ba9b85a5ede]

ekluzek avatar Apr 30 '22 21:04 ekluzek

Updating to ccs_config_cesm0.0.65 via #2000 now results in all of the nvhpc tests on cheyenne failing at run time. Updating to cesm2_3_beta15 is expected to resolve this.

glemieux avatar Aug 09 '23 16:08 glemieux

In the CESM3_dev branch two of the tests now PASS:

SMS.f10_f10_mg37.I2000Clm50BgcCrop.cheyenne_nvhpc.clm-crop (FAILED PREVIOUSLY)
SMS.f45_f45_mg37.I2000Clm51FatesSpRsGs.cheyenne_nvhpc.clm-FatesColdSatPhen (FAILED PREVIOUSLY)

This one still fails, but now with a floating-point exception:

SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.cheyenne_nvhpc.clm-crop EXPECTED

The cesm.log file shows a problem in ESMF at initialization, while creating an ESMF mesh. PET files aren't dropped by default in this case...

cesm.log:

[1,0]<stderr>: (t_initf)       profile_papi_enable=       F
[1,0]<stdout>: /glade/work/erik/ctsm_worktrees/cesm3_dev/share/src/shr_file_mod.F90
[1,0]<stdout>:          912 This routine is depricated - use shr_log_setLogUnit instead
[1,0]<stdout>:          -18
[1,0]<stdout>: /glade/work/erik/ctsm_worktrees/cesm3_dev/share/src/shr_file_mod.F90
[1,0]<stdout>:          912 This routine is depricated - use shr_log_setLogUnit instead
[1,0]<stdout>:          -25
[1,0]<stderr>:[r3i7n18:45933:0:45933] Caught signal 8 (Floating point exception: floating-point invalid operation)
[1,36]<stderr>:[r3i7n33:33507:0:33507] Caught signal 8 (Floating point exception: floating-point invalid operation)
[1,0]<stderr>:==== backtrace (tid:  45933) ====
[1,0]<stderr>: 0  /glade/u/apps/ch/opt/ucx/1.12.1/lib/libucs.so.0(ucs_handle_error+0x134) [0x2ae710b0fd74]
[1,0]<stderr>: 1  /glade/u/apps/ch/opt/ucx/1.12.1/lib/libucs.so.0(+0x2e0dc) [0x2ae710b100dc]
[1,0]<stderr>: 2  /glade/u/apps/ch/opt/ucx/1.12.1/lib/libucs.so.0(+0x2e463) [0x2ae710b10463]
[1,0]<stderr>: 3  /glade/u/apps/ch/opt/openmpi/4.1.4/nvhpc/22.2/lib/libmca_common_ompio.so.41(mca_common_ompio_simple_grouping+0xe4) [0x2ae71fa93a64]
[1,0]<stderr>: 4  /glade/u/apps/ch/opt/openmpi/4.1.4/nvhpc/22.2/lib/libmca_common_ompio.so.41(mca_common_ompio_set_view+0x937) [0x2ae71fa9c877]
[1,0]<stderr>: 5  /glade/u/apps/ch/opt/openmpi/4.1.4/nvhpc/22.2/lib/openmpi/mca_io_ompio.so(mca_io_ompio_file_set_view+0xc7) [0x2ae720cf2347]
[1,0]<stderr>: 6  /glade/u/apps/ch/opt/openmpi/4.1.4/nvhpc/22.2/lib/libmpi.so.40(PMPI_File_set_view+0x1a4) [0x2ae6f30a68e4]
[1,0]<stderr>: 7  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ncmpio_file_set_view+0x161) [0x2ae6f034d4a1]
[1,0]<stderr>: 8  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(+0x32e28e2) [0x2ae6f032b8e2]
[1,0]<stderr>: 9  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(+0x32e1469) [0x2ae6f032a469]
[1,0]<stderr>:10  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(+0x32e02c6) [0x2ae6f03292c6]
[1,0]<stderr>:11  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(+0x32df9d2) [0x2ae6f03289d2]
[1,0]<stderr>:12  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ncmpio_wait+0x9f) [0x2ae6f032855f]
[1,0]<stderr>:13  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ncmpio_get_varn+0x9f) [0x2ae6f032781f]
[1,0]<stderr>:14  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ncmpi_get_varn_all+0x2d7) [0x2ae6f02be097]
[1,0]<stderr>:15  /glade/scratch/erik/tests_ctsm51d145cesm3n3acl/SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.cheyenne_nvhpc.clm-crop.GC.ctsm51d145cesm3n3acl_nvh/bld/cesm.exe() [0x1a758be]
[1,0]<stderr>:16  /glade/scratch/erik/tests_ctsm51d145cesm3n3acl/SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.cheyenne_nvhpc.clm-crop.GC.ctsm51d145cesm3n3acl_nvh/bld/cesm.exe(PIOc_read_darray+0x413) [0x1a72c53]
[1,0]<stderr>:17  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_Z37get_numElementConn_from_ESMFMesh_fileiiPcxiPxRPi+0x48e) [0x2ae6ee1c7d8e]
[1,0]<stderr>:18  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_Z42get_elemConn_info_2Dvar_from_ESMFMesh_fileiiPcxiPiRiRS0_S2_+0x99) [0x2ae6ee1c9c19]
[1,0]<stderr>:19  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_Z36get_elemConn_info_from_ESMFMesh_fileiiPcxiPiRiRS0_S2_+0x28c) [0x2ae6ee1caa4c]
[1,0]<stderr>:20  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_Z36ESMCI_mesh_create_from_ESMFMesh_fileiPcb18ESMC_CoordSys_FlagPN5ESMCI8DistGridEPPNS1_4MeshE+0x63a) [0x2ae6ee6bc87a]
[1,0]<stderr>:21  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_Z27ESMCI_mesh_create_from_filePc20ESMC_FileFormat_Flagbb18ESMC_CoordSys_Flag17ESMC_MeshLoc_FlagS_PN5ESMCI8DistGridES5_PPNS3_4MeshEPi+0x2eb) [0x2ae6ee6bb8eb]
[1,0]<stderr>:22  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI7MeshCap21meshcreatefromfilenewEPc20ESMC_FileFormat_Flagbb18ESMC_CoordSys_Flag17ESMC_MeshLoc_FlagS1_PNS_8DistGridES6_Pi+0x99) [0x2ae6ee675919]
[1,0]<stderr>:23  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(c_esmc_meshcreatefromfile_+0x1a7) [0x2ae6ee6c51a7]
[1,0]<stderr>:24  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_meshmod_esmf_meshcreat[1,0]<stderr>:efromfile_+0x217) [0x2ae6ef401fd7]
[1,0]<stderr>:25  /glade/scratch/erik/tests_ctsm51d145cesm3n3acl/SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.cheyenne_nvhpc.clm-crop.GC.ctsm51d145cesm3n3acl_nvh/bld/cesm.exe() [0x17668d1]
[1,0]<stderr>:26  /glade/scratch/erik/tests_ctsm51d145cesm3n3acl/SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.cheyenne_nvhpc.clm-crop.GC.ctsm51d145cesm3n3acl_nvh/bld/cesm.exe() [0x61af01]
[1,0]<stderr>:27  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI6FTable12callVFuncPtrEPKcPNS_2VMEPi+0xc3c) [0x2ae6ee25633c]
[1,0]<stderr>:28  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ESMCI_FTableCallEntryPointVMHop+0x293) [0x2ae6ee251953]
[1,0]<stderr>:29  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI3VMK5enterEPNS_7VMKPlanEPvS3_+0xbb) [0x2ae6eeaf82fb]
[1,0]<stderr>:30  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI2VM5enterEPNS_6VMPlanEPvS3_+0xbe) [0x2ae6eeb2237e]
[1,0]<stderr>:31  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(c_esmc_ftablecallentrypointvm_+0x393) [0x2ae6ee251e13]
[1,0]<stderr>:32  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_compmod_esmf_compexecute_+0xab0) [0x2ae6eee59870]
[1,0]<stderr>:33  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_gridcompmod_esmf_gridcompinitialize_+0x1de) [0x2ae6ef13f35e]
[1,0]<stderr>:34  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(nuopc_driver_loopmodelcompss_+0x1036) [0x2ae6ef8ad876]
[1,0]<stderr>:35  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(nuopc_driver_initializeipdv02p3_+0x2208) [0x2ae6ef89fcc8]
[1,0]<stderr>:36  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI6FTable12callVFuncPtrEPKcPNS_2VMEPi+0xc3c) [0x2ae6ee25633c]
[1,0]<stderr>:37  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(ESMCI_FTableCallEntryPointVMHop+0x293) [0x2ae6ee251953]
[1,0]<stderr>:38  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI3VMK5enterEPNS_7VMKPlanEPvS3_+0xbb) [0x2ae6eeaf82fb]
[1,0]<stderr>:39  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(_ZN5ESMCI2VM5enterEPNS_6VMPlanEPvS3_+0xbe) [0x2ae6eeb2237e]
[1,0]<stderr>:40  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(c_esmc_ftablecallentrypointvm_+0x393) [0x2ae6ee251e13]
[1,0]<stderr>:41  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_compmod_esmf_compexecute_+0xab0) [0x2ae6eee59870]
[1,0]<stderr>:42  /glade/p/cesmdata/cseg/PROGS/esmf/8.5.0/openmpi/4.1.4/nvhpc/22.2/lib/libg/Linux.nvhpc.64.openmpi.default/libesmf.so(esmf_gridcompmod_esmf_gridcompinitialize_+0x1de) [0x2ae6ef13f35e]

ekluzek avatar Nov 07 '23 08:11 ekluzek

Seeing similar errors on Derecho:

These PASS:
SMS.f10_f10_mg37.I2000Clm50BgcCrop.derecho_nvhpc.clm-crop
SMS.f45_f45_mg37.I2000Clm51FatesSpRsGs.derecho_nvhpc.clm-FatesColdSatPhen

These FAIL:
ERP_D_P128x2_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.derecho_nvhpc.clm-default
ERS_D_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.derecho_nvhpc.clm-default
SMS_D.f10_f10_mg37.I2000Clm51BgcCrop.derecho_nvhpc.clm-crop
SMS_D_Ld1_Mmpi-serial.f45_f45_mg37.I2000Clm50SpRs.derecho_nvhpc.clm-ptsRLA
SMS_D_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.derecho_nvhpc.clm-default

The failures are now all at build time, with an error message from the FATES code like this:

Lowering Error: symbol hlm_pft_map$sd is an inconsistent array descriptor
NVFORTRAN-F-0000-Internal compiler error. Errors in Lowering       1  (/glade/work/erik/ctsm_worktrees/external_updates/src/fates/main/EDPftvarcon.F90: 2191)
NVFORTRAN/x86-64 Linux 23.5-0: compilation aborted
gmake: *** [/glade/derecho/scratch/erik/tests_ctsm51d155derechoacl/SMS_D_Ld3.f10_f10_mg37.I1850Clm50BgcCrop.derecho_nvhpc.clm-default.GC.ctsm51d155derechoacl_nvh/Tools/Makefile:978: EDPftvarcon.o] Error 2
gmake: *** Waiting for unfinished jobs....

Looking at the code I don't see an obvious problem. Searching turns up some NVIDIA nvhpc reports about this kind of error, but it's not obvious what the issue is here or how to fix it.
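
Internal compiler errors of this "inconsistent array descriptor" lowering class generally involve the descriptor of an allocatable or assumed-shape array. The following is a hypothetical reduction, NOT the actual EDPftvarcon.F90 code: the names, shapes, and call pattern are guesses at the general construct involved. Compiling a small file like this by itself with the same nvfortran version and the case's DEBUG flags would show whether the ICE reproduces outside of FATES.

```fortran
! Hypothetical reduction, NOT the actual EDPftvarcon.F90 code: names,
! shapes, and the call pattern are guesses at the general construct.
module pftmap_repro
  implicit none

  ! Stand-in for hlm_pft_map: a FATES-PFT x host-model-PFT weight table.
  real, allocatable :: hlm_pft_map(:,:)

contains

  subroutine init_map(nfates, nhlm)
    integer, intent(in) :: nfates, nhlm
    allocate(hlm_pft_map(nfates, nhlm))
    hlm_pft_map = 0.0
  end subroutine init_map

  subroutine use_column(weights)
    ! Assumed-shape dummy; descriptor handling for calls like the one in
    ! report_map below is where this class of lowering ICE tends to appear.
    real, intent(in) :: weights(:)
    write(*,*) sum(weights)
  end subroutine use_column

  subroutine report_map()
    integer :: i
    do i = 1, size(hlm_pft_map, dim=2)
      call use_column(hlm_pft_map(:, i))  ! pass a section of the module array
    end do
  end subroutine report_map

end module pftmap_repro
```

If a sketch like this compiles cleanly, the trigger is something more specific in EDPftvarcon.F90 around line 2191, and a reduction would need to be grown from that routine instead.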

ekluzek avatar Dec 01 '23 06:12 ekluzek

A reminder that nvhpc is important for the flexibility to start using GPUs, and since Derecho has NVIDIA GPUs, nvhpc is likely to be the most performant compiler for them.

Even though GPUs don't currently look important for most uses of CTSM, they will matter for ultra-high resolution. And as hardware changes in the future, it's important to have the flexibility in the model to take advantage of different types of hardware in order to keep the model performing well.

ekluzek avatar Apr 16 '24 17:04 ekluzek

Corrected above that Derecho has NVIDIA GPUs. From talking with @sherimickelson, and from slides her group presented at the Sep 12, 2023 CSEG meeting, the nvhpc and cray compilers work for the Derecho GPUs, but intel-oneapi did not at the time.

ekluzek avatar Apr 16 '24 17:04 ekluzek

We talked about this in the CSEG meeting. The takeaways are:

Jim feels that we do want to test with nvhpc, so that we know if things start failing. If we need to write a bug report, we can do that and then move on. Brian agrees that testing with it is good, but supporting nvhpc shouldn't be a requirement for CESM3.

ekluzek avatar Apr 24 '24 02:04 ekluzek

This is great news. Thanks, @ekluzek, for sharing this and for your support.

sherimickelson avatar Apr 24 '24 14:04 sherimickelson