Floating point divide-by-zero error in test_mpp_io

Open wrongkindofdoctor opened this issue 4 years ago • 9 comments

Describe the bug

test_mpp_io crashes with a floating point divide-by-zero error when a time axis is registered to a netcdf file via mpp_write_meta, which calls nf_def_var.

To Reproduce

configure the environment on the Skylake or AMD box to run with intel19/20, run make distcheck, be c

Expected behavior

The time axis is registered to the file and the test runs successfully.

System Environment

  • OS: CENTOS8 (AMD), CENTOS7(Skylake)
  • Compiler(s): intel 19/20
  • MPI type, and version: impi2020_up2 (AMD), impi2020_up5 (Skylake)
  • netCDF Version: netcdf 4.6.1
  • Configure options:
      FCFLAGS: -O0 -g -traceback -check all -check noarg_temp_created -check nopointer -nowarn -ftz -auto -safe-cray-ptr -ftrapuv -I/opt/netcdf/4.6.1/INTEL/include/
      CFLAGS: -O0 -g -traceback -ftrapuv -I/opt/netcdf/4.6.1/INTEL/include/

Additional context: stack trace

Using NEW domaintypes and calls...
 netCDF single thread write
forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source             
test_mpp_io        000000000044826B  Unknown               Unknown  Unknown
libpthread-2.28.s  00007F7686ED5DD0  Unknown               Unknown  Unknown
libnetcdf.so.13.1  00007F7689EB0842  Unknown               Unknown  Unknown
libnetcdf.so.13.1  00007F7689EAE5F4  NC4_def_var           Unknown  Unknown
libnetcdf.so.13.1  00007F7689E32B7B  nc_def_var            Unknown  Unknown
libnetcdff.so.6.1  00007F768A14E47F  nf_def_var_           Unknown  Unknown
libFMS.so.4.0.0    00007F768B70EEB9  mpp_io_mod_mp_mpp         459  mpp_io_write.inc
test_mpp_io        000000000041E17E  test_IP_test_netc         394  test_mpp_io.F90
test_mpp_io        000000000040EAF7  MAIN__                    123  test_mpp_io.F90
test_mpp_io        000000000040DD62  Unknown               Unknown  Unknown
libc-2.28.so       00007F768671E6A3  __libc_start_main     Unknown  Unknown
test_mpp_io        000000000040DC6E  Unknown               Unknown  Unknown
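
When reading a trace like this, the frames that carry a line number and source file are usually the quickest way to locate the failing call. The following is a minimal sketch, assuming the five-column layout (Image, PC, Routine, Line, Source) seen in the trace above; the helper name `frames_with_source` is hypothetical and not part of FMS or any Intel tool.

```python
# Sketch: pull the frames that carry source/line information out of an
# Intel Fortran (forrtl) traceback. The column layout is assumed from the
# trace quoted in this issue: Image, PC, Routine, Line, Source.

TRACE = """\
forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source
test_mpp_io        000000000044826B  Unknown               Unknown  Unknown
libpthread-2.28.s  00007F7686ED5DD0  Unknown               Unknown  Unknown
libnetcdf.so.13.1  00007F7689EB0842  Unknown               Unknown  Unknown
libnetcdf.so.13.1  00007F7689EAE5F4  NC4_def_var           Unknown  Unknown
libnetcdf.so.13.1  00007F7689E32B7B  nc_def_var            Unknown  Unknown
libnetcdff.so.6.1  00007F768A14E47F  nf_def_var_           Unknown  Unknown
libFMS.so.4.0.0    00007F768B70EEB9  mpp_io_mod_mp_mpp         459  mpp_io_write.inc
test_mpp_io        000000000041E17E  test_IP_test_netc         394  test_mpp_io.F90
test_mpp_io        000000000040EAF7  MAIN__                    123  test_mpp_io.F90
"""


def frames_with_source(trace):
    """Return (image, routine, line, source) for frames with a numeric line."""
    frames = []
    for line in trace.splitlines():
        parts = line.split()
        # Skip the error message, the header row, and frames whose
        # Line column is "Unknown" (no debug information available).
        if len(parts) != 5 or not parts[3].isdigit():
            continue
        image, _pc, routine, lineno, source = parts
        frames.append((image, routine, int(lineno), source))
    return frames


for image, routine, lineno, source in frames_with_source(TRACE):
    print(f"{source}:{lineno}  {routine}  ({image})")
```

For this trace, the first frame with source information is line 459 of mpp_io_write.inc, directly below the nf_def_var_ frame. The netCDF frames have no line numbers because those libraries were built without debug information.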

wrongkindofdoctor, Oct 05 '20 13:10

@rem1776 can you look into this?

thomas-robinson, Oct 07 '20 10:10

@uramirez8707 This looks like a version issue with netcdf: nf_def_var only throws this error with netcdf/4.6.1, while the same call with the same arguments returns successfully with 4.7.4. I also tried a different mpp_io test, and it failed as well when the time axis was written.

rem1776, Oct 09 '20 15:10

@rem1776 have you looked at the values that it's writing?

thomas-robinson, Oct 09 '20 16:10

@thomas-robinson From here, it looks like it's writing a double. The call to mpp_write_meta is made with t, which is of type axistype, and I don't think t gets modified before the test.

The netcdf file itself also gave an HDF error when I tried to run ncdump on it, but I'm not sure how relevant that is.

rem1776, Oct 09 '20 17:10

No, I mean have you looked at the actual data? Is it all numbers? Is something off about it?

thomas-robinson, Oct 09 '20 18:10

No, the axis data looks to be the same in both cases.

rem1776, Oct 09 '20 18:10

FWIW, I've been able to isolate this to the -ftrapuv flag, which "Initializes stack local variables to an unusual value to aid error detection". If I take the flag out, the test runs without the exception.

GFDL-Eric, Oct 13 '20 16:10

According to this Intel article, maybe we should replace -ftrapuv with -check uninit? The test passes when I make this replacement.
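
As an illustration of that replacement, here is a small sketch that swaps the flag in the FCFLAGS string from the issue description. The helper name `replace_flag` is hypothetical, and the leading `-O0` is assumed (the original listing shows `O0` without the dash). Only the string substitution is shown here; how the flags are actually passed to configure is not covered by this thread.

```python
import shlex


def replace_flag(flags, old, new):
    """Return `flags` with each token equal to `old` replaced by the tokens of `new`."""
    out = []
    for tok in shlex.split(flags):
        if tok == old:
            out.extend(shlex.split(new))  # `new` may expand to several tokens
        else:
            out.append(tok)
    return " ".join(out)


# FCFLAGS as listed in the issue description (leading -O0 is an assumption).
FCFLAGS = (
    "-O0 -g -traceback -check all -check noarg_temp_created -check nopointer "
    "-nowarn -ftz -auto -safe-cray-ptr -ftrapuv "
    "-I/opt/netcdf/4.6.1/INTEL/include/"
)

NEW_FCFLAGS = replace_flag(FCFLAGS, "-ftrapuv", "-check uninit")
print(NEW_FCFLAGS)
```

Matching on whole tokens rather than substrings avoids accidentally rewriting a longer flag or a path that happens to contain the same text.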

GFDL-Eric, Oct 13 '20 16:10

Update: the same error occurs in test_diag_manager when writing time axis metadata to a netcdf file. OS/ENV: Skylake, intel18_up4, run make distcheck

test1.1 successful: module/output_field=test_diag_manager_mod/dat1  Bounds of buffer exceeded.  Buffer bounds=  1: 10,  1: 10,  1: 10  Actual bounds=  1: 20,  1: 20,  1: 10
test1.2 successful
NOTE: Potential error in diag_manager_end: dat1 NOT available, check if output interval > runlength. Netcdf fill_values are written
forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source             
test_diag_manager  000000000045460E  Unknown               Unknown  Unknown
libpthread-2.17.s  00007FB8D6AB0630  Unknown               Unknown  Unknown
libnetcdf.so.13.1  00007FB8D918FDE6  Unknown               Unknown  Unknown
libnetcdf.so.13.1  00007FB8D918D9E2  NC4_def_var           Unknown  Unknown
libnetcdf.so.13.1  00007FB8D9102FAB  nc_def_var            Unknown  Unknown
libnetcdff.so.6.1  00007FB8D943FDC8  nf_def_var_           Unknown  Unknown
libFMS.so.5.0.0    00007FB8DA9A31E8  mpp_io_mod_mp_mpp         459  mpp_io_write.inc
libFMS.so.5.0.0    00007FB8DB2CA3A8  diag_output_mod_m        1571  diag_output.F90
libFMS.so.5.0.0    00007FB8DB33824A  diag_util_mod_mp_        2060  diag_util.F90
libFMS.so.5.0.0    00007FB8DB3631A1  diag_util_mod_mp_        2770  diag_util.F90
libFMS.so.5.0.0    00007FB8DB3563A4  diag_util_mod_mp_        2623  diag_util.F90
libFMS.so.5.0.0    00007FB8DB508365  diag_manager_mod_        3799  diag_manager.F90
libFMS.so.5.0.0    00007FB8DB5001F9  diag_manager_mod_        3705  diag_manager.F90
test_diag_manager  000000000042664A  MAIN__                    995  test_diag_manager.F90

wrongkindofdoctor, Oct 22 '20 12:10