PIO error when using gnu (> v10.1.0) and MPT
When using the GNU compiler with MPT, PIO sync fails, seemingly at random, with a segmentation fault (invalid memory reference).
The Intel compiler with MPT works fine.
GNU with OpenMPI also seems to work fine.
This error happens in mizuRoute with large, high-resolution river network data (MERIT-Hydro).
I have been running into this problem for a long time (several years now).
More specifically, the configuration is: gnu v12.1.0, netcdf v4.8.1, pnetcdf v1.12.3, mpt v2.25.
The traceback looks like this (run in debug mode; the flags are -g -Wall -fmax-errors=0 -fbacktrace -fcheck=all). Frames #14 through #24 show no symbols; they are presumably in C code.
Line 1372 of piolib_mod.F90 is just PIOc_sync(file%fh).
#13 0x2b9d2f8c8f66 in PMPI_File_write_at_all
at /usr/src/packages/BUILD/mpt/lib/libmpi/src/romio/mpi-io/write_atall.c:61
#14 0xc53728 in ???
#15 0xc3ae8f in ???
#16 0xc38984 in ???
#17 0xc3a4f2 in ???
#18 0xc369ce in ???
#19 0xc37203 in ???
#20 0xb99763 in ???
#21 0x7b8fc1 in ???
#22 0x7b365e in ???
#23 0x7b917b in ???
#24 0x78559b in ???
#25 0x7077a9 in __piolib_mod_MOD_syncfile
at /glade/u/home/mizukami/sandbox_mizuRoute/libraries/parallelio/src/flib/piolib_mod.F90:1372
#26 0x4193f2 in __pio_utils_MOD_sync_file
at /glade/u/home/mizukami/sandbox_mizuRoute/route/build/../build/src/pio_utils.f90:391
#27 0x46dcc8 in __historyfile_MOD_write_flux
at /glade/u/home/mizukami/sandbox_mizuRoute/route/build/../build/src/historyFile.f90:483
#28 0x58a35e in __write_simoutput_pio_MOD_output
at /glade/u/home/mizukami/sandbox_mizuRoute/route/build/../build/src/write_simoutput_pio.f90:224
#29 0x7042d8 in route_runoff
at /glade/u/home/mizukami/sandbox_mizuRoute/route/build/../build/src/standalone/route_runoff.f90:81
#30 0x7043f7 in main
at /glade/u/home/mizukami/sandbox_mizuRoute/route/build/../build/src/standalone/route_runoff.f90:11
MPT ERROR: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
aborting job
MPT: Received signal 11
I do have some GNU tests that work in the latest...
ERI_Mmpi-serial.5x5_amazon_r05.I2000Clm50SpMizGs.izumi_gnu.mizuroute-default
SMS_D_Mmpi-serial.5x5_amazon_r05.I2000Clm50SpMizGs.izumi_gnu.mizuroute-default
ERI_PS.f19_f19_rHDMAlk_mg17.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
RS_PS.f19_f19_mg17.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
ERS_PS.f19_f19_rHDMAlk_mg17.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
ERS_PS.nldas2_nldas2_rHDMA_mnldas2.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
ERS_PS.nldas2_nldas2_rUSGS_mnldas2.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
PET_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
PET_P215x8.nldas2_nldas2_rHDMA_mnldas2.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
PFS.f19_f19_rHDMA_mg17.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
SMS.f09_f09_rHDMAlk_mg17.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
SMS_D.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
SMS_D_Mmpi-serial.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
SMS_Mmpi-serial_D_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
SMS_P720x4.nldas2_nldas2_rMERIT_mnldas2.I2000Clm50SpMizGs.cheyenne_gnu.mizuroute-default
But it also seems that the failure only shows up after running for at least 10 years.
This has:
gnu/10.1.0 mpt/2.25 netcdf-mpi/4.9.0 pnetcdf/1.12.3
More updates. @ekluzek, do you think this is enough information for someone to identify the root cause of the error?
This test was run on Derecho with gcc and cray-mpich. The modules loaded for compilation and for the runs are:
1) ncarenv/23.09 (S)
2) cmake/3.26.3
3) nccmp/1.9.1.0
4) ncview/2.1.9
5) conda/latest
6) cdo/2.2.2
7) nco/5.1.6
8) gcc/12.2.0
9) hdf5/1.12.2
10) netcdf/4.9.2
11) ncarcompilers/1.0.0
12) craype/2.7.23
13) cray-mpich/8.1.27
14) parallel-netcdf/1.12.3
Note that intel/cray-mpich and gcc/openmpi5.0.0 both work fine.
The run died after several time steps, at the PIO sync call. Using DDT, I was able to trace back to the PIO function where it stopped.
#29 route_runoff () at /glade/u/home/mizukami/model/mizuRoute/route/build/../build/src/standalone/route_runoff.f90:81 (at 0x6e187a)
#28 write_simoutput_pio::output (ierr=0, message=<non-printable buffer contents omitted>, _message=256) at /glade/u/home/mizukami/model/mizuRoute/route/build/../build/src/write_simoutput_pio.f90:218 (at 0x5881bc)
#27 historyfile::sync (this=(...), ierr=0, message=<non-printable buffer contents omitted>, _message=256) at /glade/u/home/mizukami/model/mizuRoute/route/build/../build/src/historyFile.f90:354 (at 0x47c572)
#26 pio_utils::sync_file (piofiledesc=(...), ierr=0, message=<non-printable buffer contents omitted>, _message=256) at /glade/u/home/mizukami/model/mizuRoute/route/build/../build/src/pio_utils.f90:409 (at 0x43578e)
#25 piolib_mod::syncfile (file=(...)) at /glade/u/home/mizukami/model/mizuRoute/libraries/parallelio/src/flib/piolib_mod.F90:1470 (at 0x6e5e5a)
#24 PIOc_sync (ncid=129) at /glade/u/home/mizukami/model/mizuRoute/libraries/parallelio/src/clib/pio_file.c:422 (at 0x76f51a)
#23 flush_buffer (ncid=129, wmb=0x1871f970, flushtodisk=true) at /glade/u/home/mizukami/model/mizuRoute/libraries/parallelio/src/clib/pio_darray_int.c:1820 (at 0x7a9af0)
#22 PIOc_write_darray_multi (ncid=129, varids=0x1b5a8020, ioid=512, nvars=5, arraylen=42191, array=0x125066f0, frame=0x19175c40, fillvalue=0x0, flushtodisk=true) at /glade/u/home/mizukami/model/mizuRoute/libraries/parallelio/src/clib/pio_darray.c:420 (at 0x7a3b94)
#21 flush_output_buffer (file=0x190c47d0, force=true, addsize=0) at /glade/u/home/mizukami/model/mizuRoute/libraries/parallelio/src/clib/pio_darray_int.c:1765 (at 0x7a995a)
#20 ncmpi_wait_all () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x15425120f3cc)
#19 ncmpio_wait () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x1542512c1f9b)
#18 req_commit () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x1542512c1751)
#17 wait_getput () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x1542512c534c)
#16 req_aggregation () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x1542512c3781)
#15 mgetput () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x1542512c5d1a)
#14 ncmpio_read_write () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x1542512cb319)
#13 PMPI_File_write_at_all () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x1542500c9791)
#12 MPIOI_File_write_all () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x1542500c7e59)
#11 ADIOI_GPFS_WriteStridedColl () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x1542500d6216)
#10 ADIOI_GPFS_Calc_others_req () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x1542500cede3)
#9 PMPI_Alltoallv () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424db2e1ea)
#8 MPIR_Alltoallv_impl () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424db2d1f8)
#7 MPIR_Alltoallv_intra_auto () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424db2d096)
#6 MPIR_Alltoallv_intra_scattered () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424f5b8b82)
#5 MPIC_Waitall () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424f6db226)
#4 MPIR_Waitall () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424e97a22f)
#3 MPIR_Waitall_impl () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424e911dc1)
#2 MPIDI_SHMI_progress () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424ff0092f)
#1 MPIR_Cray_Memcpy_wrapper () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424ff3aea4)
#0 _cray_mpi_memcpy_rome () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x1542500a5f50)
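For anyone skimming a trace like the one above, it can help to group frames by the library or source file they come from. Below is a rough Python sketch, not part of mizuRoute or PIO: the regular expression and the function name frames_by_origin are my own, and it only handles frames that fit on a single line in this DDT-style layout. The sample frames are copied from the trace above.

```python
import re

# A few frames copied from the DDT trace above (innermost first).
TRACE = """\
#0 _cray_mpi_memcpy_rome () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x1542500a5f50)
#9 PMPI_Alltoallv () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x15424db2e1ea)
#13 PMPI_File_write_at_all () from /opt/cray/pe/mpich/8.1.27/ofi/gnu/9.1/lib/libmpi_gnu_91.so.12 (at 0x1542500c9791)
#20 ncmpi_wait_all () from /glade/u/apps/derecho/23.09/spack/opt/spack/parallel-netcdf/1.12.3/cray-mpich/8.1.27/gcc/12.2.0/sq5u/lib/libpnetcdf.so.4 (at 0x15425120f3cc)
#24 PIOc_sync (ncid=129) at /glade/u/home/mizukami/model/mizuRoute/libraries/parallelio/src/clib/pio_file.c:422 (at 0x76f51a)
"""

# Assumed layout: "#N function (args) from|at path (at 0xADDR)".
FRAME = re.compile(r"^#(\d+) (\S+) \(.*?\) (?:from|at) (\S+) \(at 0x[0-9a-f]+\)$")


def frames_by_origin(trace):
    """Map each library or source file name to the frame numbers found in it."""
    origins = {}
    for line in trace.splitlines():
        m = FRAME.match(line)
        if m is None:
            continue  # skip anything that is not a single-line frame
        number, _func, path = m.groups()
        # Keep only the file name and drop a trailing ":line" if present.
        name = path.rsplit("/", 1)[-1].split(":")[0]
        origins.setdefault(name, []).append(int(number))
    return origins


origins = frames_by_origin(TRACE)
for name, numbers in origins.items():
    print(name, numbers)
```

On the full trace this kind of grouping makes it easy to see that the innermost frames (#0 through #13) are all inside the cray-mpich library, reached through pnetcdf from PIOc_sync.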
Hi @ekluzek, I heard about some pnetcdf issues in CESM I/O during the CESM workshop (I believe at the CSEG working group and at the ultra-high-resolution modeling session). Coincidentally, I noticed that the output error in mizuRoute happens when PIO is built with pnetcdf support. When PIO is built without pnetcdf (netcdf only), mizuRoute PIO output is stable. Note that the error happens only for PIO built with gnu and cray-mpich.
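To see what the failing cases have in common, the combinations reported in this thread can be tabulated. This is only a rough Python sketch: the tuple layout and the function name shared_by_failures are made up for illustration, and pnetcdf is None wherever a post did not say whether PIO was built with it. The two failing rows assume pnetcdf was enabled because it appears in the listed configurations.

```python
# Combinations reported in this thread:
# (compiler, MPI library, PIO built with pnetcdf?, outcome).
# pnetcdf=None means the post did not state it; True on the failing rows
# is an assumption based on pnetcdf appearing in the listed configuration.
reports = [
    ("gnu", "mpt", True, "segfault"),
    ("intel", "mpt", None, "ok"),
    ("gnu", "openmpi", None, "ok"),
    ("gnu", "cray-mpich", True, "segfault"),
    ("intel", "cray-mpich", None, "ok"),
    ("gnu", "cray-mpich", False, "ok"),
]


def shared_by_failures(rows):
    """Return the settings that have the same value in every failing row."""
    failing = [r for r in rows if r[3] != "ok"]
    names = ("compiler", "mpi", "pnetcdf")
    shared = {}
    for i, name in enumerate(names):
        values = {r[i] for r in failing}
        if len(values) == 1:
            shared[name] = values.pop()
    return shared


common = shared_by_failures(reports)
print(common)
```

With these entries, the failing runs share the GNU compiler and pnetcdf but not the MPI library (MPT in one case, cray-mpich in the other).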
@nmizukami, looking at both the ParallelIO and pnetcdf GitHub pages, I don't see an issue that might explain this.
Can you figure out which talks covered this? Then we could watch the videos and find where they discuss it, which might give more context for figuring out where this problem is being discussed.