Test hang on Fedora rawhide s390x with mpich
To report a non-security related issue, please provide:
- the version of the software with which you are encountering an issue: 4.5.4
- environmental information (e.g. Operating System, compiler info, java version, python version, etc.): Fedora Rawhide
I'm working on updating the Fedora netcdf-fortran package to version 4.5.4. However, a test is hanging when run on s390x (other arches are fine). This is with:
mpich-3.4.1-3.fc35.s390x hdf5-mpich-1.12.1-1.fc36.s390x netcdf-mpich-4.8.1-2.fc36.s390x
The test program is nf03_test4/f90tst_parallel_compressed and is run with:
mpiexec -np 4 ./f90tst_parallel_compressed
Here's a backtrace where it seems to be stuck:
#0 0x000003ffb14fbc94 in MPID_nem_tcp_connpoll (in_blocking_poll=<optimized out>) at src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c:1822
#1 0x000003ffb14ef78c in MPID_nem_network_poll (in_blocking_progress=1) at src/mpid/ch3/channels/nemesis/src/mpid_nem_network_poll.c:11
#2 MPID_nem_mpich_blocking_recv (completions=615, in_fbox=<synthetic pointer>, cell=<synthetic pointer>) at ./src/mpid/ch3/channels/nemesis/include/mpid_nem_inline.h:964
#3 MPIDI_CH3I_Progress (progress_state=0x3ffdf66c168, is_blocking=<optimized out>) at src/mpid/ch3/channels/nemesis/src/ch3_progress.c:363
#4 0x000003ffb1b2ecc0 in MPIR_Wait_state (status=<optimized out>, state=0x3ffdf66c24c, request_ptr=0x3ffb1d17510 <MPIR_Request_direct>) at src/mpi/request/wait.c:32
#5 MPIR_Wait_impl.isra.0 (request_ptr=request_ptr@entry=0x3ffb1d17510 <MPIR_Request_direct>, status=<optimized out>) at src/mpi/request/wait.c:55
#6 0x000003ffb1484ca0 in MPID_Wait (status=0x1, request_ptr=0x3ffb1d17510 <MPIR_Request_direct>) at ./src/mpid/ch3/include/mpidpost.h:264
#7 MPIC_Wait (request_ptr=0x3ffb1d17510 <MPIR_Request_direct>, errflag=errflag@entry=0x3ffdf66c8ec) at src/mpi/coll/helper_fns.c:64
#8 0x000003ffb1485a04 in MPIC_Sendrecv (sendbuf=<optimized out>, sendcount=1, sendtype=<optimized out>, dest=<optimized out>, sendtag=<optimized out>, recvbuf=0x2aa1bf8f8b0, recvcount=1, recvtype=1275069445, source=1, recvtag=14,
comm_ptr=0x3ffae42f690, status=<optimized out>, errflag=0x3ffdf66c8ec) at src/mpi/coll/helper_fns.c:333
#9 0x000003ffb141fe4e in MPIR_Allreduce_intra_recursive_doubling (sendbuf=<optimized out>, recvbuf=0x3ffdf66c9e0, count=<optimized out>, datatype=<optimized out>, op=<optimized out>, comm_ptr=0x3ffae42f690, errflag=0x3ffdf66c8ec)
at src/mpi/coll/allreduce/allreduce_intra_recursive_doubling.c:127
#10 0x000003ffb1322a68 in MPIR_Allreduce_allcomm_auto (sendbuf=0x3ffdf66c9d0, sendbuf@entry=<error reading variable: value has been optimized out>, recvbuf=0x3ffdf66c9e0,
recvbuf@entry=<error reading variable: value has been optimized out>, count=count@entry=1, datatype=datatype@entry=1275069445, op=op@entry=1476395011, comm_ptr=0x3ffae42f690, errflag=0x3ffdf66c8ec)
at src/mpi/coll/allreduce/allreduce.c:145
#11 0x000003ffb1322c28 in MPIR_Allreduce_impl (sendbuf=<error reading variable: value has been optimized out>, recvbuf=<error reading variable: value has been optimized out>, count=<optimized out>, datatype=<optimized out>,
op=op@entry=1476395011, comm_ptr=0x3ffae42f690, errflag=0x3ffdf66c8ec) at src/mpi/coll/allreduce/allreduce.c:210
#12 0x000003ffb1322d6a in MPID_Allreduce (errflag=0x3ffdf66c8ec, comm=0x3ffae42f690, op=op@entry=1476395011, datatype=<optimized out>, datatype@entry=1275069445, count=<optimized out>, count@entry=1,
recvbuf=<error reading variable: value has been optimized out>, sendbuf=<error reading variable: value has been optimized out>) at ./src/mpid/ch3/include/mpid_coll.h:41
#13 MPIR_Allreduce (sendbuf=<error reading variable: value has been optimized out>, recvbuf=<error reading variable: value has been optimized out>, count=count@entry=1, datatype=datatype@entry=1275069445, op=op@entry=1476395011,
comm_ptr=0x3ffae42f690, errflag=0x3ffdf66c8ec) at src/mpi/coll/allreduce/allreduce.c:262
#14 0x000003ffb1323742 in PMPI_Allreduce (sendbuf=<optimized out>, sendbuf@entry=<error reading variable: value has been optimized out>, recvbuf=<optimized out>, recvbuf@entry=0x3ffdf66c9e0, count=count@entry=1,
datatype=datatype@entry=1275069445, op=op@entry=1476395011, comm=-1006632958) at src/mpi/coll/allreduce/allreduce.c:387
#15 0x000003ffb0e11644 in H5D__mpio_array_gatherv.constprop.0 (local_array=<optimized out>, local_array@entry=0x2aa1bfac440, local_array_num_entries=<optimized out>, _gathered_array=<optimized out>,
_gathered_array@entry=0x3ffdf66cb78, _gathered_array_num_entries=_gathered_array_num_entries@entry=0x3ffdf66cb88, allgather=allgather@entry=true, comm=-1006632958, sort_func=0x0, root=0, array_entry_size=368)
at ../../src/H5Dmpio.c:552
#16 0x000003ffb0e086cc in H5D__link_chunk_filtered_collective_io (fm=0x2aa1bef9c60, type_info=0x3ffdf66ce00, io_info=0x3ffdf66ce80) at ../../src/H5Dmpio.c:1409
#17 H5D__chunk_collective_io (io_info=0x3ffdf66ce80, type_info=0x3ffdf66ce00, fm=fm@entry=0x2aa1bef9c60) at ../../src/H5Dmpio.c:886
#18 0x000003ffb0e0abb4 in H5D__chunk_collective_write (io_info=<optimized out>, type_info=<optimized out>, nelmts=<optimized out>, file_space=<optimized out>, mem_space=0x2aa1bf8fac0, fm=0x2aa1bef9c60) at ../../src/H5Dmpio.c:967
#19 0x000003ffb0b78702 in H5D__write (dataset=0x2aa1bf0cda0, mem_type_id=<optimized out>, mem_space=<optimized out>, file_space=0x2aa1beee270, buf=<optimized out>) at ../../src/H5Dio.c:761
#20 0x000003ffb0dd9598 in H5VL__native_dataset_write (obj=0x2aa1bf0cda0, mem_type_id=216172782113784251, mem_space_id=<optimized out>, file_space_id=288230376151711833, dxpl_id=<optimized out>, buf=0x3ffa41b9010, req=0x0)
at ../../src/H5VLnative_dataset.c:206
#21 0x000003ffb0db7b72 in H5VL__dataset_write (obj=<optimized out>, cls=<optimized out>, mem_type_id=mem_type_id@entry=216172782113784251, mem_space_id=mem_space_id@entry=288230376151711834,
file_space_id=file_space_id@entry=288230376151711833, dxpl_id=792633534417207358, buf=0x3ffa41b9010, req=0x0) at ../../src/H5VLcallback.c:2079
#22 0x000003ffb0dc1b9a in H5VL_dataset_write (vol_obj=0x2aa1bf0cd00, mem_type_id=216172782113784251, mem_space_id=288230376151711834, file_space_id=288230376151711833, dxpl_id=<optimized out>, buf=0x3ffa41b9010, req=0x0)
at ../../src/H5VLcallback.c:2111
#23 0x000003ffb0b7506c in H5Dwrite (dset_id=<optimized out>, mem_type_id=216172782113784251, mem_space_id=288230376151711834, file_space_id=288230376151711833, dxpl_id=792633534417207358, buf=0x3ffa41b9010) at ../../src/H5Dio.c:289
#24 0x000003ffb1f8fdc8 in NC4_put_vars (ncid=<optimized out>, varid=0, startp=0x0, countp=0x2aa1bf0e720, stridep=0x3ffdf66f388, data=<optimized out>, mem_nc_type=5) at ../libhdf5/../../libhdf5/hdf5var.c:1594
#25 0x000003ffb1f90148 in NC4_put_vara (ncid=<optimized out>, varid=<optimized out>, startp=<optimized out>, countp=<optimized out>, op=<optimized out>, memtype=5) at ../libhdf5/../../libhdf5/hdf5var.c:1178
#26 0x000003ffb1f3b96c in NC_put_vara (ncid=<optimized out>, varid=<optimized out>, start=0x2aa1bf8f620, edges=0x2aa1bf8f910, value=value@entry=0x3ffa41b9010, memtype=5) at ../libdispatch/../../libdispatch/dvarput.c:97
#27 0x000003ffb1f3be08 in nc_put_vara_float (ncid=<optimized out>, varid=<optimized out>, startp=<optimized out>, countp=<optimized out>, op=0x3ffa41b9010) at ../libdispatch/../../libdispatch/dvarput.c:695
#28 0x000003ffb269464e in nf_put_vara_real_ (ncid=<optimized out>, varid=<optimized out>, start=<error reading variable: value has been optimized out>, counts=<error reading variable: value has been optimized out>, rvals=...)
at ../../fortran/nf_varaio.F90:372
#29 0x000003ffb26ede9e in netcdf::nf90_put_var_4d_fourbytereal (ncid=<error reading variable: value has been optimized out>, varid=<error reading variable: value has been optimized out>,
values=<error reading variable: value has been optimized out>, start=<error reading variable: value has been optimized out>,
count=<error reading variable: value requires 27021597768417284 bytes, which is more than max-value-size>, stride=<error reading variable: value has been optimized out>, map=<error reading variable: value has been optimized out>)
at netcdf_expanded.f90:1035
#30 0x000002aa1b1840ba in f90tst_parallel_compressed () at ../../nf03_test4/f90tst_parallel_compressed.F90:255
#31 0x000002aa1b182588 in main (argc=<optimized out>, argv=<optimized out>) at ../../nf03_test4/f90tst_parallel_compressed.F90:14
It seems to be stuck in MPID_nem_mpich_blocking_recv.
I have no idea how to debug this further.
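One way to dig further (a sketch, assuming gdb and pgrep are available on the box; the process-matching pattern is an assumption and may need adjusting): attach a debugger to every rank of the hung job and compare where each one is blocked. In a collective hang, some ranks typically sit inside the MPI_Allreduce while at least one rank never reached it.

```shell
# Sketch: dump "thread apply all bt" from every rank of the hung run so the
# ranks' positions can be compared. The bracketed character in the pgrep
# pattern keeps pgrep from matching its own command line; the bt.<pid>.txt
# filename is just an illustrative choice.
for pid in $(pgrep -f 'f90tst_parallel_compresse[d]'); do
    gdb -batch -p "$pid" -ex "thread apply all bt" > "bt.$pid.txt" 2>&1
done
```

If all four ranks show the same Allreduce frame, the hang points more toward the transport layer; if one rank is blocked elsewhere (e.g. still inside the HDF5 chunk redistribution), that suggests mismatched collective participation.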
It does appear to have been introduced between netcdf-fortran 4.5.2 and 4.5.4. I can still build and run the tests for 4.5.2.
I will get the appropriate environment up and running and see if I can replicate and then (hopefully) start debugging. Thanks!
I see now that f90tst_parallel_compressed is new in 4.5.4, so hard to say if anything changed to start triggering this or if this test would always have hung.
Also, since I realized that I never packaged 4.5.3 I tried that. With that I seem to be seeing a test hang with 32-bit arm and mpich but not any other arches. Fun.
Thanks as always @opoplawski. I'm bouncing around between a few issues as well as trying to address some things that have cropped up over in netcdf-c. A lot of yak-shaving thus far, as I'm setting up an array of dev environments to try to tackle what appear to be platform/architecture-specific issues.
Thanks!
Disregard the comment about arm with 4.5.3 - that test did complete.
Is this still happening with the new release?
This is happening with 4.5.4 which is the latest.
This may be unrelated but happens to involve the same test. In the process of testing our conda-forge build, I found that the extension for this file in nf03_test4/CMakeLists.txt is incorrect:
https://github.com/Unidata/netcdf-fortran/blob/v4.5.4/nf03_test4/CMakeLists.txt#L51
It should be F90, not f90. Tests on some of our platforms failed without a patch to fix this. I'm wondering if the issue might be related to CMake guessing an incorrect file type. This could also be a red herring.
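For illustration, the mismatch boils down to referencing a preprocessed-Fortran source with the wrong case. A hypothetical minimal sketch (the real project adds the test through its own build_bin_test macro, and the library name here is just the usual netcdf-fortran target):

```cmake
# Hypothetical sketch: the file on disk is f90tst_parallel_compressed.F90
# (uppercase .F90 = Fortran run through the preprocessor), so CMake must be
# given the uppercase extension; on a case-sensitive filesystem a lowercase
# .f90 reference fails with "Cannot find source file".
add_executable(f90tst_parallel_compressed f90tst_parallel_compressed.F90)
target_link_libraries(f90tst_parallel_compressed netcdff)
```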
I don't think it's relevant here because we are doing autotools builds, not CMake, and the extension is correct in Makefile.am.
For the record, the issue actually breaks CMake builds:
CMake Error at CMakeLists.txt:428 (ADD_EXECUTABLE):
Cannot find source file:
f90tst_parallel_compressed.f90
Tried extensions .c .C .c++ .cc .cpp .cxx .cu .mpp .m .M .mm .ixx .cppm .h
.hh .h++ .hm .hpp .hxx .in .txx .f .F .for .f77 .f90 .f95 .f03 .hip .ispc
Call Stack (most recent call first):
nf03_test4/CMakeLists.txt:51 (build_bin_test)
CMake Error at CMakeLists.txt:428 (ADD_EXECUTABLE):
No SOURCES given to target: f90tst_parallel_compressed
Call Stack (most recent call first):
nf03_test4/CMakeLists.txt:51 (build_bin_test)
So yeah, you would not see the test failure if building with CMake.
(Created https://github.com/Unidata/netcdf-fortran/pull/339 for that)