Test hang on Fedora Rawhide s390x with mpich

Open · opoplawski opened this issue 2 years ago · 10 comments

To report a non-security related issue, please provide:

  • the version of the software with which you are encountering an issue - 4.5.4

  • environmental information (e.g., Operating System, compiler info, java version, python version, etc.) - Fedora Rawhide

I'm working on updating the Fedora netcdf-fortran package to version 4.5.4. However, a test is hanging when run on s390x (other arches are fine). This is with:

mpich-3.4.1-3.fc35.s390x
hdf5-mpich-1.12.1-1.fc36.s390x
netcdf-mpich-4.8.1-2.fc36.s390x

The test program is nf03_test4/f90tst_parallel_compressed and is run with:

mpiexec -np 4 ./f90tst_parallel_compressed
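
A hang like this blocks the build indefinitely; one way to make it fail visibly instead is to wrap the run in coreutils timeout (a minimal sketch; the 300-second limit is an arbitrary choice):

# Turn an indefinite hang into a visible failure.
timeout 300 mpiexec -np 4 ./f90tst_parallel_compressed
echo "exit status: $?"   # 124 means timeout killed a hung run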

Here's a backtrace where it seems to be stuck:

#0  0x000003ffb14fbc94 in MPID_nem_tcp_connpoll (in_blocking_poll=<optimized out>) at src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c:1822
#1  0x000003ffb14ef78c in MPID_nem_network_poll (in_blocking_progress=1) at src/mpid/ch3/channels/nemesis/src/mpid_nem_network_poll.c:11
#2  MPID_nem_mpich_blocking_recv (completions=615, in_fbox=<synthetic pointer>, cell=<synthetic pointer>) at ./src/mpid/ch3/channels/nemesis/include/mpid_nem_inline.h:964
#3  MPIDI_CH3I_Progress (progress_state=0x3ffdf66c168, is_blocking=<optimized out>) at src/mpid/ch3/channels/nemesis/src/ch3_progress.c:363
#4  0x000003ffb1b2ecc0 in MPIR_Wait_state (status=<optimized out>, state=0x3ffdf66c24c, request_ptr=0x3ffb1d17510 <MPIR_Request_direct>) at src/mpi/request/wait.c:32
#5  MPIR_Wait_impl.isra.0 (request_ptr=request_ptr@entry=0x3ffb1d17510 <MPIR_Request_direct>, status=<optimized out>) at src/mpi/request/wait.c:55
#6  0x000003ffb1484ca0 in MPID_Wait (status=0x1, request_ptr=0x3ffb1d17510 <MPIR_Request_direct>) at ./src/mpid/ch3/include/mpidpost.h:264
#7  MPIC_Wait (request_ptr=0x3ffb1d17510 <MPIR_Request_direct>, errflag=errflag@entry=0x3ffdf66c8ec) at src/mpi/coll/helper_fns.c:64
#8  0x000003ffb1485a04 in MPIC_Sendrecv (sendbuf=<optimized out>, sendcount=1, sendtype=<optimized out>, dest=<optimized out>, sendtag=<optimized out>, recvbuf=0x2aa1bf8f8b0, recvcount=1, recvtype=1275069445, source=1, recvtag=14, 
    comm_ptr=0x3ffae42f690, status=<optimized out>, errflag=0x3ffdf66c8ec) at src/mpi/coll/helper_fns.c:333
#9  0x000003ffb141fe4e in MPIR_Allreduce_intra_recursive_doubling (sendbuf=<optimized out>, recvbuf=0x3ffdf66c9e0, count=<optimized out>, datatype=<optimized out>, op=<optimized out>, comm_ptr=0x3ffae42f690, errflag=0x3ffdf66c8ec)
    at src/mpi/coll/allreduce/allreduce_intra_recursive_doubling.c:127
#10 0x000003ffb1322a68 in MPIR_Allreduce_allcomm_auto (sendbuf=0x3ffdf66c9d0, sendbuf@entry=<error reading variable: value has been optimized out>, recvbuf=0x3ffdf66c9e0, 
    recvbuf@entry=<error reading variable: value has been optimized out>, count=count@entry=1, datatype=datatype@entry=1275069445, op=op@entry=1476395011, comm_ptr=0x3ffae42f690, errflag=0x3ffdf66c8ec)
    at src/mpi/coll/allreduce/allreduce.c:145
#11 0x000003ffb1322c28 in MPIR_Allreduce_impl (sendbuf=<error reading variable: value has been optimized out>, recvbuf=<error reading variable: value has been optimized out>, count=<optimized out>, datatype=<optimized out>, 
    op=op@entry=1476395011, comm_ptr=0x3ffae42f690, errflag=0x3ffdf66c8ec) at src/mpi/coll/allreduce/allreduce.c:210
#12 0x000003ffb1322d6a in MPID_Allreduce (errflag=0x3ffdf66c8ec, comm=0x3ffae42f690, op=op@entry=1476395011, datatype=<optimized out>, datatype@entry=1275069445, count=<optimized out>, count@entry=1, 
    recvbuf=<error reading variable: value has been optimized out>, sendbuf=<error reading variable: value has been optimized out>) at ./src/mpid/ch3/include/mpid_coll.h:41
#13 MPIR_Allreduce (sendbuf=<error reading variable: value has been optimized out>, recvbuf=<error reading variable: value has been optimized out>, count=count@entry=1, datatype=datatype@entry=1275069445, op=op@entry=1476395011, 
    comm_ptr=0x3ffae42f690, errflag=0x3ffdf66c8ec) at src/mpi/coll/allreduce/allreduce.c:262
#14 0x000003ffb1323742 in PMPI_Allreduce (sendbuf=<optimized out>, sendbuf@entry=<error reading variable: value has been optimized out>, recvbuf=<optimized out>, recvbuf@entry=0x3ffdf66c9e0, count=count@entry=1, 
    datatype=datatype@entry=1275069445, op=op@entry=1476395011, comm=-1006632958) at src/mpi/coll/allreduce/allreduce.c:387
#15 0x000003ffb0e11644 in H5D__mpio_array_gatherv.constprop.0 (local_array=<optimized out>, local_array@entry=0x2aa1bfac440, local_array_num_entries=<optimized out>, _gathered_array=<optimized out>, 
    _gathered_array@entry=0x3ffdf66cb78, _gathered_array_num_entries=_gathered_array_num_entries@entry=0x3ffdf66cb88, allgather=allgather@entry=true, comm=-1006632958, sort_func=0x0, root=0, array_entry_size=368)
    at ../../src/H5Dmpio.c:552
#16 0x000003ffb0e086cc in H5D__link_chunk_filtered_collective_io (fm=0x2aa1bef9c60, type_info=0x3ffdf66ce00, io_info=0x3ffdf66ce80) at ../../src/H5Dmpio.c:1409
#17 H5D__chunk_collective_io (io_info=0x3ffdf66ce80, type_info=0x3ffdf66ce00, fm=fm@entry=0x2aa1bef9c60) at ../../src/H5Dmpio.c:886
#18 0x000003ffb0e0abb4 in H5D__chunk_collective_write (io_info=<optimized out>, type_info=<optimized out>, nelmts=<optimized out>, file_space=<optimized out>, mem_space=0x2aa1bf8fac0, fm=0x2aa1bef9c60) at ../../src/H5Dmpio.c:967
#19 0x000003ffb0b78702 in H5D__write (dataset=0x2aa1bf0cda0, mem_type_id=<optimized out>, mem_space=<optimized out>, file_space=0x2aa1beee270, buf=<optimized out>) at ../../src/H5Dio.c:761
#20 0x000003ffb0dd9598 in H5VL__native_dataset_write (obj=0x2aa1bf0cda0, mem_type_id=216172782113784251, mem_space_id=<optimized out>, file_space_id=288230376151711833, dxpl_id=<optimized out>, buf=0x3ffa41b9010, req=0x0)
    at ../../src/H5VLnative_dataset.c:206
#21 0x000003ffb0db7b72 in H5VL__dataset_write (obj=<optimized out>, cls=<optimized out>, mem_type_id=mem_type_id@entry=216172782113784251, mem_space_id=mem_space_id@entry=288230376151711834, 
    file_space_id=file_space_id@entry=288230376151711833, dxpl_id=792633534417207358, buf=0x3ffa41b9010, req=0x0) at ../../src/H5VLcallback.c:2079
#22 0x000003ffb0dc1b9a in H5VL_dataset_write (vol_obj=0x2aa1bf0cd00, mem_type_id=216172782113784251, mem_space_id=288230376151711834, file_space_id=288230376151711833, dxpl_id=<optimized out>, buf=0x3ffa41b9010, req=0x0)
    at ../../src/H5VLcallback.c:2111
#23 0x000003ffb0b7506c in H5Dwrite (dset_id=<optimized out>, mem_type_id=216172782113784251, mem_space_id=288230376151711834, file_space_id=288230376151711833, dxpl_id=792633534417207358, buf=0x3ffa41b9010) at ../../src/H5Dio.c:289
#24 0x000003ffb1f8fdc8 in NC4_put_vars (ncid=<optimized out>, varid=0, startp=0x0, countp=0x2aa1bf0e720, stridep=0x3ffdf66f388, data=<optimized out>, mem_nc_type=5) at ../libhdf5/../../libhdf5/hdf5var.c:1594
#25 0x000003ffb1f90148 in NC4_put_vara (ncid=<optimized out>, varid=<optimized out>, startp=<optimized out>, countp=<optimized out>, op=<optimized out>, memtype=5) at ../libhdf5/../../libhdf5/hdf5var.c:1178
#26 0x000003ffb1f3b96c in NC_put_vara (ncid=<optimized out>, varid=<optimized out>, start=0x2aa1bf8f620, edges=0x2aa1bf8f910, value=value@entry=0x3ffa41b9010, memtype=5) at ../libdispatch/../../libdispatch/dvarput.c:97
#27 0x000003ffb1f3be08 in nc_put_vara_float (ncid=<optimized out>, varid=<optimized out>, startp=<optimized out>, countp=<optimized out>, op=0x3ffa41b9010) at ../libdispatch/../../libdispatch/dvarput.c:695
#28 0x000003ffb269464e in nf_put_vara_real_ (ncid=<optimized out>, varid=<optimized out>, start=<error reading variable: value has been optimized out>, counts=<error reading variable: value has been optimized out>, rvals=...)
    at ../../fortran/nf_varaio.F90:372
#29 0x000003ffb26ede9e in netcdf::nf90_put_var_4d_fourbytereal (ncid=<error reading variable: value has been optimized out>, varid=<error reading variable: value has been optimized out>, 
    values=<error reading variable: value has been optimized out>, start=<error reading variable: value has been optimized out>, 
    count=<error reading variable: value requires 27021597768417284 bytes, which is more than max-value-size>, stride=<error reading variable: value has been optimized out>, map=<error reading variable: value has been optimized out>)
    at netcdf_expanded.f90:1035
#30 0x000002aa1b1840ba in f90tst_parallel_compressed () at ../../nf03_test4/f90tst_parallel_compressed.F90:255
#31 0x000002aa1b182588 in main (argc=<optimized out>, argv=<optimized out>) at ../../nf03_test4/f90tst_parallel_compressed.F90:14

It seems to be stuck in MPID_nem_mpich_blocking_recv.
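
For reference, a backtrace like the one above can be captured by attaching gdb to the hung ranks; a rough sketch (the pgrep pattern is an assumption about the process name, and attaching may require ptrace permissions):

# Dump all-thread backtraces from every hung rank of the test.
for pid in $(pgrep -f f90tst_parallel_compressed); do
    gdb -p "$pid" -batch -ex 'thread apply all bt' > "bt.$pid.txt" 2>&1
done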

I have no idea how to debug this further.

It does appear to have been introduced between netcdf-fortran 4.5.2 and 4.5.4. I can still build and run the tests for 4.5.2.

opoplawski · Jan 14 '22

I will get the appropriate environment up and running, see if I can replicate this, and then (hopefully) start debugging. Thanks!

WardF · Jan 14 '22

I see now that f90tst_parallel_compressed is new in 4.5.4, so it's hard to say whether something changed to start triggering this or whether this test would always have hung.

Also, I realized that I never packaged 4.5.3, so I tried that. With it, I seem to be seeing a test hang on 32-bit ARM with mpich, but not on any other arches. Fun.

opoplawski · Jan 14 '22

Thanks as always, @opoplawski. I'm bouncing between a few issues as well as trying to address some things that have cropped up over in netcdf-c. A lot of yak-shaving thus far, as I'm setting up an array of dev environments to tackle what appear to be platform/architecture-specific issues.

Thanks!

WardF · Jan 14 '22

Disregard the comment about 4.5.3 on ARM; that test did complete.

opoplawski · Jan 15 '22

Is this still happening with the new release?

edwardhartnett · Jan 16 '22

This is happening with 4.5.4, which is the latest release.

opoplawski · Jan 16 '22

This may be unrelated, but it involves the same test. While testing our conda-forge build, I found that the extension for this file in CMakeLists.txt is incorrect: https://github.com/Unidata/netcdf-fortran/blob/v4.5.4/nf03_test4/CMakeLists.txt#L51. It should be F90, not f90. Tests on some of our platforms failed without a patch to fix this. I'm wondering whether the issue might be related to CMake guessing an incorrect file type, though this could also be a red herring.
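
For anyone needing a stopgap until this is fixed upstream, the patch amounts to a one-line extension rename; a sketch (line number per the v4.5.4 tag):

# Fix the source extension on line 51 of the test CMakeLists: .f90 -> .F90
sed -i '51s/\.f90/.F90/' nf03_test4/CMakeLists.txt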

xylar · Jan 20 '22

I don't think it's relevant here, because we are doing autotools builds, not CMake, and the extension is correct in Makefile.am.

opoplawski · Jan 20 '22

For the record, the issue actually breaks CMake builds:

CMake Error at CMakeLists.txt:428 (ADD_EXECUTABLE):
  Cannot find source file:

    f90tst_parallel_compressed.f90

  Tried extensions .c .C .c++ .cc .cpp .cxx .cu .mpp .m .M .mm .ixx .cppm .h
  .hh .h++ .hm .hpp .hxx .in .txx .f .F .for .f77 .f90 .f95 .f03 .hip .ispc
Call Stack (most recent call first):
  nf03_test4/CMakeLists.txt:51 (build_bin_test)


CMake Error at CMakeLists.txt:428 (ADD_EXECUTABLE):
  No SOURCES given to target: f90tst_parallel_compressed
Call Stack (most recent call first):
  nf03_test4/CMakeLists.txt:51 (build_bin_test)

So yeah, you would not see the test hang when building with CMake, because the test never gets built at all.

ArchangeGabriel · Mar 20 '22

(I've created https://github.com/Unidata/netcdf-fortran/pull/339 for that.)

ArchangeGabriel · Mar 20 '22