
t_pmulti_dset hangs on Fedora Rawhide aarch64 with mpich

Open opoplawski opened this issue 2 years ago • 5 comments

Describe the bug

Trying to build the hdf5_1_14 branch on Fedora Rawhide aarch64. The Koji builder appears to hang at:

make[4]: Leaving directory '/builddir/build/BUILD/hdf5-hdf5_1_14/mpich/testpar'
make[4]: Entering directory '/builddir/build/BUILD/hdf5-hdf5_1_14/mpich/testpar'
============================
Testing: t_pmulti_dset 

The test's alarm timeout does not appear to fire either.

Platform (please complete the following information)

  • HDF5 version hdf5_1_14 from Oct 20, 2023
  • OS and version Fedora Rawhide
  • Compiler and version gcc 13.2.1
  • Build system (e.g. CMake, Autotools) and version - autotools
  • Any configure options you specified
+ ../configure --build=aarch64-redhat-linux-gnu --host=aarch64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/usr --exec-prefix=/usr --bindir=/usr/bin --sbindir=/usr/sbin --sysconfdir=/etc --datadir=/usr/share --includedir=/usr/include --libdir=/usr/lib64 --libexecdir=/usr/libexec --localstatedir=/var --runstatedir=/run --sharedstatedir=/var/lib --mandir=/usr/share/man --infodir=/usr/share/info --disable-silent-rules --enable-fortran --enable-hl --enable-shared --with-szlib CC=mpicc CXX=mpicxx F9X=mpif90 'FCFLAGS=-O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Wp,-U_FORTIFY_SOURCE,-D_FORTIFY_SOURCE=3 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -mbranch-protection=standard -fasynchronous-unwind-tables -fstack-clash-protection -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -I/usr/lib64/gfortran/modules  -I/usr/lib64/gfortran/modules/mpich' --enable-parallel --exec-prefix=/usr/lib64/mpich --libdir=/usr/lib64/mpich/lib --bindir=/usr/lib64/mpich/bin --sbindir=/usr/lib64/mpich/sbin --includedir=/usr/include/mpich-aarch64 --datarootdir=/usr/lib64/mpich/share --mandir=/usr/lib64/mpich/share/man --with-default-plugindir=/usr/lib64/mpich/hdf5/plugin

  • MPI library and version (parallel HDF5) mpich-4.1.2

opoplawski commented on Oct 25 '23

I'm not seeing this with latest hdf5_1_14 and latest Fedora Rawhide.

opoplawski commented on Mar 29 '24

I think this may be intermittent. Seen again now with hdf5 1.14.5 and mpich 4.2.2.

opoplawski commented on Oct 20 '24

Now seen once with hdf5 1.14.6 on ppc64le, mpich 4.2.2.

opoplawski commented on Feb 13 '25
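
Because the hang is intermittent, one way to catch it outside Koji is to rerun the test in a loop under an external timeout. The following is a minimal sketch, not part of the HDF5 test harness: the helper name `first_hang`, the attempt count, and the timeout are all hypothetical, and it assumes a POSIX system where the launcher and its ranks can be killed as one process group.

```python
import os
import signal
import subprocess


def first_hang(cmd, attempts=20, timeout_s=600):
    """Run cmd up to `attempts` times.

    Returns the 1-based attempt number of the first run that exceeded
    `timeout_s` seconds, or None if every run finished in time.
    """
    for attempt in range(1, attempts + 1):
        # Start the command in its own session so that the launcher and
        # any child processes it spawns share one process group.
        proc = subprocess.Popen(cmd, start_new_session=True)
        try:
            proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # Treat a timeout as a hang: kill the whole group, then reap.
            try:
                os.killpg(proc.pid, signal.SIGKILL)
            except ProcessLookupError:
                pass  # the group exited between the timeout and the kill
            proc.wait()
            return attempt
    return None
```

A call such as `first_hang(["mpiexec", "-n", "6", "./t_pmulti_dset"])` would be a typical use; the launcher name, rank count, and binary path there are assumptions about the local build, not values taken from this report. To inspect a hung run with a debugger, the kill step would need to be replaced with a pause.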

Thanks for the report @opoplawski. Would it be possible to try with MPICH 4.3.0 to determine whether this is an MPICH issue? Since you have now seen it with both 4.1.2 and 4.2.2, I'm assuming it's our issue, but it's good to be sure. Also, do you happen to see any warnings in the log when building that test? If so, would you be able to either post them here or upload the log? Similar to #2510, this could be due to some assumptions in the test code.

jhendersonHDF commented on Feb 13 '25

#2510 is an allocation issue. I'll create a PR for the fix next week.

derobins commented on Feb 14 '25