HDF5 (1.8.18?) from Bilder causing crashes in Puffin on Ubuntu 16.04
When building Puffin on Ubuntu 16.04 with the repo compilers and the Bilder-built HDF5 and FFTW3 libs, the HDF5 writing routines crash at runtime. This is the same as a previous issue, which was assumed fixed - it still appears to be a problem.
The workaround for now is to build with the Ubuntu repo libs.
Does this happen with all files, or is there a specific test input file, number of ranks, and machine where we can reproduce the problem? This sounds like mixing of libs - did you use the system OpenMPI or Bilder's OpenMPI? Trying to pin this down, as I've not seen the crash.
This is using the system OpenMPI with GNU Fortran, plus the Bilder-supplied HDF5 (v1.8.13, though the same behaviour has now been confirmed on all Bilderized versions from 1.8.12 to 1.8.18) and CMake (v3.4.1), on Ubuntu 16.04. Bilder uses the system OpenMPI and Fortran libs to build everything. Everything builds, but at runtime we get the output below.
step size is --- 4.1887903213500980E-003
******************************
WARNING - field mesh may not be large enough in z2 - fixing....
Field mesh length in z2 now = 13.005309677124023
******************************
number of nodes in z2 --- 1657
240 Step(s) and z-bar 1.0053
There are no dispersive sections
TRANS AREA = 6.2831854820251465
FIXING CHARGE
Q = 7.2413410011642439E-009
SHOT-NOISE TURNED ON
-----------------------------------------
Total number of macroparticles = 800
Avg num of real electrons per macroparticle Nk = 8991697.7115548234
Total number of real electrons modelled = 7193358169.2438583
287
[sebastion:2513] *** An error occurred in MPI_Comm_dup
[sebastion:2513] *** reported by process [2800222209,0]
[sebastion:2513] *** on communicator MPI_COMM_WORLD
[sebastion:2513] *** MPI_ERR_COMM: invalid communicator
[sebastion:2513] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[sebastion:2513] *** and potentially your MPI job)
[sebastion:02507] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[sebastion:02507] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
A bit of fishing around with print statements shows that the MPI error is coming from the calls to h5pset_fapl_mpio_f, i.e.
CALL h5pset_fapl_mpio_f(plist_id, tProcInfo_G%comm, mpiinfo, error)
which appears on multiple lines in hdf5PuffColl.f90.
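For context, that call sits inside the standard parallel-HDF5 file-access setup. The sketch below follows the usual pattern from the HDF5 Fortran API; the surrounding variable names and the filename are illustrative, not necessarily Puffin's actual code, though tProcInfo_G%comm and h5pset_fapl_mpio_f are from the source above.

```fortran
! Sketch of the usual parallel-HDF5 file-access setup (illustrative only;
! only h5pset_fapl_mpio_f and tProcInfo_G%comm are taken from Puffin itself).
INTEGER(HID_T) :: plist_id, file_id
INTEGER        :: error, mpiinfo

mpiinfo = MPI_INFO_NULL

! Create a file-access property list and attach the MPI communicator to it.
CALL h5pcreate_f(H5P_FILE_ACCESS_F, plist_id, error)
CALL h5pset_fapl_mpio_f(plist_id, tProcInfo_G%comm, mpiinfo, error)

! HDF5 duplicates the communicator internally (hence MPI_Comm_dup in the
! traceback above). If the handle is invalid, or the HDF5 lib was built
! against a different MPI than the one linked at runtime, the dup fails
! with MPI_ERR_COMM.
CALL h5fcreate_f("puffin_out.h5", H5F_ACC_TRUNC_F, file_id, error, &
                 access_prp=plist_id)
```

Since HDF5 performs the MPI_Comm_dup itself, a failure here is consistent with either a stale/invalid communicator handle or an MPI ABI mismatch between the MPI that Bilder's HDF5 was built against and the one Puffin links at runtime.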
This happens for any example Puffin input deck (using HDF5 output, which is now the default).
The Ubuntu-supplied HDF5 seems to work fine.
So I think the other workaround has to be to use Bilder to build MPICH or OpenMPI, rather than using Ubuntu's system MPI. It would also be interesting to know whether this manifests on Fedora. If it happens with Bilder's MPICH as well, then we need to check that no nasty bugs have got into the HDF5 MPI environment setup, which is what is going on at this stage (pset = property setting). There is a slim possibility that what I've done is only appropriate for a parallel filesystem, but I don't think that's the case - the "independent files" option was there to take care of that case.
... all this is of course speculative, and we should do some testing to be sure. I used Bilder's MPICH as part of a Puffin build (though with an older branch) and did not experience such problems.