openPMD-api
Crash with `HDF5` backend due to incompatible `MPI_File_sync` policy
Writing a dataset with openPMD-api to an NFS filesystem (work1 on Fermilab's Wilson cluster) as part of a simple fodo_cxx simulation fails:
```
HDF5-DIAG: Error detected in HDF5 (1.14.0) MPI-process 0:
#000: /tmp/sasyed/spack-stage/spack-stage-hdf5-1.14.0-fs2ybrgllsgpb3ltdwkbqv5a3seilqya/spack-src/src/H5D.c line 1390 in H5Dwrite(): can't synchronously write data
major: Dataset
minor: Write failed
#001: /tmp/sasyed/spack-stage/spack-stage-hdf5-1.14.0-fs2ybrgllsgpb3ltdwkbqv5a3seilqya/spack-src/src/H5D.c line 1333 in H5D__write_api_common(): can't write data
major: Dataset
minor: Write failed
#002: /tmp/sasyed/spack-stage/spack-stage-hdf5-1.14.0-fs2ybrgllsgpb3ltdwkbqv5a3seilqya/spack-src/src/H5VLcallback.c line 2282 in H5VL_dataset_write_direct(): dataset write failed
major: Virtual Object Layer
minor: Write failed
#003: /tmp/sasyed/spack-stage/spack-stage-hdf5-1.14.0-fs2ybrgllsgpb3ltdwkbqv5a3seilqya/spack-src/src/H5VLcallback.c line 2237 in H5VL__dataset_write(): dataset write failed
major: Virtual Object Layer
minor: Write failed
#004: /tmp/sasyed/spack-stage/spack-stage-hdf5-1.14.0-fs2ybrgllsgpb3ltdwkbqv5a3seilqya/spack-src/src/H5VLnative_dataset.c line 408 in H5VL__native_dataset_write(): can't write data
major: Dataset
minor: Write failed
#005: /tmp/sasyed/spack-stage/spack-stage-hdf5-1.14.0-fs2ybrgllsgpb3ltdwkbqv5a3seilqya/spack-src/src/H5Dio.c line 673 in H5D__write(): unable to adjust I/O info for parallel I/O
major: Dataset
minor: Unable to initialize object
#006: /tmp/sasyed/spack-stage/spack-stage-hdf5-1.14.0-fs2ybrgllsgpb3ltdwkbqv5a3seilqya/spack-src/src/H5Dio.c line 1211 in H5D__ioinfo_adjust(): Can't perform independent write when MPI_File_sync is required by ROMIO driver.
major: Dataset
minor: Can't perform independent IO
[AbstractIOHandlerImpl] IO Task WRITE_DATASET failed with exception. Clearing IO queue and passing on the exception.
[HDF5] Internal error: Failed to write dataset /data/0/particles/track_coords/moments/x
Abort(888) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 888) - process 0
srun: error: wcgwn013: task 0: Exited with exit code 120
```
First observed by @egstern.
Software Environment
- version of openPMD-api: 0.15.1
- installed openPMD-api via: spack
- operating system: centos-7
- machine: wilson-cluster@fermilab
- name and version of Python implementation: python-3.11
- version of HDF5: 1.14.0
- version of ADIOS2: 2.8.3
- name and version of MPI: MPICH-4.1.1
Spack spec: the full spec is here: https://gist.github.com/s-sajid-ali/0596b18b83400172067c417932b1a852
Additional Information
Let me know if I should try building openpmd-api with testing enabled (maybe adding a spack variant for it) to see whether that succeeds?
We tried setting `HDF5_DO_MPI_FILE_SYNC=1` and it did not help.
This issue does not occur on the Lustre filesystem on the same cluster, wclustre.
When using `openmpi@main~romio` (i.e., Open MPI built without ROMIO), no crash occurs for the same program on work1.
Full spec for OpenPMD is here : https://gist.github.com/s-sajid-ali/bbf0328c7fe1ecb3090d43fb3381f28c
Here is the result of `ompi_info`:
bash-4.2$ /wclustre/accelsim/spack-shared-v4/spack/opt/spack/linux-scientific7-ivybridge/gcc-12.2.0/openmpi-main-sfsdxb5khtvn7r4rkiw4tgfldqzru5jk/bin/ompi_info
Package: Open MPI [email protected] Distribution
Open MPI: 5.1.0a1
Open MPI repo revision: 67620f4
Open MPI release date: Unreleased developer copy
MPI API: 3.1.0
Ident string: 5.1.0a1
Prefix: /wclustre/accelsim/spack-shared-v4/spack/opt/spack/linux-scientific7-ivybridge/gcc-12.2.0/openmpi-main-sfsdxb5khtvn7r4rkiw4tgfldqzru5jk
Configured architecture: x86_64-pc-linux-gnu
Configured by: sasyed
Configured on: Fri Apr 14 22:26:36 UTC 2023
Configure host: wc.fnal.gov
Configure command line: '--prefix=/wclustre/accelsim/spack-shared-v4/spack/opt/spack/linux-scientific7-ivybridge/gcc-12.2.0/openmpi-main-sfsdxb5khtvn7r4rkiw4tgfldqzru5jk'
'--enable-shared' '--disable-silent-rules'
'--disable-builtin-atomics' '--enable-static'
'--enable-mpi1-compatibility' '--without-verbs'
'--without-mxm' '--without-knem' '--without-ofi'
'--without-fca' '--without-xpmem'
'--with-ucx=/wclustre/accelsim/spack-shared-v4/spack/opt/spack/linux-scientific7-ivybridge/gcc-12.2.0/ucx-1.14.0-ei775jwkepcnx5rsyyzy2bgbblbvb6d4'
'--without-hcoll' '--without-psm2' '--without-psm'
'--without-cma' '--without-cray-xpmem'
'--without-sge' '--without-tm'
'--without-loadleveler' '--without-alps'
'--with-slurm' '--without-lsf'
'--disable-memchecker'
'--with-libevent=/wclustre/accelsim/spack-shared-v4/spack/opt/spack/linux-scientific7-ivybridge/gcc-12.2.0/libevent-2.1.12-76cfua2fg4pcodffcampmyebxseait2l'
'--with-lustre=/usr'
'--with-zlib=/wclustre/accelsim/spack-shared-v4/spack/opt/spack/linux-scientific7-ivybridge/gcc-12.2.0/zlib-1.2.13-vyuxkpqw47jxvdvuth53a6ns3drcqf5b'
'--with-hwloc=/wclustre/accelsim/spack-shared-v4/spack/opt/spack/linux-scientific7-ivybridge/gcc-12.2.0/hwloc-2.9.0-mdsykxr4565r7v5bijkq252kzcjvffmd'
'--disable-java' '--disable-mpi-java'
'--disable-io-romio' '--with-gpfs=no'
'--without-cuda' '--enable-wrapper-rpath'
'--disable-wrapper-runpath'
'--with-wrapper-ldflags=-Wl,-rpath,/srv/software/gnu12/12.2.0/lib/gcc/x86_64-pc-linux-gnu/12.2.0
-Wl,-rpath,/srv/software/gnu12/12.2.0/lib64'
Built by: sasyed
Built on: Fri Apr 14 22:40:28 UTC 2023
Built host: wc.fnal.gov
C bindings: yes
Fort mpif.h: yes (all)
Fort use mpi: yes (full: ignore TKR)
Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: yes
Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
limitations in the
/wclustre/accelsim/spack-shared-v4/spack/lib/spack/env/gcc/gfortran
compiler and/or Open MPI, does not support the
following: array subsections, direct passthru
(where possible) to underlying Open MPI's C
functionality
Fort mpi_f08 subarrays: no
Java bindings: no
Wrapper compiler rpath: rpath
C compiler: /wclustre/accelsim/spack-shared-v4/spack/lib/spack/env/gcc/gcc
C compiler absolute: /wclustre/accelsim/spack-shared-v4/spack/lib/spack/env/gcc/gcc
C compiler family name: GNU
C compiler version: 12.2.0
C++ compiler: /wclustre/accelsim/spack-shared-v4/spack/lib/spack/env/gcc/g++
C++ compiler absolute: /wclustre/accelsim/spack-shared-v4/spack/lib/spack/env/gcc/g++
Fort compiler: /wclustre/accelsim/spack-shared-v4/spack/lib/spack/env/gcc/gfortran
Fort compiler abs: /wclustre/accelsim/spack-shared-v4/spack/lib/spack/env/gcc/gfortran
Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
Fort 08 assumed shape: yes
Fort optional args: yes
Fort INTERFACE: yes
Fort ISO_FORTRAN_ENV: yes
Fort STORAGE_SIZE: yes
Fort BIND(C) (all): yes
Fort ISO_C_BINDING: yes
Fort SUBROUTINE BIND(C): yes
Fort TYPE,BIND(C): yes
Fort T,BIND(C,name="a"): yes
Fort PRIVATE: yes
Fort ABSTRACT: yes
Fort ASYNCHRONOUS: yes
Fort PROCEDURE: yes
Fort USE...ONLY: yes
Fort C_FUNLOC: yes
Fort f08 using wrappers: yes
Fort MPI_SIZEOF: yes
C profiling: yes
Fort mpif.h profiling: yes
Fort use mpi profiling: yes
Fort use mpi_f08 prof: yes
Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
OMPI progress: no, Event lib: yes)
Sparse Groups: no
Internal debug support: no
MPI interface warnings: yes
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
dl support: yes
Heterogeneous support: no
MPI_WTIME support: native
Symbol vis. support: yes
Host topology support: yes
IPv6 support: no
MPI extensions: affinity, cuda, ftmpi, rocm, shortfloat
Fault Tolerance support: yes
FT MPI support: yes
MPI_MAX_PROCESSOR_NAME: 256
MPI_MAX_ERROR_STRING: 256
MPI_MAX_OBJECT_NAME: 64
MPI_MAX_INFO_KEY: 36
MPI_MAX_INFO_VAL: 256
MPI_MAX_PORT_NAME: 1024
MPI_MAX_DATAREP_STRING: 128
MCA accelerator: null (MCA v2.1.0, API v1.0.0, Component v5.1.0)
MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v5.1.0)
MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v5.1.0)
MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v5.1.0)
MCA btl: self (MCA v2.1.0, API v3.3.0, Component v5.1.0)
MCA btl: sm (MCA v2.1.0, API v3.3.0, Component v5.1.0)
MCA btl: tcp (MCA v2.1.0, API v3.3.0, Component v5.1.0)
MCA btl: uct (MCA v2.1.0, API v3.3.0, Component v5.1.0)
MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v5.1.0)
MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
v5.1.0)
MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
v5.1.0)
MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v5.1.0)
MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v5.1.0)
MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v5.1.0)
MCA mpool: hugepage (MCA v2.1.0, API v3.1.0, Component v5.1.0)
MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component
v5.1.0)
MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v5.1.0)
MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v5.1.0)
MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v5.1.0)
MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v5.1.0)
MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v5.1.0)
MCA threads: pthreads (MCA v2.1.0, API v1.0.0, Component v5.1.0)
MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v5.1.0)
MCA bml: r2 (MCA v2.1.0, API v2.1.0, Component v5.1.0)
MCA coll: adapt (MCA v2.1.0, API v2.4.0, Component v5.1.0)
MCA coll: basic (MCA v2.1.0, API v2.4.0, Component v5.1.0)
MCA coll: han (MCA v2.1.0, API v2.4.0, Component v5.1.0)
MCA coll: inter (MCA v2.1.0, API v2.4.0, Component v5.1.0)
MCA coll: libnbc (MCA v2.1.0, API v2.4.0, Component v5.1.0)
MCA coll: self (MCA v2.1.0, API v2.4.0, Component v5.1.0)
MCA coll: sync (MCA v2.1.0, API v2.4.0, Component v5.1.0)
MCA coll: tuned (MCA v2.1.0, API v2.4.0, Component v5.1.0)
MCA coll: ftagree (MCA v2.1.0, API v2.4.0, Component v5.1.0)
MCA coll: sm (MCA v2.1.0, API v2.4.0, Component v5.1.0)
MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v5.1.0)
MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v5.1.0)
MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component
v5.1.0)
MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component
v5.1.0)
MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v5.1.0)
MCA fs: lustre (MCA v2.1.0, API v2.0.0, Component v5.1.0)
MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v5.1.0)
MCA hook: comm_method (MCA v2.1.0, API v1.0.0, Component
v5.1.0)
MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v5.1.0)
MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v5.1.0)
MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v5.1.0)
MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v5.1.0)
MCA part: persist (MCA v2.1.0, API v4.0.0, Component v5.1.0)
MCA pml: cm (MCA v2.1.0, API v2.1.0, Component v5.1.0)
MCA pml: ob1 (MCA v2.1.0, API v2.1.0, Component v5.1.0)
MCA pml: ucx (MCA v2.1.0, API v2.1.0, Component v5.1.0)
MCA pml: v (MCA v2.1.0, API v2.1.0, Component v5.1.0)
MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component
v5.1.0)
MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component
v5.1.0)
MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v5.1.0)
MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v5.1.0)
MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component
v5.1.0)
MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component
v5.1.0)
bash-4.2$
The issue is that we want Synergia to be usable on as wide a variety of platforms as possible. While NFS-mounted disks are not recommended for parallel I/O, that might be the environment some users have available, for example on a university cluster.
Thanks for the report! Let's dig into this, could be a bug on our end, an HDF5 bug or an MPI-I/O bug.
Your comment here makes me suspect a ROMIO issue; ROMIO is a commonly used MPI-I/O implementation.
@s-sajid-ali In your error message, I see
> minor: Can't perform independent IO
Can you rerun with `export OPENPMD_HDF5_INDEPENDENT=OFF` and check if this crashes as well?
We can also set this programmatically if needed.
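For reference, the environment-variable workaround can be applied directly in the job script; a minimal sketch (the commented `srun` launch line is illustrative and assumes the `fodo_cxx` binary from the report):

```shell
# Force collective (non-independent) HDF5 writes in openPMD-api via its
# documented environment variable, then launch the simulation as usual.
export OPENPMD_HDF5_INDEPENDENT=OFF
echo "OPENPMD_HDF5_INDEPENDENT=$OPENPMD_HDF5_INDEPENDENT"
# srun -n 4 ./fodo_cxx
```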
If your previous parallel HDF5 routines do not show the same problem, then I suspect an issue on our end that we can fix! Can you confirm that this is the case?
@s-sajid-ali can you double-check whether the same issue appears with HDF5 1.12.2? HDF5 1.14.0 is relatively new, with major upgrades in the Virtual Object Layer (VOL), and this might be a regression.
CCing @jeanbez for potential insights on what to try for HDF5 on NFS?
> The issue is that we want Synergia to be usable on as wide a variety of platforms as possible. While NFS-mounted disks are not recommended for parallel I/O, that might be the environment some users have available, for example on a university cluster.
Absolutely, that is also our goal. Adding NFS guidance in #1427.
@s-sajid-ali @egstern can you confirm that other openPMD backends, e.g., ADIOS2 v2.8.3 do not show issues with your file systems?
@s-sajid-ali can you potentially add some printf debugging to the writes of `/data/0/particles/track_coords/moments/x`? Is there something special about it, e.g., writing zero particles in a chunk or similar? Which chunks were written before it crashes?
@s-sajid-ali could you try setting `HDF5_DO_MPI_FILE_SYNC=FALSE`?
> Can you rerun with `export OPENPMD_HDF5_INDEPENDENT=OFF` and check if this crashes as well?
@ax3l : No crash occurs with this environment variable set.
> could you try setting `HDF5_DO_MPI_FILE_SYNC=FALSE`?
@jeanbez : No crash with this either.
> Your https://github.com/openPMD/openPMD-api/issues/1423#issuecomment-1509430878 makes me suspect a ROMIO issue, which is a common MPI-I/O implementation used.
The UnifyFS docs seem to mention setting the ROMIO hint `romio_synchronizing_flush` as a more efficient alternative to achieve the same result. Is this something that could be used as a default in openPMD-api?
> @s-sajid-ali @egstern can you confirm that other openPMD backends, e.g., ADIOS2 v2.8.3 do not show issues with your file systems?
@ax3l : No crash with the ADIOS2 back-end.
> The UnifyFS docs seem to mention setting the ROMIO hint `romio_synchronizing_flush` as a more efficient alternative to achieve the same result. Is this something that could be used as a default in openPMD-api?
This sounds to me like it is specific to ROMIO; Open MPI uses its own MPI-I/O backend by default. Thus it would not be portable, and it would also reduce performance on all other, truly parallel filesystems.
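For users on a ROMIO-based MPI stack (e.g., MPICH, as in the original report) who want to experiment with this hint locally rather than as a library default, a hedged sketch: ROMIO reads extra hints from a file named by the `ROMIO_HINTS` environment variable. Note this is an assumption about the user's local setup; Open MPI's default OMPIO I/O layer ignores ROMIO hints, and the commented launch line is illustrative.

```shell
# Write the hint into a hints file and point ROMIO at it before launching.
cat > romio_hints.txt <<'EOF'
romio_synchronizing_flush true
EOF
export ROMIO_HINTS="$PWD/romio_hints.txt"
echo "ROMIO_HINTS=$ROMIO_HINTS"
# srun -n 4 ./fodo_cxx
```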