
Crash with `HDF5` backend due to incompatible `MPI_File_sync` policy

Open s-sajid-ali opened this issue 2 years ago • 12 comments

Writing a dataset with openPMD to work1, an NFS filesystem on Fermilab's wilson-cluster, as part of a simple fodo_cxx simulation fails:

HDF5-DIAG: Error detected in HDF5 (1.14.0) MPI-process 0:
  #000: /tmp/sasyed/spack-stage/spack-stage-hdf5-1.14.0-fs2ybrgllsgpb3ltdwkbqv5a3seilqya/spack-src/src/H5D.c line 1390 in H5Dwrite(): can't synchronously write data
    major: Dataset
    minor: Write failed
  #001: /tmp/sasyed/spack-stage/spack-stage-hdf5-1.14.0-fs2ybrgllsgpb3ltdwkbqv5a3seilqya/spack-src/src/H5D.c line 1333 in H5D__write_api_common(): can't write data
    major: Dataset
    minor: Write failed
  #002: /tmp/sasyed/spack-stage/spack-stage-hdf5-1.14.0-fs2ybrgllsgpb3ltdwkbqv5a3seilqya/spack-src/src/H5VLcallback.c line 2282 in H5VL_dataset_write_direct(): dataset write failed
    major: Virtual Object Layer
    minor: Write failed
  #003: /tmp/sasyed/spack-stage/spack-stage-hdf5-1.14.0-fs2ybrgllsgpb3ltdwkbqv5a3seilqya/spack-src/src/H5VLcallback.c line 2237 in H5VL__dataset_write(): dataset write failed
    major: Virtual Object Layer
    minor: Write failed
  #004: /tmp/sasyed/spack-stage/spack-stage-hdf5-1.14.0-fs2ybrgllsgpb3ltdwkbqv5a3seilqya/spack-src/src/H5VLnative_dataset.c line 408 in H5VL__native_dataset_write(): can't write data
    major: Dataset
    minor: Write failed
  #005: /tmp/sasyed/spack-stage/spack-stage-hdf5-1.14.0-fs2ybrgllsgpb3ltdwkbqv5a3seilqya/spack-src/src/H5Dio.c line 673 in H5D__write(): unable to adjust I/O info for parallel I/O
    major: Dataset
    minor: Unable to initialize object
  #006: /tmp/sasyed/spack-stage/spack-stage-hdf5-1.14.0-fs2ybrgllsgpb3ltdwkbqv5a3seilqya/spack-src/src/H5Dio.c line 1211 in H5D__ioinfo_adjust(): Can't perform independent write when MPI_File_sync is required by ROMIO driver.
    major: Dataset
    minor: Can't perform independent IO
[AbstractIOHandlerImpl] IO Task WRITE_DATASET failed with exception. Clearing IO queue and passing on the exception.
[HDF5] Internal error: Failed to write dataset /data/0/particles/track_coords/moments/x
Abort(888) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 888) - process 0
srun: error: wcgwn013: task 0: Exited with exit code 120

First observed by @egstern.

Software Environment

  • version of openPMD-api: 0.15.1
  • installed openPMD-api via: spack
  • operating system: centos-7
  • machine: wilson-cluster@fermilab
  • name and version of Python implementation: python-3.11
  • version of HDF5: 1.14.0
  • version of ADIOS2: 2.8.3
  • name and version of MPI: MPICH-4.1.1

Spack spec: the full spec is here: https://gist.github.com/s-sajid-ali/0596b18b83400172067c417932b1a852

Additional information: let me know if I should try building openpmd-api with testing enabled (maybe add a spack variant for it) and see if that succeeds.

s-sajid-ali avatar Apr 14 '23 20:04 s-sajid-ali

We tried setting HDF5_DO_MPI_FILE_SYNC=1 and it did not help.

This issue does not occur on the Lustre filesystem on the same cluster, wclustre.

s-sajid-ali avatar Apr 14 '23 20:04 s-sajid-ali

When using `openmpi@main~romio` (i.e., Open MPI built without ROMIO), no crash occurs for the same program on work1.

Full spec for openPMD is here: https://gist.github.com/s-sajid-ali/bbf0328c7fe1ecb3090d43fb3381f28c

Here is the result of `ompi_info`:

bash-4.2$ /wclustre/accelsim/spack-shared-v4/spack/opt/spack/linux-scientific7-ivybridge/gcc-12.2.0/openmpi-main-sfsdxb5khtvn7r4rkiw4tgfldqzru5jk/bin/ompi_info 
                 Package: Open MPI [email protected] Distribution
                Open MPI: 5.1.0a1
  Open MPI repo revision: 67620f4
   Open MPI release date: Unreleased developer copy
                 MPI API: 3.1.0
            Ident string: 5.1.0a1
                  Prefix: /wclustre/accelsim/spack-shared-v4/spack/opt/spack/linux-scientific7-ivybridge/gcc-12.2.0/openmpi-main-sfsdxb5khtvn7r4rkiw4tgfldqzru5jk
 Configured architecture: x86_64-pc-linux-gnu
           Configured by: sasyed
           Configured on: Fri Apr 14 22:26:36 UTC 2023
          Configure host: wc.fnal.gov
  Configure command line: '--prefix=/wclustre/accelsim/spack-shared-v4/spack/opt/spack/linux-scientific7-ivybridge/gcc-12.2.0/openmpi-main-sfsdxb5khtvn7r4rkiw4tgfldqzru5jk'
                          '--enable-shared' '--disable-silent-rules'
                          '--disable-builtin-atomics' '--enable-static'
                          '--enable-mpi1-compatibility' '--without-verbs'
                          '--without-mxm' '--without-knem' '--without-ofi'
                          '--without-fca' '--without-xpmem'
                          '--with-ucx=/wclustre/accelsim/spack-shared-v4/spack/opt/spack/linux-scientific7-ivybridge/gcc-12.2.0/ucx-1.14.0-ei775jwkepcnx5rsyyzy2bgbblbvb6d4'
                          '--without-hcoll' '--without-psm2' '--without-psm'
                          '--without-cma' '--without-cray-xpmem'
                          '--without-sge' '--without-tm'
                          '--without-loadleveler' '--without-alps'
                          '--with-slurm' '--without-lsf'
                          '--disable-memchecker'
                          '--with-libevent=/wclustre/accelsim/spack-shared-v4/spack/opt/spack/linux-scientific7-ivybridge/gcc-12.2.0/libevent-2.1.12-76cfua2fg4pcodffcampmyebxseait2l'
                          '--with-lustre=/usr'
                          '--with-zlib=/wclustre/accelsim/spack-shared-v4/spack/opt/spack/linux-scientific7-ivybridge/gcc-12.2.0/zlib-1.2.13-vyuxkpqw47jxvdvuth53a6ns3drcqf5b'
                          '--with-hwloc=/wclustre/accelsim/spack-shared-v4/spack/opt/spack/linux-scientific7-ivybridge/gcc-12.2.0/hwloc-2.9.0-mdsykxr4565r7v5bijkq252kzcjvffmd'
                          '--disable-java' '--disable-mpi-java'
                          '--disable-io-romio' '--with-gpfs=no'
                          '--without-cuda' '--enable-wrapper-rpath'
                          '--disable-wrapper-runpath'
                          '--with-wrapper-ldflags=-Wl,-rpath,/srv/software/gnu12/12.2.0/lib/gcc/x86_64-pc-linux-gnu/12.2.0
                          -Wl,-rpath,/srv/software/gnu12/12.2.0/lib64'
                Built by: sasyed
                Built on: Fri Apr 14 22:40:28 UTC 2023
              Built host: wc.fnal.gov
              C bindings: yes
             Fort mpif.h: yes (all)
            Fort use mpi: yes (full: ignore TKR)
       Fort use mpi size: deprecated-ompi-info-value
        Fort use mpi_f08: yes
 Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
                          limitations in the
                          /wclustre/accelsim/spack-shared-v4/spack/lib/spack/env/gcc/gfortran
                          compiler and/or Open MPI, does not support the
                          following: array subsections, direct passthru
                          (where possible) to underlying Open MPI's C
                          functionality
  Fort mpi_f08 subarrays: no
           Java bindings: no
  Wrapper compiler rpath: rpath
              C compiler: /wclustre/accelsim/spack-shared-v4/spack/lib/spack/env/gcc/gcc
     C compiler absolute: /wclustre/accelsim/spack-shared-v4/spack/lib/spack/env/gcc/gcc
  C compiler family name: GNU
      C compiler version: 12.2.0
            C++ compiler: /wclustre/accelsim/spack-shared-v4/spack/lib/spack/env/gcc/g++
   C++ compiler absolute: /wclustre/accelsim/spack-shared-v4/spack/lib/spack/env/gcc/g++
           Fort compiler: /wclustre/accelsim/spack-shared-v4/spack/lib/spack/env/gcc/gfortran
       Fort compiler abs: /wclustre/accelsim/spack-shared-v4/spack/lib/spack/env/gcc/gfortran
         Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
   Fort 08 assumed shape: yes
      Fort optional args: yes
          Fort INTERFACE: yes
    Fort ISO_FORTRAN_ENV: yes
       Fort STORAGE_SIZE: yes
      Fort BIND(C) (all): yes
      Fort ISO_C_BINDING: yes
 Fort SUBROUTINE BIND(C): yes
       Fort TYPE,BIND(C): yes
 Fort T,BIND(C,name="a"): yes
            Fort PRIVATE: yes
           Fort ABSTRACT: yes
       Fort ASYNCHRONOUS: yes
          Fort PROCEDURE: yes
         Fort USE...ONLY: yes
           Fort C_FUNLOC: yes
 Fort f08 using wrappers: yes
         Fort MPI_SIZEOF: yes
             C profiling: yes
   Fort mpif.h profiling: yes
  Fort use mpi profiling: yes
   Fort use mpi_f08 prof: yes
          Thread support: posix (MPI_THREAD_MULTIPLE: yes, OPAL support: yes,
                          OMPI progress: no, Event lib: yes)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: yes
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
              dl support: yes
   Heterogeneous support: no
       MPI_WTIME support: native
     Symbol vis. support: yes
   Host topology support: yes
            IPv6 support: no
          MPI extensions: affinity, cuda, ftmpi, rocm, shortfloat
 Fault Tolerance support: yes
          FT MPI support: yes
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
         MCA accelerator: null (MCA v2.1.0, API v1.0.0, Component v5.1.0)
           MCA allocator: basic (MCA v2.1.0, API v2.0.0, Component v5.1.0)
           MCA allocator: bucket (MCA v2.1.0, API v2.0.0, Component v5.1.0)
           MCA backtrace: execinfo (MCA v2.1.0, API v2.0.0, Component v5.1.0)
                 MCA btl: self (MCA v2.1.0, API v3.3.0, Component v5.1.0)
                 MCA btl: sm (MCA v2.1.0, API v3.3.0, Component v5.1.0)
                 MCA btl: tcp (MCA v2.1.0, API v3.3.0, Component v5.1.0)
                 MCA btl: uct (MCA v2.1.0, API v3.3.0, Component v5.1.0)
                  MCA dl: dlopen (MCA v2.1.0, API v1.0.0, Component v5.1.0)
                  MCA if: linux_ipv6 (MCA v2.1.0, API v2.0.0, Component
                          v5.1.0)
                  MCA if: posix_ipv4 (MCA v2.1.0, API v2.0.0, Component
                          v5.1.0)
         MCA installdirs: env (MCA v2.1.0, API v2.0.0, Component v5.1.0)
         MCA installdirs: config (MCA v2.1.0, API v2.0.0, Component v5.1.0)
              MCA memory: patcher (MCA v2.1.0, API v2.0.0, Component v5.1.0)
               MCA mpool: hugepage (MCA v2.1.0, API v3.1.0, Component v5.1.0)
             MCA patcher: overwrite (MCA v2.1.0, API v1.0.0, Component
                          v5.1.0)
              MCA rcache: grdma (MCA v2.1.0, API v3.3.0, Component v5.1.0)
           MCA reachable: weighted (MCA v2.1.0, API v2.0.0, Component v5.1.0)
               MCA shmem: mmap (MCA v2.1.0, API v2.0.0, Component v5.1.0)
               MCA shmem: posix (MCA v2.1.0, API v2.0.0, Component v5.1.0)
               MCA shmem: sysv (MCA v2.1.0, API v2.0.0, Component v5.1.0)
             MCA threads: pthreads (MCA v2.1.0, API v1.0.0, Component v5.1.0)
               MCA timer: linux (MCA v2.1.0, API v2.0.0, Component v5.1.0)
                 MCA bml: r2 (MCA v2.1.0, API v2.1.0, Component v5.1.0)
                MCA coll: adapt (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: basic (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: han (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: inter (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: libnbc (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: self (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: sync (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: tuned (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: ftagree (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA coll: sm (MCA v2.1.0, API v2.4.0, Component v5.1.0)
                MCA fbtl: posix (MCA v2.1.0, API v2.0.0, Component v5.1.0)
               MCA fcoll: dynamic (MCA v2.1.0, API v2.0.0, Component v5.1.0)
               MCA fcoll: dynamic_gen2 (MCA v2.1.0, API v2.0.0, Component
                          v5.1.0)
               MCA fcoll: individual (MCA v2.1.0, API v2.0.0, Component
                          v5.1.0)
               MCA fcoll: vulcan (MCA v2.1.0, API v2.0.0, Component v5.1.0)
                  MCA fs: lustre (MCA v2.1.0, API v2.0.0, Component v5.1.0)
                  MCA fs: ufs (MCA v2.1.0, API v2.0.0, Component v5.1.0)
                MCA hook: comm_method (MCA v2.1.0, API v1.0.0, Component
                          v5.1.0)
                  MCA io: ompio (MCA v2.1.0, API v2.0.0, Component v5.1.0)
                 MCA osc: sm (MCA v2.1.0, API v3.0.0, Component v5.1.0)
                 MCA osc: rdma (MCA v2.1.0, API v3.0.0, Component v5.1.0)
                 MCA osc: ucx (MCA v2.1.0, API v3.0.0, Component v5.1.0)
                MCA part: persist (MCA v2.1.0, API v4.0.0, Component v5.1.0)
                 MCA pml: cm (MCA v2.1.0, API v2.1.0, Component v5.1.0)
                 MCA pml: ob1 (MCA v2.1.0, API v2.1.0, Component v5.1.0)
                 MCA pml: ucx (MCA v2.1.0, API v2.1.0, Component v5.1.0)
                 MCA pml: v (MCA v2.1.0, API v2.1.0, Component v5.1.0)
            MCA sharedfp: individual (MCA v2.1.0, API v2.0.0, Component
                          v5.1.0)
            MCA sharedfp: lockedfile (MCA v2.1.0, API v2.0.0, Component
                          v5.1.0)
            MCA sharedfp: sm (MCA v2.1.0, API v2.0.0, Component v5.1.0)
                MCA topo: basic (MCA v2.1.0, API v2.2.0, Component v5.1.0)
                MCA topo: treematch (MCA v2.1.0, API v2.2.0, Component
                          v5.1.0)
           MCA vprotocol: pessimist (MCA v2.1.0, API v2.0.0, Component
                          v5.1.0)
bash-4.2$ 

s-sajid-ali avatar Apr 15 '23 00:04 s-sajid-ali

The issue is that we want Synergia to be usable on as wide a variety of platforms as possible. While NFS-mounted disks are not recommended for parallel I/O, that might be the environment some users, for example on a university cluster, have available.

egstern avatar Apr 19 '23 13:04 egstern

Thanks for the report! Let's dig into this; it could be a bug on our end, an HDF5 bug, or an MPI-I/O bug.

Your comment here makes me suspect a ROMIO issue; ROMIO is a commonly used MPI-I/O implementation.

@s-sajid-ali In your error message, I see

minor: Can't perform independent IO

Can you rerun with `export OPENPMD_HDF5_INDEPENDENT=OFF` and check if this crashes as well? We can also set this programmatically if needed.

If your previous parallel HDF5 routines do not show the same problem, then I suspect an issue on our end that we can fix! Can you confirm that this is the case?
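
For reference, a minimal sketch of setting this programmatically before the Series (and thus the HDF5 backend) is created; the file name is illustrative and `setenv` is POSIX:

```cpp
// Sketch only: force collective (non-independent) HDF5 writes by setting
// OPENPMD_HDF5_INDEPENDENT=OFF from inside the application, before the HDF5
// backend is initialized. Equivalent to `export OPENPMD_HDF5_INDEPENDENT=OFF`.
#include <openPMD/openPMD.hpp>

#include <mpi.h>

#include <cstdlib> // setenv (POSIX)

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    // Must run before the Series is constructed so the HDF5 backend sees it.
    setenv("OPENPMD_HDF5_INDEPENDENT", "OFF", /* overwrite = */ 1);

    {
        // Illustrative file name; any parallel HDF5 Series works the same way.
        openPMD::Series series(
            "diags/openpmd_%T.h5", openPMD::Access::CREATE, MPI_COMM_WORLD);
        // ... define records and store chunks as usual ...
    } // Series flushes and closes at end of scope

    MPI_Finalize();
    return 0;
}
```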

ax3l avatar Apr 19 '23 16:04 ax3l

@s-sajid-ali can you double-check whether the same issue appears with HDF5 1.12.2? HDF5 1.14.0 is relatively new, with major upgrades in the Virtual Object Layer (VOL), and this might be a regression.

CCing @jeanbez for potential insights on what to try for HDF5 on NFS?

ax3l avatar Apr 19 '23 16:04 ax3l

The issue is that we want Synergia to be usable on as wide a variety of platforms as possible. While NFS-mounted disks are not recommended for parallel I/O, that might be the environment some users, for example on a university cluster, have available.

Absolutely, that is also our goal. Adding NFS guidance in #1427.

@s-sajid-ali @egstern can you confirm that other openPMD backends, e.g., ADIOS2 v2.8.3, do not show issues with your file systems?
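
For a quick cross-check, the backend is selected from the Series file extension, so a minimal sketch like the one below (illustrative file name) writes ADIOS2 BP data with otherwise unchanged code:

```cpp
// Sketch only: openPMD selects the I/O backend from the file extension,
// so switching the Series name from ".h5" to ".bp" writes ADIOS2 BP data
// without any other code changes.
#include <openPMD/openPMD.hpp>

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    {
        openPMD::Series series(
            "diags/openpmd_%T.bp", openPMD::Access::CREATE, MPI_COMM_WORLD);
        // ... identical record definitions and storeChunk calls as with HDF5 ...
    }
    MPI_Finalize();
    return 0;
}
```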

ax3l avatar Apr 19 '23 16:04 ax3l

@s-sajid-ali can you potentially add some printf debugging to the writes of /data/0/particles/track_coords/moments/x? Is there something special about it, e.g., writing zero particles in a chunk or similar? Which chunks were written before it crashes?

ax3l avatar Apr 19 '23 16:04 ax3l

@s-sajid-ali could you try setting `HDF5_DO_MPI_FILE_SYNC=FALSE`?

jeanbez avatar Apr 19 '23 16:04 jeanbez

Can you rerun with `export OPENPMD_HDF5_INDEPENDENT=OFF` and check if this crashes as well?

@ax3l: No crash occurs with this environment variable set.

could you try setting `HDF5_DO_MPI_FILE_SYNC=FALSE`?

@jeanbez: No crash with this either.

s-sajid-ali avatar Apr 19 '23 20:04 s-sajid-ali

Your comment (https://github.com/openPMD/openPMD-api/issues/1423#issuecomment-1509430878) makes me suspect a ROMIO issue; ROMIO is a commonly used MPI-I/O implementation.

The UnifyFS docs seem to mention setting the ROMIO hint `romio_synchronizing_flush` as a more efficient alternative that achieves the same effect. Is this something that could be used as a default in openPMD?
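
For illustration, this is roughly how such a hint would be passed to HDF5's MPI-IO driver directly via an `MPI_Info` object (not through openPMD; the exact hint value accepted by ROMIO should be double-checked):

```cpp
// Illustration only, not an openPMD API: pass a ROMIO hint to HDF5's MPI-IO
// driver through an MPI_Info object. Unknown or unsupported hints are silently
// ignored by MPI, and the value accepted for "romio_synchronizing_flush"
// should be verified against the ROMIO documentation.
#include <hdf5.h>

#include <mpi.h>

hid_t create_file_with_romio_hint(MPI_Comm comm, char const *path)
{
    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "romio_synchronizing_flush", "true"); // assumed value

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, info); // MPI-IO virtual file driver

    hid_t file = H5Fcreate(path, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    H5Pclose(fapl);
    MPI_Info_free(&info);
    return file;
}
```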

s-sajid-ali avatar Apr 19 '23 20:04 s-sajid-ali

@s-sajid-ali @egstern can you confirm that other openPMD backends, e.g., ADIOS2 v2.8.3, do not show issues with your file systems?

@ax3l: No crash with the ADIOS2 backend.

s-sajid-ali avatar Apr 19 '23 21:04 s-sajid-ali

The UnifyFS docs seem to mention setting the ROMIO hint `romio_synchronizing_flush` as a more efficient alternative that achieves the same effect. Is this something that could be used as a default in openPMD?

This sounds to me like it is specific to ROMIO; Open MPI uses its own MPI-I/O backend (OMPIO) by default. Thus, it would not be portable, and it would also reduce performance on all other, truly parallel filesystems.

ax3l avatar Apr 20 '23 18:04 ax3l