
Document collective IO

Open jedbrown opened this issue 6 years ago • 7 comments

The extent of current documentation:

The conduit_relay_mpi_io library provides the conduit::relay::mpi::io namespace which includes variants of these methods which take a MPI Communicator. These variants pass the communicator to the underlying I/O interface to enable collective I/O. Relay currently only supports collective I/O for ADIOS. -- https://llnl-conduit.readthedocs.io/en/latest/relay_io.html#relay-i-o-path-based-interface
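For concreteness, here is how I would expect the path-based MPI variants to be called. This is a sketch only: it assumes `conduit::relay::mpi::io::save` mirrors the serial `relay::io::save` signature with a trailing `MPI_Comm`, and the header name is assumed from the library name.

```cpp
// Sketch only: assumes conduit::relay::mpi::io::save mirrors the serial
// relay::io::save signature with a trailing MPI_Comm, and that the header
// is named after the conduit_relay_mpi_io library.
#include <mpi.h>
#include <vector>
#include <conduit.hpp>
#include <conduit_relay_mpi_io.hpp>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    conduit::Node n;
    n["rank"] = rank;
    n["values"].set(std::vector<double>(10, (double)rank));

    // Collective call: every rank passes the same path and the communicator.
    // Whether a non-ADIOS protocol tolerates this is exactly what the
    // documentation leaves ambiguous (see the question below).
    conduit::relay::mpi::io::save(n, "out.hdf5", "hdf5", MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}
```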

Some examples use conduit::relay::io_blueprint::save, for which I see no MPI-equipped variant and for which no symbols appear in the compiled libraries. It looks like conduit_relay_mpi_io_blueprint.hpp was perhaps an attempt to create one, but that header isn't public and I don't see how to access the feature.

The wording is ambiguous about whether one may call collectively (with a matching filename) using a non-ADIOS protocol (it just won't actually use MPI-IO to put bytes on disk), or whether doing so may cause conflict/corruption. In what circumstances must the caller distinguish file names by rank?

jedbrown avatar Oct 28 '19 21:10 jedbrown

Thanks, yes, we need to improve here.

That header is a hint of where we want to go. While it is not MPI-IO, we have bits of MPI logic in VisIt and Ascent related specifically to meshes that we want to move into an MPI variant of conduit::relay::io_blueprint, to provide easier access for writing and reading meshes in MPI jobs.

Most of our production MPI use cases do I/O at the generic Conduit tree level, where the trees have some subtrees that conform to the blueprint plus a host of other stuff. Those cases are more complex than what we want for conduit::relay::io_blueprint, which we see as a potentially easier future path.

cyrush avatar Oct 28 '19 23:10 cyrush

Any chance you could drop in a reference implementation or point to an example that writes blueprint-compliant output from MPI? I'd like to merge PETSc support and don't want it to have unnecessary complexity or be a maintenance liability going forward.

jedbrown avatar Nov 27 '19 05:11 jedbrown

@jedbrown Does this work? #474

mclarsen avatar Nov 27 '19 16:11 mclarsen

Thanks, @mclarsen. I now have a PETSc MR almost ready.

I'd love to see a future code path that avoids explicit dependence on filesystem sequencing (creating a directory that must be visible from all ranks) and avoids num_ranks filesystem metadata entries per output cycle. Perhaps a future output "format" could wrap everything into a single HDF5 file or a DAOS-native container?

jedbrown avatar Dec 02 '19 00:12 jedbrown

@jedbrown

Using a single HDF5 file is certainly possible -- using a baton to write, or broadcasting domains to a single writer for small cases.

We usually rely on multiple independent files to scale out in parallel, and then use batons to keep the number of files down at the highest scales.
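For anyone unfamiliar with the pattern, here is a minimal baton sketch (illustration only, not Conduit's actual implementation; `write_my_part` is a placeholder callback): rank r blocks on a token from rank r-1, writes its piece, then passes the token to rank r+1, so at most one rank touches a given file at a time.

```cpp
// Minimal baton-passing sketch (illustration only, not Conduit's code):
// at most one rank per communicator touches the shared file at a time.
#include <mpi.h>

void baton_write(MPI_Comm comm, void (*write_my_part)(int rank))
{
    int rank = 0, size = 0, token = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    // Wait for the baton from the previous rank (rank 0 starts immediately).
    if (rank > 0)
        MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, comm, MPI_STATUS_IGNORE);

    // Open the file in append mode, write this rank's domains, close it.
    write_my_part(rank);

    // Hand the baton to the next rank.
    if (rank < size - 1)
        MPI_Send(&token, 1, MPI_INT, rank + 1, 0, comm);
}
```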

Short term, we are going to expand the index to have an explicit domain-to-file map. We need this for cases where implicit ordering falls short (AMR is one).

DAOS and other object stores are interesting future directions.

cyrush avatar Dec 02 '19 19:12 cyrush

I don't follow the need for a baton; MPI-IO (H5Pset_fapl_mpio) should be capable of high performance, as should (though I don't know the current state of the implementation) native use of DAOS (bottom-left of this figure).
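For reference, the standard parallel-HDF5 setup I mean is just this (error checking omitted for brevity): open the file with an mpio file-access property list and request collective transfers on the dataset-transfer property list.

```cpp
// Standard parallel HDF5 (error checking omitted for brevity).
#include <mpi.h>
#include <hdf5.h>

hid_t open_parallel_file(const char *path, MPI_Comm comm)
{
    // File-access property list carrying the communicator to MPI-IO.
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);

    // All ranks open the same file collectively.
    hid_t file = H5Fcreate(path, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);
    return file;
}

hid_t collective_dxpl()
{
    // Dataset-transfer property list requesting collective I/O on H5Dwrite.
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    return dxpl;
}
```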

I know it may not be ideal for viz, but in application checkpointing and some in-situ analyses, we'd like to form a coherent (non-partitioned) data description. One way to do this is to add a global index to each cell/vertex, which can be composed to avoid the present duplication on subdomain interfaces.
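A sketch of what that could look like on a blueprint mesh node, with the caveat that the field and topology names here are placeholders of my own choosing, not an established Conduit convention for global IDs:

```cpp
// Sketch: attach per-vertex global indices as an ordinary blueprint field.
// "global_vertex_ids" and "topo" are placeholder names, not an established
// Conduit convention.
#include <vector>
#include <conduit.hpp>

void add_global_vertex_ids(conduit::Node &mesh,
                           const std::vector<conduit::int64> &gids)
{
    conduit::Node &f = mesh["fields/global_vertex_ids"];
    f["association"] = "vertex";
    f["topology"]    = "topo";
    // One global id per local vertex; shared ids repeat on subdomain interfaces.
    f["values"].set(gids);
}
```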

jedbrown avatar Dec 04 '19 03:12 jedbrown

thanks @jedbrown, I'll have to ponder a bit more.

Collective MPI-IO support in HDF5 happens at the HDF5 dataset level. For our use cases that would mean at the conduit leaf level. We may be able to do something at the mesh blueprint level, but not at the general conduit tree level.

With blueprint metadata providing global offsets for local elements, it is possible that, for a subset of use cases, we could create effective global arrays and transform the data.
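As an illustration of the "effective global arrays" idea: if each rank knows its global offset and local count for a leaf, a collective hyperslab write assembles one coherent dataset. This is a sketch only; the dataset name is a placeholder, and the offsets are assumed to come from blueprint metadata or an MPI_Exscan over local counts.

```cpp
// Sketch: each rank writes its slab of one logical global array.
// global_n, local_n, and offset are assumed to come from blueprint
// metadata (or an MPI_Exscan over local counts); "values" is a placeholder.
#include <hdf5.h>

void write_global_leaf(hid_t file, hid_t dxpl_collective,
                       const double *local_vals,
                       hsize_t global_n, hsize_t local_n, hsize_t offset)
{
    hsize_t gdims[1] = {global_n};
    hsize_t ldims[1] = {local_n};

    hid_t filespace = H5Screate_simple(1, gdims, NULL);
    hid_t dset = H5Dcreate(file, "values", H5T_NATIVE_DOUBLE, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    // Select this rank's contiguous slab in the global dataset.
    hsize_t start[1] = {offset};
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, ldims, NULL);

    hid_t memspace = H5Screate_simple(1, ldims, NULL);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace,
             dxpl_collective, local_vals);

    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Dclose(dset);
}
```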

cyrush avatar Dec 04 '19 18:12 cyrush