Question about parallel file I/O with HDF5
I have some basic questions about how HPAT does I/O with HDF5 files...
For example, in this program:
```python
import h5py
import numpy as np
import hpat

@hpat.jit
def example_1D(n):
    f = h5py.File("data.h5", "r")
    A = f['A'][:]    # read the whole dataset 'A'
    f.close()
    return np.sum(A)
```
- Is each MPI process reading a portion of f['A']?
- Would it make sense to use the parallel HDF5 library?
- How are writes to an HDF5 dataset handled?
- I've only tried HPAT on a single machine. Are there any issues when running on an MPI cluster?
- Is it possible to use MPI explicitly in an HPAT program?
- Does HPAT look at the chunk layout for HDF5 datasets to determine how to partition them?
- Is there any way to use HPAT from within a Jupyter notebook?
- Yes.
- HPAT already uses parallel HDF5 with collective I/O. The HDF5 build installed through the same channel as HPAT is parallel. Also see: https://github.com/IntelLabs/hpat/blob/master/hpat/_io.cpp#L111
- Similar to reads: collective I/O is used.
- No.
- It is possible, but the user interface needs to be improved.
- No. Dataset partitioning is simply a 1D block distribution among ranks; the h5py sketch below illustrates this.
- It can be used sequentially, but we need to build support for parallelism. IPython already has the MPI infrastructure, so this shouldn't be too hard (see the ipyparallel sketch below): https://ipython.org/ipython-doc/3/parallel/parallel_mpi.html
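
For reference, here is a minimal sketch of the 1D block distribution and collective read described above. This is not HPAT's implementation (that lives in the C++ file linked above); it assumes h5py built against parallel HDF5, mpi4py, and a 1D dataset 'A' in data.h5:

```python
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nranks = comm.Get_size()

# Open the file with the MPI-IO driver so all ranks share it.
f = h5py.File("data.h5", "r", driver="mpio", comm=comm)
dset = f['A']
n = dset.shape[0]

# 1D block distribution: each rank owns one contiguous slice.
start = (n * rank) // nranks
end = (n * (rank + 1)) // nranks

# Collective read of this rank's block.
with dset.collective:
    A_local = dset[start:end]

f.close()

# Combine the local sums across ranks.
total = comm.allreduce(np.sum(A_local), op=MPI.SUM)
if rank == 0:
    print(total)
```

Run with something like `mpiexec -n 4 python sketch.py`; each rank touches only its own block, and HDF5 issues the reads collectively through MPI-IO.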
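
And a minimal sketch of the notebook-parallelism direction, using the machinery from the IPython docs linked above. It assumes engines started under MPI (e.g. `ipcluster start -n 4 --engines=MPIEngineSetLauncher`); newer IPython versions ship this as the separate `ipyparallel` package:

```python
from ipyparallel import Client   # `from IPython.parallel import Client` on IPython 3

rc = Client()   # connect to the running engines
view = rc[:]    # a DirectView over all engines

def mpi_rank():
    from mpi4py import MPI
    return MPI.COMM_WORLD.Get_rank()

# Runs on every engine; each engine has a live MPI communicator, which is
# what an HPAT-jitted function would need in order to execute in parallel.
print(view.apply_sync(mpi_rank))   # e.g. [0, 1, 2, 3]
```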