
Question about parallel file I/O with HDF5

Open jreadey opened this issue 7 years ago • 1 comment

I have some basic questions about how HPAT does I/O with HDF5 files...

For example, in this program:

```python
import numpy as np
import h5py
import hpat

@hpat.jit
def example_1D(n):
    f = h5py.File("data.h5", "r")
    A = f['A'][:]
    return np.sum(A)
```

  1. Is each MPI process reading a portion of f['A']?
  2. Would it make sense to use the parallel HDF5 library?
  3. How are writes to an HDF5 dataset handled?
  4. I've only tried HPAT on a single machine. Any issues when running on an MPI cluster?
  5. Is it possible to use MPI explicitly in a HPAT program?
  6. Does HPAT look at the chunk layout for HDF5 datasets to determine how to partition them?
  7. Is there any way to use HPAT from within a Jupyter notebook?

jreadey avatar Jan 25 '18 03:01 jreadey

  1. Yes.
  2. HPAT already uses parallel HDF5 with collective I/O. The HDF5 build installed through the channel alongside HPAT is the parallel one. Also see: https://github.com/IntelLabs/hpat/blob/master/hpat/_io.cpp#L111
  3. Similar to reads - collective I/O is used.
  4. No.
  5. It is possible but the user interface needs to be improved.
  6. No. Dataset partitioning is simply 1D block distribution among ranks.
  7. HPAT can obviously be used sequentially in a notebook, but we need to build support for parallelism. IPython already has MPI infrastructure, so this shouldn't be too hard: https://ipython.org/ipython-doc/3/parallel/parallel_mpi.html
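
Answers 1 and 6 above describe a 1D block distribution of the dataset among MPI ranks. As a minimal sketch of that partitioning arithmetic (the helper name is hypothetical, not HPAT's actual API; HPAT computes this inside its compiler passes):

```python
def block_1d(total, nranks, rank):
    # Hypothetical helper: split `total` elements into contiguous 1D
    # blocks, one per rank. When the split is uneven, earlier ranks
    # get one extra element.
    base, rem = divmod(total, nranks)
    start = rank * base + min(rank, rem)
    count = base + (1 if rank < rem else 0)
    return start, count

# Example: 10 elements over 4 ranks -> blocks of sizes 3, 3, 2, 2
parts = [block_1d(10, 4, r) for r in range(4)]
print(parts)  # [(0, 3), (3, 3), (6, 2), (8, 2)]
```

Each rank would then issue an HDF5 hyperslab read of `count` elements starting at offset `start`, collectively across ranks per answer 2.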

ehsantn avatar Jan 25 '18 03:01 ehsantn