
[Discussion] HDF5+GDS+multi-threading

Open madsbk opened this issue 2 years ago • 5 comments

In #287, we propose to implement a Virtual File Driver (VFD) that uses KvikIO to accelerate HDF5 IO. However, HDF5 isn’t thread-safe, so implementing a VFD might be of limited interest to projects like Legate that make heavy use of multi-threading.

Note that it is possible to compile HDF5 with --enable-threadsafe, but this effectively turns the entire HDF5 library into one giant critical region. There is an RFC to make HDF5 (or parts of it) thread-safe, RFC: Multi-Thread HDF5, but it is not coming soon.

Let’s look at some alternative approaches that support both GDS and multi-threading:

1. Use Kerchunk

  • Easy to support GDS through KvikIO
  • We would need to extend Kerchunk to support virtual datasets
  • Only read support
  • Only supports basic HDF5 (no endianness conversion etc.)

2. Parse the HDF5 metadata and extract contiguous data blocks ourselves

  • Easy to support GDS through KvikIO
  • Supports on-the-fly decompression using nvCOMP
  • Supports writing: locally create empty HDF5 files and then fill them with data in parallel
  • Only supports basic HDF5 (no endianness conversion etc.)
  • Harder to support compressed writes, since we don’t know the size of the data blocks on disk ahead of time
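The core of approach (2) can be sketched without HDF5 at all: once the byte offsets and sizes of the contiguous data blocks have been extracted from the metadata, each block can be fetched independently and in parallel. A minimal stand-in sketch using `os.pread` where KvikIO would go (the `read_blocks` helper and the `(offset, size)` block list are hypothetical):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def read_blocks(path, blocks):
    """Read a list of (offset, size) byte ranges concurrently.

    Stand-in for KvikIO: os.pread takes an explicit offset, so the
    workers share one file descriptor without any seek state.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor() as pool:
            return list(pool.map(lambda b: os.pread(fd, b[1], b[0]), blocks))
    finally:
        os.close(fd)
```

With KvikIO, the lambda would instead issue a positional read straight into device memory (and nvCOMP could decompress each block after it lands); the fan-out structure stays the same.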

3. Wait for multi-thread support in HDF5

  • Supports reading and writing of any HDF5 file.
  • It might take a long time for something like RFC: Multi-Thread HDF5 to be released.
  • Hard to support GDS; we would need to implement a VFD that uses KvikIO.
  • Might be hard to support on-the-fly GPU compression and decompression.

Any thoughts?

madsbk avatar Oct 02 '23 12:10 madsbk

[...]

Any thoughts?

Note that I do not know a lot about the details of this VFD interface in HDF5, so I may well be being naive.

At what level do you need thread-safety in the VFD interface? It looks to me like you're providing callbacks for read/write that HDF5 can use. If the HDF5 calls are single-threaded, you can presumably do whatever you like internally as long as you expose a "single-thread consistent" interface to HDF5.

Or is it not that easy?

wence- avatar Oct 02 '23 14:10 wence-

If the HDF5 calls are single-threaded, you can presumably do whatever you like internally as long as you expose a "single-thread consistent" interface to HDF5.

Correct, the VFD itself can be multi-threaded, but Legate uses threads (as opposed to processes) when parallelizing tasks on the same node. E.g., if two Legate tasks run on the same machine, their calls to HDF5 must be serialized.
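A minimal sketch of that serialization constraint, assuming a single process-wide lock around every HDF5 entry point (the `hdf5_call` wrapper is hypothetical; the bulk data transfer inside the VFD callbacks can still fan out to worker threads):

```python
import threading
from contextlib import contextmanager

_hdf5_lock = threading.Lock()  # one lock for the whole non-thread-safe library

@contextmanager
def hdf5_serialized():
    """Every call into HDF5 runs under this lock, so concurrent tasks
    on the same node see the library as effectively single-threaded."""
    with _hdf5_lock:
        yield

def hdf5_call(fn, *args, **kwargs):
    # Placeholder: fn would be a real HDF5 (or h5py) entry point.
    with hdf5_serialized():
        return fn(*args, **kwargs)
```

This is essentially what --enable-threadsafe does internally, which is also why it caps HDF5 metadata throughput at one thread.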

madsbk avatar Oct 02 '23 14:10 madsbk

  1. Supporting writes is pretty important, so I would vote against relying on Kerchunk for the long term.

  2. I am favorable towards this one; more comments after (3)

  3. Legate specifically might be OK with single-thread-per-process (or at least serialized access from different threads within the same process), so the VFD approach doesn't need to wait on multi-threaded HDF5, at least for Legate. The reason is that we may have to switch to a rank-per-GPU default anyway (for the benefit of other libraries that just don't work under rank-per-node).

    The more fundamental problem for Legate is that we would have multiple processes trying to read/write the same HDF5 file; can the VFD approach handle that mode? On another thread you linked to https://forum.hdfgroup.org/t/parallel-read-of-a-single-hdf5-file/7960/4, which seems to suggest that the (only?) way to get safe multi-process access is to use an MPI-based VFD, and Legate is trying to move away from depending on MPI (as that throws a wrench into, e.g., the redistributability of builds).

    Implementing a Legate+Kvikio-aware VFD might be even more work than (2), but it would presumably work out-of-the-box with all HDF5 features.

    Also, you possibly have less control over how the underlying file I/O is invoked, so it might not be done in the most performant way possible (this is speculative; possibly this is not an issue, depending on what contract the VFD interface provides to the implementor).

    Note: All of the above is from the point of view of Legate; other clients might be more strict about the need for true multi-threading, and not care about including MPI.

So at this point I believe the question is, is it better to go through the "official" VFD extension interface, or only use the HDF5 API up to the point where we get access to the underlying buffers, and from that point on proceed independently. The latter would be less constrained by the main HDF5 library's quirks, and would have clearer performance characteristics, but wouldn't be as fully-featured. Which alternative requires more programming effort is unclear.

I am favorable towards (2), but I am absolutely not an expert here.

manopapad avatar Oct 02 '23 19:10 manopapad

The more fundamental problem for Legate is that we would have multiple processes trying to read/write the same HDF5 file; can the VFD approach handle that mode?

In principle, yes. The MPI backend in HDF5 is implemented as a VFD. Reading should be straightforward, but in order to support writing we would have to implement something similar to the MPIO VFD.
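The disjoint-range part of that is plain POSIX I/O: if each writer owns a non-overlapping byte range of a preallocated file, `os.pwrite` needs no coordination at all; what the MPIO VFD additionally provides is the collective metadata handling. A toy sketch of the disjoint-range part (names are illustrative; threads stand in for processes here, since `os.pwrite` works identically from either):

```python
import os
import threading

def write_range(fd, offset, payload):
    # os.pwrite positions each write explicitly, so writers need no
    # shared seek state; the same pattern works from separate processes.
    os.pwrite(fd, payload, offset)

def parallel_write(path, assignments):
    """assignments: list of (offset, bytes) pairs, one per writer,
    covering disjoint ranges of an already-allocated file."""
    fd = os.open(path, os.O_WRONLY)
    try:
        writers = [threading.Thread(target=write_range, args=(fd, off, buf))
                   for off, buf in assignments]
        for t in writers:
            t.start()
        for t in writers:
            t.join()
    finally:
        os.close(fd)
```

This only works when the offsets are known up front, which is why compressed writes (where on-disk block sizes are not known in advance) are the hard case noted under option (2).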

madsbk avatar Oct 03 '23 07:10 madsbk

So at this point I believe the question is, is it better to go through the "official" VFD extension interface, or only use the HDF5 API up to the point where we get access to the underlying buffers, and from that point on proceed independently. The latter would be less constrained by the main HDF5 library's quirks, and would have clearer performance characteristics, but wouldn't be as fully-featured.

Very well put.

Which alternative requires more programming effort is unclear.

That I can answer: option (2) is significantly less work, particularly if we want to support parallel writes to a single file (uncompressed).

madsbk avatar Oct 03 '23 07:10 madsbk