HDF5 Direct Access
As discussed in #295, we have multiple approaches to support HDF5 files. Let's look at approach (2) in more detail.
The idea is to parse the HDF5 metadata and extract contiguous data blocks, similar to how Kerchunk's HDF5 backend works, but we want to support both read and write.
Features
- Read and write access to HDF5 files.
- GDS accelerated read/write access.
- Multi-threaded read/write access.
- On-the-fly compression and decompression of KvikIO-written HDF5 files using nvCOMP.
- It is going to be hard to support compression schemes that are compatible with the ones built into HDF5. Instead, KvikIO will implement its own compression layer above the HDF5 file and store compression information as HDF5 attributes (see the sketch after this list).
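For illustration only, here is one way such compression information could be recorded as HDF5 attributes with h5py. The attribute names (`kvikio_compression`, etc.) are hypothetical placeholders, not an agreed schema:

```python
import h5py
import numpy as np

# Hypothetical attribute layout for a KvikIO compression layer on top of HDF5;
# these names are illustrative only.
with h5py.File("data.h5", "a") as f:
    dset = f["data"]  # an existing, chunked dataset (uncompressed from HDF5's point of view)
    dset.attrs["kvikio_compression"] = "lz4"             # nvCOMP algorithm used
    dset.attrs["kvikio_uncompressed_chunk_nbytes"] = 262144
    # Per-chunk compressed sizes, needed to slice the raw blocks before decompression.
    dset.attrs["kvikio_compressed_chunk_nbytes"] = np.asarray(
        [120_000, 118_500, 97_342], dtype=np.int64
    )
```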
Limitations
We are going to have the same limitations as HDF5's direct write function `H5Dwrite_chunk()`:
- No native compression (other than the KvikIO-specific compression).
- No filters.
- No datatype conversion.
- No endianness conversion.
- No user-defined functions.
- No variable length data types.
  - We might be able to support strings.
- Also, see the paper: Using the Direct Chunk Write Function.
Implementation
For the initial implementation, we do all metadata manipulation in Python using h5py. Later, to reduce overhead and make it available in C++, we can port it to C++ and use the official HDF5 library, or perhaps a higher-level library like HighFive or h5cpp.
Write HDF5 Dataset
1. Compress the data using nvCOMP (optional).
2. Use h5py to write an empty dataset using the `ALLOC_TIME_EARLY` option, which makes sure the data blocks within the HDF5 file are allocated immediately.
   - Notice, the `ALLOC_TIME_EARLY` option only works when HDF5 compression is disabled.
3. Write an HDF5 attribute that describes the compression algorithm used (if any).
4. Use h5py to parse the HDF5 metadata.
5. Translate the metadata into a set of data blocks `(file, offset, size)`.
6. Use KvikIO to write from the input buffer to the data blocks (a sketch of these steps follows the list).
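A minimal sketch of the write path, assuming a 1-D, uncompressed float32 dataset. The attribute name `kvikio_compression` is a placeholder, the low-level `h5py.h5p`/`h5py.h5d` calls are just one way to request `ALLOC_TIME_EARLY` from h5py, and the chunk offsets come from `DatasetID.get_num_chunks()`/`get_chunk_info()`, which require a reasonably recent h5py/HDF5:

```python
import h5py
import numpy as np
import cupy as cp
import kvikio

path = "data.h5"
shape, chunks, dtype = (1 << 20,), (1 << 16,), np.dtype("float32")
darr = cp.arange(shape[0], dtype=dtype)  # device buffer to be written

# Steps 2-5: create an empty, chunked dataset with early allocation, tag it with a
# (hypothetical) compression attribute, and extract the chunk byte offsets.
with h5py.File(path, "w") as f:
    dcpl = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
    dcpl.set_chunk(chunks)
    dcpl.set_alloc_time(h5py.h5d.ALLOC_TIME_EARLY)  # only valid without HDF5 compression
    space = h5py.h5s.create_simple(shape)
    h5py.h5d.create(f.id, b"data", h5py.h5t.py_create(dtype), space, dcpl)
    dset = f["data"]
    dset.attrs["kvikio_compression"] = "none"  # placeholder attribute name
    dsid = dset.id
    blocks = [dsid.get_chunk_info(i) for i in range(dsid.get_num_chunks())]
    blocks = [(b.byte_offset, b.size) for b in blocks]

# Step 6: bypass HDF5 and write the device buffer directly into the allocated blocks.
with kvikio.CuFile(path, "r+") as f:
    futures, dev_offset = [], 0
    for file_offset, size in blocks:
        futures.append(f.pwrite(darr, size, file_offset, dev_offset))
        dev_offset += size
    for fut in futures:
        fut.get()  # wait for the multi-threaded writes to finish
```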
Read HDF5 Dataset
1. Use h5py to parse the HDF5 metadata.
2. Read the HDF5 attributes to determine the decompression algorithm.
3. Translate the metadata into a set of data blocks `(file, offset, size)`.
4. Use KvikIO to read the data blocks into the output buffer (device or host memory).
5. Optionally, decompress the data on-the-fly using nvCOMP (a sketch of these steps follows the list).
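A matching sketch of the read path, under the same assumptions as the write sketch (1-D, uncompressed dataset, placeholder attribute name):

```python
import h5py
import cupy as cp
import kvikio

path = "data.h5"

# Steps 1-3: parse the HDF5 metadata and translate it into (offset, size) blocks.
with h5py.File(path, "r") as f:
    dset = f["data"]
    algo = dset.attrs.get("kvikio_compression", "none")  # placeholder attribute name
    out = cp.empty(dset.shape, dtype=dset.dtype)          # device output buffer
    dsid = dset.id
    blocks = [dsid.get_chunk_info(i) for i in range(dsid.get_num_chunks())]
    blocks = [(b.byte_offset, b.size) for b in blocks]

# Step 4: read the raw blocks straight into device memory, one async request per block.
with kvikio.CuFile(path, "r") as f:
    futures, dev_offset = [], 0
    for file_offset, size in blocks:
        futures.append(f.pread(out, size, file_offset, dev_offset))
        dev_offset += size
    for fut in futures:
        fut.get()

# Step 5 (not shown): if `algo` is not "none", decompress the blocks with nvCOMP.
```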
> Instead, KvikIO will implement its own compression layer above the HDF5 file and store compression information as HDF5 attributes.
Would this be a "custom" extension? I.e. a third-party app that receives a kvikio-compressed HDF5 file wouldn't know what the attributes mean, therefore it wouldn't know that the file is compressed in some way?
> No native compression (other than the KvikIO-specific compression).
Reading the paper you linked, it appears that when using `H5Dwrite_chunk` you can still record in the HDF5 metadata that each chunk is compressed, and it's just your responsibility to make sure the data you write is already compressed (and you must use gzip). In that case any consumer of the file would know how the data is compressed.
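For reference, h5py exposes `H5Dwrite_chunk` as `DatasetID.write_direct_chunk`, so the approach described above looks roughly like this (a sketch of the commenter's point, not KvikIO's plan):

```python
import zlib
import h5py
import numpy as np

chunk = np.arange(1024, dtype="float32")

with h5py.File("gzip-chunks.h5", "w") as f:
    # compression="gzip" records the deflate filter in the dataset's metadata,
    # so any standard HDF5 reader knows how each chunk is encoded.
    dset = f.create_dataset(
        "data", shape=(1024,), chunks=(1024,), dtype="float32", compression="gzip"
    )
    # H5Dwrite_chunk: hand HDF5 a chunk that we compressed ourselves.
    dset.id.write_direct_chunk((0,), zlib.compress(chunk.tobytes()))

# A plain HDF5/h5py consumer reads it back transparently.
with h5py.File("gzip-chunks.h5", "r") as f:
    assert np.array_equal(f["data"][:], chunk)
```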
> No filters. No datatype conversion. No endianness conversion. No user-defined functions.
In Legate land we can live without these for now. We can hope that in the future the Legate core will be able to detect that a data-parallel load can be fused with an element-wise conversion that comes after it, and thus get the same performance as if the I/O library provided these transformations internally.
> No variable length data types. We might be able to support strings.
As far as in-memory representation goes, Legate would store, say, a string array using two stores, one for the character data and one for the offsets. The partitioning between the two would be consistent (e.g. all the offsets within chunk 5 of the "offsets" store would point to characters within chunk 5 of the "characters" store). Saving this directly as two datasets to a chunked HDF5 file sounds problematic, because it is quite unlikely that all the chunks on each store would have the same size. Perhaps padding could be added to make all sizes uniform.
I couldn't easily find out how HDF5 handles chunked variable-length arrays internally.
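For concreteness, the two-store representation being described is roughly the following (this is just the general Arrow-style layout, not Legate's actual format):

```python
import numpy as np

strings = ["kvikio", "hdf5", "gds"]

# "characters" store: all bytes concatenated; "offsets" store: string boundaries.
chars = np.frombuffer("".join(strings).encode(), dtype=np.uint8)
offsets = np.cumsum([0] + [len(s) for s in strings]).astype(np.int64)

# String i is reconstructed by slicing the character store with the offsets store.
i = 1
print(bytes(chars[offsets[i]:offsets[i + 1]]).decode())  # -> "hdf5"
```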
> Would this be a "custom" extension? I.e. a third-party app that receives a kvikio-compressed HDF5 file wouldn't know what the attributes mean, therefore it wouldn't know that the file is compressed in some way?
Yes, a third-party app wouldn't be able to read the kvikio-compressed HDF5 file.
> Reading the paper you linked, it appears that when using `H5Dwrite_chunk` you can still record in the HDF5 metadata that each chunk is compressed, and it's just your responsibility to make sure the data you write is already compressed (and you must use gzip). In that case any consumer of the file would know how the data is compressed.
Right, using `H5Dwrite_chunk` would work, but we wouldn't get the performance of KvikIO/GDS since HDF5 itself would do the writing. Also, `H5Dwrite_chunk` isn't thread-safe.
> Saving this directly as two datasets to a chunked HDF5 file sounds problematic, because it is quite unlikely that all the chunks on each store would have the same size. Perhaps padding could be added to make all sizes uniform.
Agree, but notice that we are not bound by the chunks in HDF5. E.g., Legate tasks can read multiple chunks or even partial chunks. The advantage of extracting all data block offsets beforehand is that we can access the data blocks in any way we like, including changing the decomposition on-the-fly (see the sketch below).
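A tiny sketch of what "not bound by the HDF5 chunks" means once the `(offset, size)` blocks are known. `ranges_for_slice` is a hypothetical helper assuming a 1-D, uncompressed, fixed-chunk dataset, and the block offsets below are made up:

```python
def ranges_for_slice(blocks, itemsize, chunk_elems, start, stop):
    """Map a logical element range [start, stop) onto file byte ranges, given
    per-chunk (file_offset, size) blocks of an uncompressed 1-D dataset.
    Partial chunks simply become partial byte ranges."""
    ranges = []
    for idx, (file_offset, size) in enumerate(blocks):
        lo = idx * chunk_elems           # first element stored in this chunk
        hi = lo + size // itemsize       # one past the last element
        s, e = max(start, lo), min(stop, hi)
        if s < e:
            ranges.append((file_offset + (s - lo) * itemsize, (e - s) * itemsize))
    return ranges

# A task that wants elements [100_000, 150_000) of a float32 dataset chunked
# every 65_536 elements ends up reading two partial chunks:
blocks = [(4096, 262144), (266240, 262144), (528384, 262144)]  # made-up offsets
print(ranges_for_slice(blocks, 4, 65536, 100_000, 150_000))
# -> [(404096, 124288), (528384, 75712)]
```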
> Right, using `H5Dwrite_chunk` would work, but we wouldn't get the performance of KvikIO/GDS since HDF5 itself would do the writing. Also, `H5Dwrite_chunk` isn't thread-safe.
Is it possible to preallocate a chunked dataset, then query HDF5 for the file name, file offset and extent corresponding to each chunk? If we can do that, then we could potentially record in the metadata that each chunk will be zlib-compressed, then copy each chunk in its entirety (already compressed) from the framebuffer to disk using GDS (without having to go through `H5Dwrite_chunk`).
> Is it possible to preallocate a chunked dataset, then query HDF5 for the file name, file offset and extent corresponding to each chunk?
Yes, this is exactly what I mean in step 5: Translate the metadata into a set of data blocks `(file, offset, size)`.
> If we can do that, then we could potentially record in the metadata that each chunk will be zlib-compressed, then copy each chunk in its entirety (already compressed) from the framebuffer to disk using GDS (without having to go through `H5Dwrite_chunk`).
Good point. However, modifying the metadata might be tricky and hard to maintain, but it is definitely a possibility!
> Yes, this is exactly what I mean in step 5: Translate the metadata into a set of data blocks `(file, offset, size)`.
Yup, my bad, I didn't read that carefully.
> Good point. However, modifying the metadata might be tricky and hard to maintain, but it is definitely a possibility!
Agreed, that is definitely a risk, but it also gives us the best chance of interoperating with downstream/upstream apps that may be reading/writing their HDF5 files outside of KvikIO.
Correct me if I am wrong: step 1 (for READ) and step 4 (for WRITE) would essentially be single-threaded, and once the metadata is accessed, we can perform multi-threaded I/O through KvikIO/GDS. What I am missing here is: when multiple processes want to access the same file, how will the HDF5 metadata be synchronized among them to give a consistent view?
We can ask Legate to run only one copy of the task which processes the HDF5 metadata (i.e. not replicated across the cluster), then broadcast the results to the other processes. Then all the processes can do the actual reads and writes in parallel.
Depending upon the outcome of the actual I/O, especially in case of an error, we may once again need to consolidate the metadata, right?
That's a good point. We could do a similar singleton task launch that updates the metadata based on the status reports of the parallel readers/writers, but I guess that depends on what error cases the workers might encounter.
There are a few clarifications that may help:
- HDF5 is thread safe, but it doesn't have multi-threaded concurrency (so only one thread can currently be in HDF5 at a time).
- I advise against trying to bypass the HDF5 library for reading or writing data to the file, if possible. The library has some constraints on data placement and layout and it's unlikely that KvikIO wants to be responsible for those.
- I know how variable-length data / strings are stored. :-) It's probably too detailed to describe here, but I can talk about it in person.
> It is going to be hard to support compression schemes that are compatible with the ones built into HDF5.
This is not necessarily true. It would be if we used the nvCOMP high-level API, which would be the most natural fit for custom HDF5 filters. But if you can use the nvCOMP low-level API, its outputs are fully compatible with the standard stream formats (including gzip). This is harder to integrate into HDF5, though, since it requires batched decompression. @qkoziol mentioned that this might be possible though :)