
Nested HDF5 Data / HEC-RAS

thwllms opened this issue 6 months ago • 4 comments

I'm developing the rashdf library for reading HEC-RAS HDF5 data. A major motivation for the library is stochastic hydrologic/hydraulic modeling.

We want to be able to generate Zarr metadata for stochastic HEC-RAS outputs, so that results from many different stochastic flood simulations of a given RAS model can be opened as a single xarray Dataset. For example, results from 100 simulations could be concatenated along a new simulation dimension, with the index number of each simulation as the coordinate values. It took me a little while to figure out how to make that happen, because RAS HDF5 data is highly nested and doesn't conform to typical conventions.
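To illustrate the end goal (not the kerchunk machinery itself), here is a minimal sketch of that concatenation in plain xarray, with made-up variable and dimension names:

```python
import numpy as np
import xarray as xr

# Stand-ins for per-simulation result datasets (e.g. one per stochastic run).
sims = [
    xr.Dataset({"wse": ("cell", np.arange(10.0) + i)}) for i in range(3)
]

# Concatenate along a new "simulation" dimension and use the simulation
# index number as the coordinate, as described above.
combined = xr.concat(sims, dim="simulation")
combined = combined.assign_coords(simulation=np.arange(len(sims)))
```

The point of generating Zarr reference metadata is to get this same combined view lazily, without copying the underlying HDF5 chunks.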

The way I implemented it is hacky:

  1. Given an xr.Dataset pulled from the HDF file and the path of each child xr.DataArray within the HDF file,
  2. Get the filters for each DataArray: filters = SingleHdf5ToZarr._decode_filters(None, hdf_ds)
  3. Get the storage info for each DataArray: storage_info = SingleHdf5ToZarr._storage_info(None, hdf_ds)
  4. Build out metadata for chunks using storage_info
  5. "Write" the xr.Dataset to a zarr.MemoryStore with compute=False, to generate the framework of what's needed for the Zarr metadata
  6. Read and decode the objects generated by writing to the zarr.MemoryStore
  7. Assemble the zarr.MemoryStore objects, filters, and storage_info into a dictionary and return it

I suppose my questions are:

  • Is there a better way to approach highly nested or otherwise idiosyncratic HDF5 data with Kerchunk?
  • Could Kerchunk's SingleHdf5ToZarr._decode_filters and _storage_info methods be made public?

thwllms · Aug 08 '24 16:08