Readers should raise an error for HDF files using the compact storage layout.

Open sharkinsspatial opened this issue 9 months ago • 3 comments

While investigating possible HDF storage scenarios for scalar values from https://github.com/zarr-developers/VirtualiZarr/pull/523, I discovered that HDF also supports a "compact" storage layout, where extremely small datasets or values (<64KB) are inlined into the dataset's object header (see https://support.hdfgroup.org/documentation/hdf5/latest/_l_b_dset_layout.html). The HDF5 library has no support for inferring the offset and size of datasets stored using the compact layout, so we have no way of creating a ChunkManifest for them and should raise an "unsupported" exception.

  • [ ] Create a test fixture with a scalar stored in the compact storage layout using the low-level h5py.h5d API.
  • [ ] Update HDFVirtualBackend to check the dataset's storage layout.
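
A rough sketch of what the two tasks above might look like with h5py. The helper names `write_compact_scalar` and `is_compact` are hypothetical (they are not part of VirtualiZarr), and the fixture assumes an int32 scalar is representative enough for a test:

```python
import h5py


def write_compact_scalar(path, name, value):
    """Write one int32 scalar dataset that uses the compact storage layout."""
    with h5py.File(path, "w") as f:
        space = h5py.h5s.create(h5py.h5s.SCALAR)
        dcpl = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
        # Request the compact layout via the dataset creation property list.
        dcpl.set_layout(h5py.h5d.COMPACT)
        dsid = h5py.h5d.create(
            f.id, name.encode(), h5py.h5t.NATIVE_INT32, space, dcpl=dcpl
        )
        # Wrap the low-level id so the value can be written with the
        # high-level API.
        h5py.Dataset(dsid)[()] = value


def is_compact(dataset):
    """Return True if an h5py.Dataset uses the compact storage layout."""
    return dataset.id.get_create_plist().get_layout() == h5py.h5d.COMPACT
```

In a test this could be used as `write_compact_scalar(tmp_path / "compact.h5", "scalar", 42)`, followed by opening the file and asserting `is_compact(f["scalar"])`.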

sharkinsspatial avatar Apr 02 '25 23:04 sharkinsspatial

Hmm, this is potentially an issue with the whole "readers as creators of ManifestStores" idea. We can put this inlined data into a virtual dataset and into Icechunk; it just can't be a virtual variable. (Or at least the HDF library won't help us if we want to make that virtual variable.)

If reader implementations had the ability to say "nah actually you're getting this variable in memory" then we could deal with this situation gracefully.

A compromise might be to have the error message suggest explicitly loading that particular variable.
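
Something like the following could work as a starting point. It is only a sketch: `raise_if_compact` is a hypothetical helper, the choice of `NotImplementedError` is an assumption, and the message assumes the user would pass the variable via `loadable_variables` when opening the file:

```python
import h5py


def raise_if_compact(dataset):
    """Raise if an h5py.Dataset uses the compact layout, suggesting a workaround."""
    layout = dataset.id.get_create_plist().get_layout()
    if layout == h5py.h5d.COMPACT:
        # dataset.name is the absolute HDF5 path, e.g. "/scalar".
        var = dataset.name.lstrip("/")
        raise NotImplementedError(
            f"Variable {var!r} uses the HDF5 compact storage layout, so no "
            "byte offset and length are available to build a ChunkManifest. "
            f"Consider loading it instead, e.g. loadable_variables=[{var!r}]."
        )
```

Contiguous and chunked datasets pass through without error, so the check could run once per dataset before any manifest is built.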

TomNicholas avatar Apr 03 '25 02:04 TomNicholas

@TomNicholas I think including the suggestion to load the problematic variables is probably the way to go 👍. I'm hopeful that this will be a fairly infrequent case.

sharkinsspatial avatar Apr 03 '25 20:04 sharkinsspatial

BTW a similar thing would happen if you try to load kerchunk references that are inlined into the kerchunk reference file. But in that case it's easier to generate a reference to the data (which lives in the kerchunk json file).
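
For illustration, here is a minimal sketch of telling the two kinds of kerchunk reference apart. It assumes the kerchunk reference convention as I understand it: an inlined chunk is a string (with a `base64:` prefix for binary data), while a reference to external bytes is a list such as `[url, offset, length]`. The function name `classify_kerchunk_ref` is hypothetical:

```python
import base64


def classify_kerchunk_ref(ref):
    """Return ("inlined", bytes) or ("byte_range", tuple) for one reference entry."""
    if isinstance(ref, str):
        # Inlined data lives in the reference file itself.
        if ref.startswith("base64:"):
            data = base64.b64decode(ref[len("base64:"):])
        else:
            data = ref.encode()
        return "inlined", data
    # Otherwise the entry points at bytes in some other file.
    return "byte_range", tuple(ref)
```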

TomNicholas avatar Apr 03 '25 20:04 TomNicholas