[Feature]: Add support for overriding backend configuration in HDF5 datasets
What would you like to see added to HDMF?
I am working on a new helper feature for neuroconv that lets users repack an NWB file with a new backend configuration (https://github.com/catalystneuro/neuroconv/pull/1003). However, when I try to export the NWB file with the new backend configuration, I get a UserWarning and the new configuration is ignored:
```
/opt/anaconda3/envs/neuroconv_tdtfp_env/lib/python3.12/site-packages/hdmf/utils.py:668: UserWarning: chunks in H5DataIO will be ignored with H5DataIO.data being an HDF5 dataset
```
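For context, a minimal sketch of how the warning arises; the file name and dataset name are illustrative, not taken from the actual neuroconv helper:

```python
from hdmf.backends.hdf5 import H5DataIO
from pynwb import NWBHDF5IO

with NWBHDF5IO("source.nwb", mode="r") as read_io:       # hypothetical file name
    nwbfile = read_io.read()
    data = nwbfile.acquisition["my_timeseries"].data      # hypothetical dataset; an h5py.Dataset
    # Re-wrapping a live h5py.Dataset with new chunking/compression settings
    # triggers the UserWarning above, and the new settings are ignored when
    # the file is written.
    wrapped = H5DataIO(data=data, chunks=(1000,), compression="gzip")
```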
What solution would you like?
I was able to solve this problem by simply converting the HDF5 dataset to a numpy array like so:
```python
# hdmf.container.Container
import numpy as np
from typing import Type
from warnings import warn

from hdmf.data_utils import DataIO


def set_data_io(self, dataset_name: str, data_io_class: Type[DataIO], data_io_kwargs: dict = None, **kwargs):
    """
    Apply DataIO object to a dataset field of the Container.

    Parameters
    ----------
    dataset_name: str
        Name of dataset to wrap in DataIO
    data_io_class: Type[DataIO]
        Class to use for DataIO, e.g. H5DataIO or ZarrDataIO
    data_io_kwargs: dict
        keyword arguments passed to the constructor of the DataIO class.
    **kwargs:
        DEPRECATED. Use data_io_kwargs instead.
        kwargs are passed to the constructor of the DataIO class.
    """
    if kwargs or (data_io_kwargs is None):
        warn(
            "Use of **kwargs in Container.set_data_io() is deprecated. Please pass the DataIO kwargs as a "
            "dictionary to the `data_io_kwargs` parameter instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        data_io_kwargs = kwargs
    data = self.fields.get(dataset_name)
    if data is None:
        raise ValueError(f"{dataset_name} is None and cannot be wrapped in a DataIO class")
    data = np.array(data)  # <-- added line: load the HDF5 dataset into memory so the new DataIO settings apply
    self.fields[dataset_name] = data_io_class(data=data, **data_io_kwargs)
```
I would appreciate some kind of alternative `set_data_io()` function that supports overwriting HDF5 datasets in this manner (or something similar).
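For reference, a hedged usage sketch of how the patched `set_data_io` would be called; the container and dataset names are made up for illustration:

```python
from hdmf.backends.hdf5 import H5DataIO

# nwbfile was read from an existing file, so ts.data is an h5py.Dataset.
ts = nwbfile.acquisition["my_timeseries"]  # hypothetical container name
ts.set_data_io(
    "data",  # name of the field to re-wrap
    H5DataIO,
    data_io_kwargs={"chunks": (1000,), "compression": "gzip"},
)
# With the np.array() conversion above, the data is materialized in memory and
# the new chunking/compression settings are honored on export.
```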
Do you have any interest in helping implement the feature?
Yes.
To copy datasets on export, the HDF5IO backend uses the `copy` method from h5py:
https://github.com/hdmf-dev/hdmf/blob/49a60df21016fb4da83d64e84e7bc99dac03e94e/src/hdmf/backends/hdf5/h5tools.py#L1179-L1191
which does not support changing chunking, compression, etc. Converting to `np.array` is not ideal because it loads the entire dataset into memory, which is problematic for large arrays. Instead, to avoid loading all the data at once, you could wrap the dataset with some variant of `AbstractDataChunkIterator` so that the data is loaded and written in larger blocks rather than all at once. However, if the chunk shapes of the source dataset A and the target dataset B are not well aligned, then copying the data iteratively can become quite expensive.
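As an illustration of the iterator-based approach, here is a rough sketch of a `GenericDataChunkIterator` subclass that streams an existing h5py dataset in blocks; the class name and its `data` parameter are hypothetical, not part of hdmf:

```python
import h5py
from hdmf.data_utils import GenericDataChunkIterator


class H5DatasetChunkIterator(GenericDataChunkIterator):
    """Hypothetical helper: iterate over an existing h5py.Dataset in blocks."""

    def __init__(self, data: h5py.Dataset, **kwargs):
        # The dataset must be set before super().__init__, which queries shape/dtype.
        self._dataset = data
        super().__init__(**kwargs)  # kwargs such as buffer_gb or chunk_shape

    def _get_data(self, selection):
        # Read only the requested block from disk.
        return self._dataset[selection]

    def _get_maxshape(self):
        return self._dataset.shape

    def _get_dtype(self):
        return self._dataset.dtype
```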
A possible option may be to modify `set_data_io` to support wrapping with both `DataIO` and an `AbstractDataChunkIterator` (a rough sketch follows the list), i.e.:
- Add `dci_cls: Type[AbstractDataChunkIterator] = hdmf.data_utils.GenericDataChunkIterator` as a parameter so that a user can specify what type of iterator to use, defaulting to the `GenericDataChunkIterator`, which would be a good default
- Add `dci_kwargs: dict = None` so that a user can optionally provide the parameters for the `GenericDataChunkIterator`
- If `dci_kwargs is not None`, then wrap the dataset first with the `DataChunkIterator` before wrapping it with `DataIO`
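A rough sketch of what that could look like; the `dci_cls`/`dci_kwargs` parameters are the proposal above, not an existing hdmf API, and the assumption that the iterator class accepts the dataset via `data=` is an illustration only:

```python
from typing import Type

from hdmf.data_utils import AbstractDataChunkIterator, DataIO, GenericDataChunkIterator


def set_data_io(
    self,
    dataset_name: str,
    data_io_class: Type[DataIO],
    data_io_kwargs: dict = None,
    dci_cls: Type[AbstractDataChunkIterator] = GenericDataChunkIterator,
    dci_kwargs: dict = None,
):
    """Sketch: optionally wrap the dataset in a chunk iterator before applying DataIO."""
    data = self.fields.get(dataset_name)
    if data is None:
        raise ValueError(f"{dataset_name} is None and cannot be wrapped in a DataIO class")
    if dci_kwargs is not None:
        # Assumes the concrete iterator class accepts the dataset as `data=`;
        # a concrete subclass of GenericDataChunkIterator would be needed here,
        # e.g. the H5DatasetChunkIterator sketched above.
        data = dci_cls(data=data, **dci_kwargs)
    self.fields[dataset_name] = data_io_class(data=data, **(data_io_kwargs or {}))
```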
@pauladkisson would you want to take a stab at making a PR for this?
@oruebel, thanks for the detailed explanation! I figured the np.array solution wouldn't be ideal, but I wasn't totally sure why.
> would you want to take a stab at making a PR for this?
Yeah, I can give it a go.