
[Feature]: Add support for overriding backend configuration in HDF5 datasets

Open pauladkisson opened this issue 1 year ago • 3 comments

What would you like to see added to HDMF?

I am working on a new helper feature for neuroconv that lets users repack an NWB file with a new backend configuration (https://github.com/catalystneuro/neuroconv/pull/1003). However, when I try to export the NWB file with the new backend configuration, I get a UserWarning and the new configuration is ignored:

/opt/anaconda3/envs/neuroconv_tdtfp_env/lib/python3.12/site-packages/hdmf/utils.py:668: UserWarning: chunks in H5DataIO will be ignored with H5DataIO.data being an HDF5 dataset
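
For context, a rough sketch of the pattern that triggers this (the file name and dataset path are hypothetical): the data handed to H5DataIO is already an h5py.Dataset, so the requested chunking/compression settings are dropped.

  import h5py
  from hdmf.backends.hdf5 import H5DataIO

  src = h5py.File("source.nwb", "r")  # hypothetical source file
  dset = src["acquisition/TimeSeries/data"]  # an existing h5py.Dataset

  # Because `data` is already an HDF5 dataset, the chunks/compression here
  # are ignored and the UserWarning above is emitted.
  wrapped = H5DataIO(data=dset, chunks=(1000,), compression="gzip")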

What solution would you like?

I was able to solve this problem by converting the HDF5 dataset to a NumPy array, like so:

  # hdmf.container.Container
  def set_data_io(self, dataset_name: str, data_io_class: Type[DataIO], data_io_kwargs: dict = None, **kwargs):
          """
          Apply DataIO object to a dataset field of the Container.
  
          Parameters
          ----------
          dataset_name: str
              Name of dataset to wrap in DataIO
          data_io_class: Type[DataIO]
              Class to use for DataIO, e.g. H5DataIO or ZarrDataIO
          data_io_kwargs: dict
              keyword arguments passed to the constructor of the DataIO class.
          **kwargs:
              DEPRECATED. Use data_io_kwargs instead.
              kwargs are passed to the constructor of the DataIO class.
          """
          if kwargs or (data_io_kwargs is None):
              warn(
                  "Use of **kwargs in Container.set_data_io() is deprecated. Please pass the DataIO kwargs as a "
                  "dictionary to the `data_io_kwargs` parameter instead.",
                  DeprecationWarning,
                  stacklevel=2
              )
              data_io_kwargs = kwargs
          data = self.fields.get(dataset_name)
          if data is None:
              raise ValueError(f"{dataset_name} is None and cannot be wrapped in a DataIO class")
+         data = np.array(data)  # requires `import numpy as np`; loads the HDF5 dataset into memory so the new backend settings are applied
          self.fields[dataset_name] = data_io_class(data=data, **data_io_kwargs)
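
For illustration, the patched method would then be called like this (the container variable and the kwargs are hypothetical):

  from hdmf.backends.hdf5 import H5DataIO

  # `time_series` stands in for any hdmf Container with a `data` field.
  time_series.set_data_io(
      dataset_name="data",
      data_io_class=H5DataIO,
      data_io_kwargs=dict(chunks=(1000,), compression="gzip"),
  )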

I would appreciate some kind of alternative set_data_io() function that supports overwriting HDF5 datasets in this manner (or something similar).

Do you have any interest in helping implement the feature?

Yes.

pauladkisson commented Aug 14 '24 21:08

To copy datasets on export, the HDF5IO backend uses the copy method from h5py:

https://github.com/hdmf-dev/hdmf/blob/49a60df21016fb4da83d64e84e7bc99dac03e94e/src/hdmf/backends/hdf5/h5tools.py#L1179-L1191

which does not support changing chunking, compression, etc. Converting to np.array is not ideal because it loads the entire dataset into memory, which is problematic for large arrays. Instead, to avoid loading all the data at once, you could wrap the dataset with some variant of AbstractDataChunkIterator so that the data is loaded and written in larger blocks (instead of all at once). However, if the shapes of the chunks in the source dataset A and the target dataset B are not well aligned, then copying the data iteratively can become quite expensive. A sketch of such a wrapper follows.
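
A minimal sketch of such an iterator, assuming hdmf's GenericDataChunkIterator base class (the subclass name H5DatasetDataChunkIterator is hypothetical):

  import h5py
  from hdmf.data_utils import GenericDataChunkIterator

  class H5DatasetDataChunkIterator(GenericDataChunkIterator):
      """Stream an existing h5py.Dataset in buffered blocks instead of loading it whole."""

      def __init__(self, dataset: h5py.Dataset, **kwargs):
          self._dataset = dataset
          super().__init__(**kwargs)

      def _get_data(self, selection):
          # Read only the requested block from the source dataset.
          return self._dataset[selection]

      def _get_maxshape(self):
          return self._dataset.shape

      def _get_dtype(self):
          return self._dataset.dtype

The block sizes used for reading and writing can then be tuned via GenericDataChunkIterator's buffer/chunk keyword arguments.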

A possible option may be to modify set_data_io to support both wrapping with DataIO and with an AbstractDataChunkIterator (see the sketch after this list), i.e.:

  1. Add dci_cls: Type[AbstractDataChunkIterator] = hdmf.data_utils.GenericDataChunkIterator as a parameter so that a user can specify which type of iterator to use; GenericDataChunkIterator would be a good default.
  2. Add dci_kwargs: dict = None so that a user can optionally provide the parameters for the iterator.
  3. If dci_kwargs is not None, then wrap the dataset with the DataChunkIterator first, before wrapping it with DataIO.
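
A rough sketch of how set_data_io could look with these parameters (this follows the suggestion above and is not an existing hdmf API; it also assumes the concrete iterator class accepts the dataset as its first positional argument, as in the sketch in the previous comment):

  from typing import Type

  from hdmf.data_utils import AbstractDataChunkIterator, DataIO, GenericDataChunkIterator

  def set_data_io(
      self,
      dataset_name: str,
      data_io_class: Type[DataIO],
      data_io_kwargs: dict = None,
      dci_cls: Type[AbstractDataChunkIterator] = GenericDataChunkIterator,
      dci_kwargs: dict = None,
  ):
      """Sketch: optionally wrap a dataset field in a DataChunkIterator, then in DataIO."""
      data = self.fields.get(dataset_name)
      if data is None:
          raise ValueError(f"{dataset_name} is None and cannot be wrapped in a DataIO class")
      if dci_kwargs is not None:
          # Wrap with the iterator first so the data is streamed in blocks
          # rather than loaded into memory all at once. In practice a concrete
          # subclass (e.g., the sketch in the previous comment) would be passed
          # as dci_cls, since GenericDataChunkIterator itself is abstract.
          data = dci_cls(data, **dci_kwargs)
      self.fields[dataset_name] = data_io_class(data=data, **(data_io_kwargs or {}))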

oruebel commented Aug 15 '24 04:08

@pauladkisson would you want to take a stab at making a PR for this?

oruebel commented Aug 15 '24 04:08

@oruebel, thanks for the detailed explanation! I figured the np.array solution wouldn't be ideal, but I wasn't totally sure why.

> would you want to take a stab at making a PR for this?

Yeah, I can give it a go.

pauladkisson commented Aug 15 '24 16:08