kedro-plugins icon indicating copy to clipboard operation
kedro-plugins copied to clipboard

feat(datasets): `NetCDFDataset` support for `engine="h5netcdf"` [#620]

Open charlesbmi opened this issue 1 year ago • 6 comments

Description

https://github.com/kedro-org/kedro-plugins/issues/620

NetCDFDataset currently only supports engine="scipy", which uses an old version of NetCDF (NETCDF3). This is because the original implementation writes the data to a byte-buffer, for which xarray only supports engine="scipy".

This PR enables engine="h5netcdf" and engine="netcdf4", which support the more recent NETCDF4 HDF5 format.

Development notes

  • Based on suggestion by @astrojuanlu, I save to a temporary file and copy to fsspec when using a remote file-system.
  • Added pytest parametrize to use engine="h5netcdf" and engine="netcdf4"

Checklist

  • [x] Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • [x] Updated the documentation to reflect the code changes
  • [x] Added a description of this change in the relevant RELEASE.md file
  • [x] Added tests to cover my changes

The unit tests seem to be failing potentially for a separate reason, although I'm not quite sure. Does anyone have insight into this?

charlesbmi avatar Apr 01 '24 04:04 charlesbmi

Hi @charlesbmi, thanks for opening this PR! I see it's still in draft, do you need any help to get it ready for review?

merelcht avatar May 15 '24 16:05 merelcht

@merelcht thanks for checking in! Yes, help would be great.

I am not sure how to fix the doctest error:

_________ [doctest] kedro_datasets.netcdf.netcdf_dataset.NetCDFDataset _________
051         >>> from kedro_datasets.netcdf import NetCDFDataset
052         >>> import xarray as xr
053         >>> ds = xr.DataArray(
054         ...     [0, 1, 2], dims=["x"], coords={"x": [0, 1, 2]}, name="data"
055         ... ).to_dataset()
056         >>> dataset = NetCDFDataset(
057         ...     filepath="path/to/folder",
058         ...     save_args={"mode": "w"},
059         ... )
060         >>> dataset.save(ds)
UNEXPECTED EXCEPTION: DatasetError("Failed while saving data to data set NetCDFDataset(filepath=path/to/folder, load_args={}, protocol=file, save_args={'mode': w}).\n[Errno 13] Permission denied: '/home/runner/work/kedro-plugins/kedro-plugins/kedro-datasets/path/to/folder'")

It seems slightly separate from the code I've updated, and I'm not sure whether the runner is supposed to have permissions to add/modify random folders (i.e., presumably it is because path/to/ doesn't exist, and it is trying to create a path/to/folder?)

Would you be able to help me figure out how to resolve the test issue? Thanks!

charlesbmi avatar May 15 '24 20:05 charlesbmi

@charlesbmi I'll have a look 🙂

merelcht avatar May 16 '24 11:05 merelcht

@charlesbmi the doctest issue is now solved!

merelcht avatar May 16 '24 11:05 merelcht

Thank you @merelcht ! I'll fix up my lint/test errors, and then I'll update the documentation / RELEASE.md this week

charlesbmi avatar May 16 '24 18:05 charlesbmi

Fixed the rest of the tests and updated RELEASE.md, thanks for your help @merelcht !

charlesbmi avatar May 17 '24 17:05 charlesbmi

Thanks! Updated, and tests are passing now 🎉 I appreciate your help

charlesbmi avatar May 20 '24 16:05 charlesbmi

Thanks! Updated, and tests are passing now 🎉 I appreciate your help

You're welcome and thank you for your contribution! 😄 I've asked some other team members to review. We'll merge when we have 2 approvals.

merelcht avatar May 20 '24 16:05 merelcht