Group Backend Keyword Arguments
- [x] Closes #10377
- [ ] Tests added
- [ ] User visible changes (including notable bug fixes) are documented in `whats-new.rst`
- [ ] New functions/methods are listed in `api.rst`
This is a first attempt and a basis for discussion.
This PR does the following:
- split the `open_dataset` kwargs into four groups. Here I followed @shoyer's suggestion to use dataclasses: https://github.com/pydata/xarray/issues/4490 (a rough sketch of the dataclasses follows this list)
  - `coder_opts`: options for the CF coders (e.g. `mask_and_scale`, `decode_times`)
  - `open_opts`: options for the backend file opener (e.g. `driver`, `clobber`, `diskless`, `format`)
  - `backend_opts`: options for xarray (e.g. `chunks`, `cache`, `inline_array`)
  - `store_opts`: options for the backend store (e.g. `group`, `lock`, `autoclose`)
- define these classes in `BackendEntrypoint` and override them in the subclasses, for now only for the netcdf4/h5netcdf backends
- implement the logic in `open_dataset`
- implement the logic in `to_netcdf`
- for backwards compatibility, reinitialize the above options with the given kwargs as needed (see the shim sketch after the usage example)
Example usage:

```python
# simple call, use backend default options
ds = xr.open_dataset("test.nc", engine="netcdf4")

# define once, use many times; these should be imported from the backend
open_opts = NetCDF4OpenOptions(auto_complex=True)
coder_opts = NetCDF4CoderOptions(decode_times=False, mask_and_scale=False)
backend_opts = XarrayBackendOptions(chunks={"time": 10})
store_opts = NetCDF4StoreOptions(group="test")

# engine could also be the `BackendEntrypoint`
ds = xr.open_dataset(
    "test.nc",
    engine="netcdf4",
    open_opts=open_opts,
    coder_opts=coder_opts,
    backend_opts=backend_opts,
    store_opts=store_opts,
)
```
CONS:
- Most users might not need these added options at all, but could fall back to the current behaviour
- Users might complain about the additional complexity for setting up the dataclasses
- tbc.
PROS:
- strict separation of kwargs/options
- easy forwarding
- per backend kwargs/options
- easy adding kwargs/options
- tbc.
What this PR still needs to do:
- implement everything above for the other built-in backends (zarr, scipy, pydap, etc.)
I have follow-up ideas:
- implement `save_dataset` in `BackendEntrypoint` to write to the engine's native format, like `to_netcdf` would be for scipy/netcdf4/h5netcdf and `to_zarr` would be for zarr. With that we could do the writing with a unified API, something like the following (a rough sketch of the entrypoint hook follows this list):

  ```python
  ds = xr.open_dataset("test.nc", engine="netcdf4")

  # Dataset API
  ds.save_dataset("test.zarr", engine="zarr")
  ds.save_dataset("test2.nc", engine="netcdf4")

  # general API
  xr.save_dataset(ds, "test2.nc", engine="netcdf4")

  ds.save_dataset("test.grib", engine="grib")  # my imagination
  ds.save_dataset("test.hdf5", engine="hdf5")  # my imagination
  ```

- further disentangle the current built-in backends from xarray so that they could be their own module
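Roughly, such a hook could look like this (the method name and signature are open for discussion, not a final API):

```python
from xarray.backends import BackendEntrypoint

class NetCDF4BackendEntrypoint(BackendEntrypoint):
    # Sketch of a possible write hook; name and signature are not final.
    def save_dataset(self, dataset, filename_or_obj, *,
                     open_opts=None, store_opts=None):
        # For netcdf4 this would delegate to the existing to_netcdf
        # machinery; the zarr backend would delegate to to_zarr, etc.
        ...
```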
I'm sure I have not taken into account all the possible pitfalls/problems which might arise here. I'd appreciate any comments and suggestions.
Please have a look at #10429, where I've split out the grouping of the CF coder related kwargs.
To summarize what I argued for at the end of the meeting today: I think we should slowly transition to an API where we pass the entire decoding chain into `xr.open_dataset` as a sequence of functions / callable objects that would be executed in the order they were passed. Additionally, backends should have the option to disable certain built-in coders (this is especially important when encoding).
This would require a lot of thought to figure out a good API, and even more to find a good way to transition towards that. I think this would make extending the coders a lot easier, and possibly pave the way towards dataset coders (or rather, multi-variable coders).
I think it might be possible to change the dataclass added in this PR to act as a bridge towards the idea in https://github.com/pydata/xarray/issues/4490#issuecomment-2299325353 (which should probably be extended to allow other libraries / backends to modify that chain).
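As a purely hypothetical illustration of that idea (the `decoders` keyword and both callables are invented for this sketch):

```python
import xarray as xr

# Each coder is a callable Dataset -> Dataset, applied in the order passed.
def decode_my_times(ds):
    # custom datetime handling could replace the built-in times coder
    return ds

def decode_multi_variable(ds):
    # dataset-level ("multi-variable") coders become possible in this model
    return ds

ds = xr.open_dataset(
    "test.nc",
    engine="netcdf4",
    decoders=[decode_my_times, decode_multi_variable],  # hypothetical kwarg
)
```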