[Exploration]: Including dataarrays with our current dataset API model (#671 discussion)
Is your feature request related to a problem?
Refer to https://github.com/xCDAT/xcdat/discussions/671
This GitHub issue was just opened for tracking on the board project.
Using the spatial functionality, I started an example of how we might approach this goal with a new branch.
import xcdat as xc
fn = '/p/css03/esgf_publish/CMIP6/CMIP/MIROC/MIROC-ES2L/historical/r6i1p1f2/Amon/tas/gn/v20200318/tas_Amon_MIROC-ES2L_historical_r6i1p1f2_gn_185001-201412.nc'
ds = xc.open_mfdataset(fn)
tas = ds('tas')
tas.spatial.average()
<xarray.DataArray 'tas' (time: 1980)> Size: 16kB
dask.array<truediv, shape=(1980,), dtype=float64, chunksize=(1,), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) object 16kB 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
Attributes:
    standard_name:  air_temperature
    long_name:      Near-Surface Air Temperature
    comment:        near-surface (usually, 2 meter) air temperature
    units:          K
    original_name:  T2
    cell_methods:   area: time: mean
    cell_measures:  area: areacella
    history:        2019-12-27T22:22:52Z altered by CMOR: Treated scalar dime...
This is awesome! I'll take a closer look soon. Happy to see you got a working prototype.
The Xarray team is testing how index coordinates with non-array dimensions can be propagated (see PR #9671, PR #10116, and PR #10137). This update would allow DataArrays to include bounds. If adopted, we’ll need to revisit how xCDAT works with DataArrays.
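To illustrate why bounds are a problem today, here is a minimal sketch on a synthetic dataset. The variable names and values are made up for illustration and do not come from the file above. It shows that a bounds variable with an extra `bnds` dimension is not carried along when a DataArray is pulled out of a Dataset, even if the bounds are first promoted to a coordinate.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for a CF-style dataset: one data variable plus
# latitude bounds that have an extra "bnds" dimension.
lat = np.array([-45.0, 0.0, 45.0])
ds = xr.Dataset(
    data_vars={
        "tas": (("time", "lat"), np.full((2, 3), 280.0)),
        "lat_bnds": (("lat", "bnds"), np.stack([lat - 22.5, lat + 22.5], axis=1)),
    },
    coords={"time": [0, 1], "lat": lat},
)

# Extracting the variable keeps the "lat" coordinate but not the bounds.
tas = ds["tas"]
bounds_in_dataset = "lat_bnds" in ds
bounds_on_dataarray = "lat_bnds" in tas.coords

# Promoting the bounds to a coordinate does not help either, because
# "bnds" is not one of the dimensions of "tas".
tas_from_coords = ds.set_coords("lat_bnds")["tas"]
bounds_after_set_coords = "lat_bnds" in tas_from_coords.coords

print(bounds_in_dataset, bounds_on_dataarray, bounds_after_set_coords)
```

This is the gap that any DataArray-based xCDAT API has to work around until bounds can travel with the DataArray.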
Key Challenges:
- Accessor Extension: Dataset accessors can’t automatically work on DataArrays because they’re registered separately. One solution is to design a shared base class and then register separate accessors for each type. For example:
import xarray as xr

class BaseAccessor:
    def shared_method(self):
        # Shared implementation for both Dataset and DataArray
        return "shared result"

@xr.register_dataset_accessor("my_accessor")
class MyDatasetAccessor(BaseAccessor):
    def __init__(self, xarray_obj):
        self._obj = xarray_obj
        # Additional initialization specific to Datasets

@xr.register_dataarray_accessor("my_accessor")
class MyDataArrayAccessor(BaseAccessor):
    def __init__(self, xarray_obj):
        self._obj = xarray_obj
        # Additional initialization specific to DataArrays
- API Workflow: xCDAT Dataset accessor APIs currently require a data_var string to target a specific variable, and the operation returns a Dataset containing only that variable. We must update this logic to work seamlessly with DataArrays if we want to share functionality across accessor classes.
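To make the API workflow difference concrete, here is a minimal sketch, assuming a shared core that always computes on a DataArray. The accessor name `demo`, the class names, and the unweighted mean are all hypothetical stand-ins and are not xCDAT's implementation. The Dataset accessor keeps the `data_var` argument and returns a Dataset, while the DataArray accessor drops the argument and returns a DataArray.

```python
import numpy as np
import xarray as xr


class _BaseDemoAccessor:
    """Hypothetical shared core: all computation happens on a DataArray."""

    @staticmethod
    def _mean_over(da: xr.DataArray, dims) -> xr.DataArray:
        # Unweighted mean as a stand-in for a real xCDAT operation.
        return da.mean(dim=list(dims), keep_attrs=True)


@xr.register_dataset_accessor("demo")
class DemoDatasetAccessor(_BaseDemoAccessor):
    def __init__(self, ds: xr.Dataset):
        self._ds = ds

    def average(self, data_var: str, dims=("lat", "lon")) -> xr.Dataset:
        # Dataset workflow: select the variable by name, return a Dataset.
        result = self._mean_over(self._ds[data_var], dims)
        return result.to_dataset(name=data_var)


@xr.register_dataarray_accessor("demo")
class DemoDataArrayAccessor(_BaseDemoAccessor):
    def __init__(self, da: xr.DataArray):
        self._da = da

    def average(self, dims=("lat", "lon")) -> xr.DataArray:
        # DataArray workflow: no data_var argument, return a DataArray.
        return self._mean_over(self._da, dims)


# Synthetic data, made up for illustration.
ds = xr.Dataset(
    {"tas": (("time", "lat", "lon"), np.arange(24.0).reshape(2, 3, 4))}
)
ds_result = ds.demo.average("tas")
da_result = ds["tas"].demo.average()
```

Both calls run the same core routine; only the input selection and the return type differ.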
Questions to Consider:
- For each Dataset accessor (e.g., SpatialAccessor), should there be a corresponding DataArray accessor?
- How can we share code between these accessors to avoid duplication?
- What changes are necessary to ensure the accessors work correctly with both Datasets and DataArrays?
What if we create a function library that can accept either xr.DataArray or xr.Dataset objects:
- xc.spatial.spatial_average(...)
- xc.temporal.departures(...)
- ...
This is a possible option. We have some functions that operate like this.
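A function of that kind could be sketched as follows. This is only an illustration under stated assumptions: `demo_average` is a hypothetical name, the unweighted mean stands in for a real operation, and the rule that a Dataset input requires `data_var` simply mirrors the current Dataset accessor behavior described above.

```python
from typing import Optional, Sequence, Union

import numpy as np
import xarray as xr


def demo_average(
    obj: Union[xr.DataArray, xr.Dataset],
    data_var: Optional[str] = None,
    dims: Sequence[str] = ("lat", "lon"),
) -> Union[xr.DataArray, xr.Dataset]:
    """Hypothetical function that returns the same type it receives."""
    if isinstance(obj, xr.DataArray):
        # DataArray in, DataArray out; data_var is not needed.
        return obj.mean(dim=list(dims), keep_attrs=True)
    if isinstance(obj, xr.Dataset):
        if data_var is None:
            raise ValueError("data_var is required when passing a Dataset")
        # Dataset in, single-variable Dataset out.
        return obj[[data_var]].mean(dim=list(dims), keep_attrs=True)
    raise TypeError(f"expected DataArray or Dataset, got {type(obj).__name__}")


# Synthetic data, made up for illustration.
ds = xr.Dataset(
    {"tas": (("time", "lat", "lon"), np.ones((2, 3, 4)))}
)
from_dataset = demo_average(ds, data_var="tas")
from_dataarray = demo_average(ds["tas"])
```

One trade-off of this design is that users lose the accessor syntax, but a single code path serves both object types.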
Another approach is to create a DataArray accessor that mimics a Dataset accessor (similar to what you tried before). The steps are:
- Create a DataArray accessor equivalent to the Dataset accessor.
- Convert the DataArray to a Dataset.
- Call the Dataset accessor method.
- Convert the result back to a DataArray.
Implementation
import xarray as xr


@xr.register_dataset_accessor('spatial')
class SpatialAccessor:
    def __init__(self, ds):
        self._ds = ds

    def average(self, data_var):
        # Implement averaging for a Dataset. Return a Dataset holding only
        # data_var, as the xCDAT Dataset APIs do.
        return self._ds[[data_var]].mean()


@xr.register_dataarray_accessor('spatial')
class DataArrayAccessor:
    def __init__(self, da):
        self._da = da
        # Use the DataArray's name, or a default if it is None.
        self.data_var = da.name or "default"
        # Convert the DataArray to a single-variable Dataset.
        self._ds = da.to_dataset(name=self.data_var)

    def average(self):
        # Delegate to the Dataset accessor and extract the DataArray by name.
        return self._ds.spatial.average(self.data_var)[self.data_var]
Usage
import xcdat as xc
ds = xc.open_dataset(...)
tas = ds["tas"]
tas_avg = tas.spatial.average()
This approach lets you use the same accessor methods on both Datasets and DataArrays by converting between them as needed.
Pros:
- Code reuse: Implements logic once in the Dataset accessor.
- Consistent API: Allows the same method to be called on both Datasets and DataArrays.
- Modularity: Keeps conversion separate from core functionality.
Cons:
- Conversion overhead: Extra steps can impact performance.
- Name dependency: Relies on the DataArray having a valid name.
- Handling edge cases: Structural differences between DataArrays and Datasets might lead to ambiguity.
  - When converting a DataArray to a Dataset, you rely on the variable’s name and assume it fits neatly as a single variable within a Dataset structure. However, if the DataArray lacks a clear name or its associated metadata doesn't align well with the Dataset format, the conversion might not preserve all contextual details. This mismatch in structure can lead to ambiguity about which data or metadata is being processed or returned.
  - This is probably not an issue, because xCDAT APIs tend to operate on a single data variable at a time and return a Dataset with just that variable.
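The name dependency can be checked directly. The sketch below uses a synthetic unnamed DataArray, and the temporary name `_tmp_var` is a hypothetical choice made for illustration. It shows that the conversion fails without a name, and that a round trip through a temporary name keeps the attributes.

```python
import numpy as np
import xarray as xr

# Synthetic unnamed DataArray with one attribute.
unnamed = xr.DataArray(np.arange(4.0), dims="x", attrs={"units": "K"})

# Without a name, xarray refuses to build a Dataset.
conversion_error = None
try:
    unnamed.to_dataset()
except ValueError as err:
    conversion_error = err
print(f"conversion failed: {conversion_error}")

# Round trip through a temporary name (hypothetical choice), then restore
# the original name so the caller gets back what they passed in.
tmp_name = "_tmp_var"
ds = unnamed.to_dataset(name=tmp_name)
roundtrip = ds[tmp_name].copy()
roundtrip.name = unnamed.name

print(roundtrip.name, roundtrip.attrs)
```

Restoring the original name after the round trip avoids handing back a DataArray labeled with a placeholder the user never chose.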