[Exploration]: Including dataarrays with our current dataset API model (#671 discussion)

Open tomvothecoder opened this issue 11 months ago • 5 comments

Is your feature request related to a problem?

Refer to https://github.com/xCDAT/xcdat/discussions/671

This GitHub issue was just opened for tracking on the board project.

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

tomvothecoder avatar Jan 23 '25 18:01 tomvothecoder

Using the spatial functionality, I started an example of how we might approach this goal with a new branch.

import xcdat as xc
fn = '/p/css03/esgf_publish/CMIP6/CMIP/MIROC/MIROC-ES2L/historical/r6i1p1f2/Amon/tas/gn/v20200318/tas_Amon_MIROC-ES2L_historical_r6i1p1f2_gn_185001-201412.nc'
ds = xc.open_mfdataset(fn)
tas = ds["tas"]
tas.spatial.average()

<xarray.DataArray 'tas' (time: 1980)> Size: 16kB
dask.array<truediv, shape=(1980,), dtype=float64, chunksize=(1,), chunktype=numpy.ndarray>
Coordinates:
  * time     (time) object 16kB 1850-01-16 12:00:00 ... 2014-12-16 12:00:00
Attributes:
    standard_name:  air_temperature
    long_name:      Near-Surface Air Temperature
    comment:        near-surface (usually, 2 meter) air temperature
    units:          K
    original_name:  T2
    cell_methods:   area: time: mean
    cell_measures:  area: areacella
    history:        2019-12-27T22:22:52Z altered by CMOR: Treated scalar dime...

pochedls avatar Feb 01 '25 18:02 pochedls

This is awesome! I'll take a closer look soon. Happy to see you got a working prototype.

tomvothecoder avatar Feb 03 '25 18:02 tomvothecoder

The Xarray team is testing how index coordinates with non-array dimensions can be propagated (see PR #9671, PR #10116, and PR #10137). This update would allow DataArrays to include bounds. If adopted, we’ll need to revisit how xCDAT works with DataArrays.

Key Challenges:

  1. Accessor Extension:
    Dataset accessors can’t automatically work on DataArrays because they’re registered separately. One solution is to design a shared base class and then register separate accessors for each type. For example:

     import xarray as xr
    
     class BaseAccessor:
         def shared_method(self):
             # Shared implementation for both Dataset and DataArray
             return "shared result"
    
     @xr.register_dataset_accessor("my_accessor")
     class MyDatasetAccessor(BaseAccessor):
         def __init__(self, xarray_obj):
             self._obj = xarray_obj
             # Additional initialization specific to Datasets
    
     @xr.register_dataarray_accessor("my_accessor")
     class MyDataArrayAccessor(BaseAccessor):
         def __init__(self, xarray_obj):
             self._obj = xarray_obj
             # Additional initialization specific to DataArrays
    
  2. API Workflow:
    xCDAT Dataset accessor APIs currently require a data_var string to target a specific variable, and the operation returns a Dataset containing only that variable. We must update this logic to work seamlessly with DataArrays if we want to share the functionalities across accessor classes.
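One way the two challenges above might fit together is sketched below: the shared base class holds the computation, and each accessor only decides how to fetch the input and how to wrap the output. This is a minimal sketch, not xCDAT's implementation. The accessor name `demo_avg`, the class names, and the `_get_dataarray`/`_wrap` hooks are all hypothetical, and the unweighted mean is only a placeholder for the real weighted average.

```python
import numpy as np
import xarray as xr


class BaseAverager:
    """Hypothetical shared base: holds the logic common to both accessors."""

    def _get_dataarray(self, data_var=None):
        raise NotImplementedError

    def _wrap(self, result, data_var=None):
        raise NotImplementedError

    def average(self, data_var=None):
        da = self._get_dataarray(data_var)
        # Placeholder computation: an unweighted mean over lat/lon,
        # standing in for xCDAT's weighted spatial average.
        dims = [d for d in ("lat", "lon") if d in da.dims]
        return self._wrap(da.mean(dim=dims), data_var)


@xr.register_dataset_accessor("demo_avg")
class DatasetAverager(BaseAverager):
    def __init__(self, ds):
        self._obj = ds

    def _get_dataarray(self, data_var=None):
        if data_var is None:
            raise ValueError("data_var is required when calling on a Dataset")
        return self._obj[data_var]

    def _wrap(self, result, data_var=None):
        # Dataset in, Dataset out (containing only the target variable).
        return result.to_dataset(name=data_var)


@xr.register_dataarray_accessor("demo_avg")
class DataArrayAverager(BaseAverager):
    def __init__(self, da):
        self._obj = da

    def _get_dataarray(self, data_var=None):
        return self._obj

    def _wrap(self, result, data_var=None):
        # DataArray in, DataArray out.
        return result


# Usage with synthetic data
ds = xr.Dataset({"tas": (("time", "lat", "lon"), np.arange(24.0).reshape(2, 3, 4))})
avg_ds = ds.demo_avg.average("tas")      # returns a Dataset
avg_da = ds["tas"].demo_avg.average()    # returns a DataArray
```

With this split, the `data_var` argument only matters on the Dataset side, so the DataArray accessor can expose the same method without it.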

Questions to Consider:

  • For each Dataset accessor (e.g., SpatialAccessor), should there be a corresponding DataArray accessor?
  • How can we share code between these accessors to avoid duplication?
  • What changes are necessary to ensure the accessors work correctly with both Datasets and DataArrays?

tomvothecoder avatar Mar 27 '25 22:03 tomvothecoder

What if we create a function library that can accept either xr.DataArray or xr.Dataset objects:

  • xc.spatial.spatial_average(...)
  • xc.temporal.departures(...)
  • ...
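As a rough illustration of this idea, a single function could branch on the input type and return the same type it was given. The signature below (`obj`, `data_var`) is an assumption, not an existing xCDAT API, and the cosine-of-latitude weights are a simplification that assumes `lat` and `lon` dimensions with latitude in degrees; xCDAT derives its weights from coordinate bounds.

```python
import numpy as np
import xarray as xr


def spatial_average(obj, data_var=None):
    """Hypothetical free function accepting a Dataset or a DataArray."""
    if isinstance(obj, xr.Dataset):
        if data_var is None:
            raise ValueError("data_var is required when passing a Dataset")
        da = obj[data_var]
    elif isinstance(obj, xr.DataArray):
        da = obj
    else:
        raise TypeError("expected an xr.Dataset or xr.DataArray")

    # Simplified area weights (assumption): cosine of latitude.
    weights = np.cos(np.deg2rad(da["lat"]))
    avg = da.weighted(weights).mean(dim=("lat", "lon"))

    # Return the same kind of object that was passed in.
    if isinstance(obj, xr.Dataset):
        return avg.to_dataset(name=data_var)
    return avg


# Usage with synthetic data
lat = [-60.0, 0.0, 60.0]
lon = [0.0, 90.0, 180.0, 270.0]
values = np.broadcast_to(np.array([0.0, 4.0, 0.0])[None, :, None], (2, 3, 4)).copy()
tas = xr.DataArray(
    values, dims=("time", "lat", "lon"), coords={"lat": lat, "lon": lon}, name="tas"
)

avg_da = spatial_average(tas)                       # DataArray in, DataArray out
avg_ds = spatial_average(tas.to_dataset(), "tas")   # Dataset in, Dataset out
```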

pochedls avatar Mar 27 '25 23:03 pochedls

This is a possible option. We have some functions that operate like this.

Another approach is to create a DataArray accessor that mimics a Dataset accessor (similar to what you tried before). The steps are:

  1. Create a DataArray accessor equivalent to the Dataset accessor.
  2. Convert the DataArray to a Dataset.
  3. Call the Dataset accessor method.
  4. Convert the result back to a DataArray.

Implementation

import xarray as xr

@xr.register_dataset_accessor('spatial')
class SpatialAccessor:
    def __init__(self, ds):
        self._ds = ds

    def average(self, data_var):
        # Implement averaging for a Dataset. Return a Dataset containing
        # only the averaged variable, as the xCDAT Dataset APIs do.
        return self._ds[[data_var]].mean()

@xr.register_dataarray_accessor('spatial')
class DataArrayAccessor:
    def __init__(self, da):
        self._da = da
        # Use the DataArray's name, or a default if it is None.
        self.data_var = da.name or "default"
        # Convert the DataArray to a single-variable Dataset.
        self._ds = da.to_dataset(name=self.data_var)

    def average(self):
        # Delegate to the Dataset accessor and extract the DataArray by name.
        return self._ds.spatial.average(self.data_var)[self.data_var]

Usage

import xcdat as xc

ds = xc.open_dataset(...)
tas = ds["tas"]

tas_avg = tas.spatial.average()

This approach lets you use the same accessor methods on both Datasets and DataArrays by converting between them as needed.

Pros:

  • Code reuse: Implements logic once in the Dataset accessor.
  • Consistent API: Allows the same method to be called on both Datasets and DataArrays.
  • Modularity: Keeps conversion separate from core functionality.

Cons:

  • Conversion overhead: Extra steps can impact performance.
  • Name dependency: Relies on the DataArray having a valid name.
  • Handling edge cases: Structural differences between DataArrays and Datasets might lead to ambiguity.
    • When converting a DataArray to a Dataset, you rely on the variable’s name and assume it fits neatly as a single variable within a Dataset structure. However, if the DataArray lacks a clear name or its associated metadata doesn't align well with the Dataset format, the conversion might not preserve all contextual details. This mismatch in structure can lead to ambiguity about which data or metadata is being processed or returned.
    • This is probably not an issue, because xCDAT APIs tend to operate on a single data variable at a time and return a Dataset with just that variable.
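For the name dependency specifically, one possible mitigation is to convert under a temporary name and restore the original name (including `None`) afterwards. The sketch below is only an illustration: `apply_via_dataset`, `dataset_mean`, and the placeholder name are hypothetical, and `dataset_mean` stands in for a Dataset-based API that takes a `data_var` and returns a single-variable Dataset.

```python
import numpy as np
import xarray as xr

# Hypothetical placeholder used only when the DataArray has no name.
_TEMP_NAME = "__tmp_data_var__"


def dataset_mean(ds, data_var):
    """Stand-in for a Dataset-based API that returns a one-variable Dataset."""
    return ds[[data_var]].mean()


def apply_via_dataset(da, func):
    """Run a Dataset-based function on a DataArray and restore its name."""
    original_name = da.name
    key = original_name if original_name is not None else _TEMP_NAME

    ds = da.to_dataset(name=key)
    result = func(ds, key)[key]

    # Restore the caller's name so the temporary one never leaks out.
    result = result.copy(deep=False)
    result.name = original_name
    return result


# Usage with an unnamed and a named DataArray
unnamed = xr.DataArray(np.array([1.0, 2.0, 3.0]), dims="x")
named = unnamed.rename("tas")

out_unnamed = apply_via_dataset(unnamed, dataset_mean)
out_named = apply_via_dataset(named, dataset_mean)
```

This keeps the round trip invisible to the caller, although it still assumes the temporary name does not collide with an existing coordinate.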

tomvothecoder avatar Mar 28 '25 17:03 tomvothecoder