xarray-schema
Specifying a schema in terms of a Protocol
Hello! Thanks for making xarray-schema!
It would be great to be able to write an xarray schema in terms of a typing.Protocol. This would enable the schema to be used for both runtime and static validation. Let me describe my motivation here (it might already be obvious).
One challenge in designing a code base that passes around xarray arrays and datasets satisfying particular schemas is documenting which flavors of dataset a given function accepts. Furthermore, for complicated schemas especially, it is useful for static tools (type checkers and other IDE tools) to be able to tell a user which attributes do and do not exist on a given xarray object.
I have leveraged protocols to tackle these issues. Consider the following protocol that describes a dataset with the coordinates time and feature_component and the variables features and temperatures:
from typing import Protocol

import xarray as xr


class DataSetA(Protocol):
    @property
    def time(self) -> xr.DataArray:
        """
        Coordinate, shape-(N,), dtype-int
        """
        ...

    @property
    def feature_component(self) -> xr.DataArray:
        """
        Coordinate, shape-(D,), dtype-int
        The index for each component of a feature vector.
        """
        ...

    @property
    def features(self) -> xr.DataArray:
        """
        Data-Variable, shape-(N, D), dtype-float
        The D-dimensional vector for each feature.
        Coordinates:
        * time [N]
        * feature_component [D]
        """
        ...

    @property
    def temperatures(self) -> xr.DataArray:
        """
        Data-Variable, shape-(N,), dtype-float
        Temperature measurements.
        Coordinates:
        * time [N]
        """
        ...
With this, I can write functions like:
def process_dataset(data: DataSetA):
    ...
Not only does this annotation succinctly document to users what flavor of dataset is expected by process_dataset, but static tooling can now auto-complete and statically check the usages of data according to this protocol within the function. This is really nice to have.
It would be great to be able to write DataSetA so that it serves as a schema as well. In this way, DataSetA would serve as:
- Documentation for users
- A type that can be understood by static analysis tooling
- A schema for runtime validation.
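As a rough illustration of the third role, the members declared on a protocol can be read back at runtime and compared against a dataset. The sketch below is not part of xarray-schema: missing_members is a hypothetical helper, the protocol is trimmed to two members for brevity, and the check only confirms that each name exists as a coordinate or data variable.

```python
from typing import Protocol

import numpy as np
import xarray as xr


class DataSetA(Protocol):
    # Trimmed copy of the protocol above, for illustration only.
    @property
    def time(self) -> xr.DataArray: ...

    @property
    def features(self) -> xr.DataArray: ...


def missing_members(ds: xr.Dataset, proto: type) -> list[str]:
    """Hypothetical helper: names declared as properties on `proto`
    that are neither coordinates nor data variables of `ds`."""
    required = [name for name, member in vars(proto).items()
                if isinstance(member, property)]
    # Dataset.variables holds both coordinates and data variables.
    return [name for name in required if name not in ds.variables]


ds = xr.Dataset(
    {"features": (("time", "feature_component"), np.zeros((3, 2)))},
    coords={"time": np.arange(3)},
)
print(missing_members(ds, DataSetA))  # empty list when nothing is missing
```

This only covers names; shapes and dtypes need richer annotations than a bare xr.DataArray return type.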
Obviously, this would involve substantially more sophisticated return types for the coordinates and data variables, beyond xr.DataArray. Shape and dtype info would need to be specified as well. Perhaps particular forms of Annotated[xr.DataArray, ...] would suffice.
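One way the Annotated idea might look is sketched below. Everything beyond the standard typing machinery is an assumption made for illustration: Spec and validate are hypothetical names, not xarray or xarray-schema APIs, and the dtype is expressed as a NumPy kind code ("i" for integer, "f" for float) purely to keep the example short.

```python
from dataclasses import dataclass
from typing import Annotated, Protocol, get_type_hints

import numpy as np
import xarray as xr


@dataclass(frozen=True)
class Spec:
    """Hypothetical metadata attached to a return annotation."""
    dims: tuple[str, ...]
    kind: str  # NumPy dtype kind, e.g. "i" or "f"


class DataSetA(Protocol):
    @property
    def time(self) -> Annotated[xr.DataArray, Spec(("time",), "i")]: ...

    @property
    def features(
        self,
    ) -> Annotated[xr.DataArray, Spec(("time", "feature_component"), "f")]: ...


def validate(ds: xr.Dataset, proto: type) -> list[str]:
    """Return a list of problems; an empty list means `ds` conforms."""
    problems: list[str] = []
    for name, member in vars(proto).items():
        if not isinstance(member, property):
            continue
        if name not in ds.variables:
            problems.append(f"{name}: missing")
            continue
        # include_extras=True keeps the Annotated metadata on the hint.
        hint = get_type_hints(member.fget, include_extras=True).get("return")
        for spec in getattr(hint, "__metadata__", ()):
            if not isinstance(spec, Spec):
                continue
            if ds[name].dims != spec.dims:
                problems.append(f"{name}: dims {ds[name].dims} != {spec.dims}")
            if ds[name].dtype.kind != spec.kind:
                problems.append(
                    f"{name}: dtype kind {ds[name].dtype.kind!r} != {spec.kind!r}"
                )
    return problems


ds = xr.Dataset(
    {"features": (("time", "feature_component"), np.zeros((3, 2)))},
    coords={"time": np.arange(3)},
)
print(validate(ds, DataSetA))  # empty list when everything matches
```

Static tools still see each member as an xr.DataArray, since Annotated metadata is ignored by type checkers that do not know about it, while the runtime validator reads the same annotations back as a schema.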
Finally, I have to flag a substantial shortcoming of DataSetA: it doesn't "look" like a proper xarray.Dataset to static analysis tools. E.g. .loc and .sel don't exist. So really, there need to be proper protocols that describe xarray.DataArray and xarray.Dataset, which can be subclassed by the likes of DataSetA to remedy this. It isn't clear to me whether xarray itself would ship such protocols or whether xarray-schema would.
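To make the subclassing idea concrete, here is a small sketch of what such a base protocol could look like. DatasetLike is a hypothetical name, it covers only a tiny slice of the real xarray.Dataset API, and its signatures are deliberately loose (Any) rather than a faithful copy of xarray's own.

```python
from typing import Any, Protocol

import xarray as xr


class DatasetLike(Protocol):
    """Hypothetical base protocol: a small subset of xarray.Dataset."""

    def sel(self, *args: Any, **kwargs: Any) -> Any: ...

    def isel(self, *args: Any, **kwargs: Any) -> Any: ...

    @property
    def loc(self) -> Any: ...


class DataSetA(DatasetLike, Protocol):
    # Schema-specific members layered on top of the generic ones.
    @property
    def time(self) -> xr.DataArray: ...

    @property
    def temperatures(self) -> xr.DataArray: ...


def first_temperature(data: DataSetA) -> float:
    # .sel is now visible to static tooling through DatasetLike.
    return float(data.sel(time=0).temperatures)


ds = xr.Dataset(
    {"temperatures": ("time", [1.5, 2.5])},
    coords={"time": [0, 1]},
)
print(first_temperature(ds))
```

Whether a type checker accepts a plain xr.Dataset as an argument here depends on how it treats xarray's dynamic attribute access, which this sketch does not address.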
Thanks for reading this post. I'll be interested to hear your thoughts on this!
I decided to open an issue on xarray to propose that they implement protocols for Dataset and DataArray.
https://github.com/pydata/xarray/issues/6462
Sorry @rsokl for missing your post for so long. I think this is an interesting idea and one worth exploring. @andersy005 has also thought of something similar in the context of pydantic.