oceanspy
Refactor oceanspy subsample module to use custom xarray indexes
Current Status
An important part of oceanspy's feature set is providing custom ways to select data from Xarray datasets. These are enumerated in the API docs. The relevant functions are:
| Function | Description |
|---|---|
| cutout(od[, varList, YRange, XRange, ...]) | Cutout the original dataset in space and time preserving the original grid structure. |
| mooring_array(od, Ymoor, Xmoor, **kwargs) | Extract a mooring array section following the grid. |
| particle_properties(od, times, Ypart, Xpart, ...) | Extract Eulerian properties of particles using nearest-neighbor interpolation. |
| survey_stations(od, Ysurv, Xsurv[, delta, ...]) | Extract survey stations. |
Oceanspy needed to implement these functions because xarray's built-in indexers (e.g. .sel() or .interp()) were not capable of performing the required operations in the case of curvilinear grids.
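To make the limitation concrete, here is a minimal pure-Python sketch (not oceanspy's implementation) of the kind of lookup a curvilinear grid requires. Latitude and longitude are both 2-D functions of the grid indices (j, i), so finding the cell nearest to a point means searching both arrays at once, which a 1-D label index cannot express. The rotated toy grid and the `make_grid` and `nearest_cell` helpers are invented for illustration; a real implementation would use a great-circle distance and a tree structure rather than a brute-force scan.

```python
import math

# Hypothetical toy grid: a regular grid rotated by 30 degrees, so that
# both lat and lon vary along both grid dimensions (j, i).
ANGLE = math.radians(30.0)


def make_grid(nj=3, ni=3):
    """Build 2-D lat/lon arrays (nested lists) for the rotated toy grid."""
    lat, lon = [], []
    for j in range(nj):
        lat_row, lon_row = [], []
        for i in range(ni):
            lon_row.append(i * math.cos(ANGLE) - j * math.sin(ANGLE))
            lat_row.append(i * math.sin(ANGLE) + j * math.cos(ANGLE))
        lat.append(lat_row)
        lon.append(lon_row)
    return lat, lon


def nearest_cell(lat, lon, lat0, lon0):
    """Return the (j, i) position of the cell closest to (lat0, lon0).

    Brute-force scan using squared Euclidean distance in degrees,
    which is only an approximation of distance on the sphere.
    """
    best, best_d2 = None, math.inf
    for j, (lat_row, lon_row) in enumerate(zip(lat, lon)):
        for i, (la, lo) in enumerate(zip(lat_row, lon_row)):
            d2 = (la - lat0) ** 2 + (lo - lon0) ** 2
            if d2 < best_d2:
                best, best_d2 = (j, i), d2
    return best


lat, lon = make_grid()
# Query a point slightly offset from the cell at j=1, i=2.
cell = nearest_cell(lat, lon, lat[1][2] + 0.1, lon[1][2] - 0.1)
```

The result is a pair of integer positions, one per grid dimension, which is what a positional selection needs downstream.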
Ongoing Xarray Refactor
Supported by a CZI grant, the Xarray team has been hard at work on the so-called Flexible Indexes Refactor. From the Xarray roadmap:
Xarray currently keeps track of indexes associated with coordinates by storing them in the form of a pandas.Index in special xarray.IndexVariable objects.
The limitations of this model became clear with the addition of pandas.MultiIndex support in xarray 0.9, where a single index corresponds to multiple xarray variables. MultiIndex support is highly useful, but xarray now has numerous special cases to check for MultiIndex levels.
A cleaner model would be to elevate indexes to an explicit part of xarray’s data model, e.g., as attributes on the Dataset and DataArray classes. Indexes would need to be propagated along with coordinates in xarray operations, but would no longer need to have a one-to-one correspondence with coordinate variables. Instead, an index should be able to refer to multiple (possibly multidimensional) coordinates that define it. See GH 1603 for full details.
Specific tasks:
- Add an indexes attribute to xarray.Dataset and xarray.DataArray, as dictionaries that map from coordinate names to xarray index objects.
- Use the new index interface to write wrappers for pandas.Index, pandas.MultiIndex and scipy.spatial.KDTree.
- Expose the interface externally to allow third-party libraries to implement custom indexing routines, e.g., for geospatial look-ups on the surface of the Earth.
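The data model described above can be sketched in a few lines of plain Python. This is only a toy stand-in, not xarray's actual index interface, whose details were still being worked out in PR 5692. It shows one index object that is defined by two 2-D coordinates and is reachable from both coordinate names through an `indexes` dictionary. The class name `ToyCurvilinearIndex`, the coordinate names `YC` and `XC`, the dimension names `Y` and `X`, and all values are assumptions made for illustration.

```python
import math


class ToyCurvilinearIndex:
    """Toy stand-in for a custom index; NOT xarray's real Index interface."""

    def __init__(self, coords, dims):
        # coords: {coordinate name: 2-D nested list of values}
        # dims:   names of the two grid dimensions the coordinates share
        self.coords = coords
        self.dims = dims

    def sel(self, **labels):
        """Translate coordinate labels into one integer position per dimension."""
        names = list(self.coords)
        if set(labels) != set(names):
            raise KeyError(f"expected labels for {names}, got {sorted(labels)}")
        first = self.coords[names[0]]
        best, best_d2 = None, math.inf
        for j in range(len(first)):
            for i in range(len(first[0])):
                # Squared distance across all coordinates defining the index.
                d2 = sum((self.coords[n][j][i] - labels[n]) ** 2 for n in names)
                if d2 < best_d2:
                    best, best_d2 = (j, i), d2
        return dict(zip(self.dims, best))


# Hypothetical 2x2 curvilinear coordinates.
YC = [[0.0, 0.5], [1.0, 1.5]]
XC = [[0.0, 1.0], [-0.5, 0.5]]

index = ToyCurvilinearIndex({"YC": YC, "XC": XC}, dims=("Y", "X"))

# One index object, referenced from every coordinate that defines it.
indexes = {"YC": index, "XC": index}

positions = indexes["YC"].sel(YC=1.4, XC=0.4)
```

The point of the design is that the mapping from coordinate names to index objects is many-to-one, which is what removes the need for MultiIndex-style special cases.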
In addition to the new features it directly enables, this clean up will allow xarray to more easily implement some long-awaited features that build upon indexing, such as groupby operations with multiple variables.
Additional information about the refactor can be found at:
- https://github.com/pydata/xarray/pull/5692
- https://github.com/pydata/xarray/projects/1
Once PR 5692 is merged, this feature should be usable for development purposes.
Proposal: Refactor Oceanspy subsample functions to be custom Xarray indexes
The whole point of this refactor ("allow third-party libraries to implement custom indexing routines") is to enable projects like OceanSpy to bring their own concepts of indexing directly to xarray datasets. So I thought I would propose we do exactly that. The steps would look something like this.
- [ ] We experiment with the flexible indexes API and try to learn how it works using simple prototypes
- [ ] We translate the subsample functions one by one into third-party xarray indexes
- [ ] Consider separating these indexes into a standalone package which provides Xarray entrypoints, such that the indexes can be used independently from oceanspy
- [ ] Refactor oceanspy by deleting code, reducing the future maintenance burden for oceanspy developers 🎉
Pros
- This is a more modular design with better separation of concerns between elements
- The indexes can be used more widely on all manner of datasets, not just those loaded by oceanspy
- Potentially less code to maintain in oceanspy itself
Cons
- The work involved in the refactoring
- Probably others I can't think of
Thanks for opening an issue, @rabernat ! Does seem like an exciting proposal and a great way to keep maintaining oceanspy in the long run. Will get to this after ocean sciences. Till then, I'll wait to hear what @malmans2 and @Mikejmnez have to say.
Thanks for the enthusiasm @asiddi24! It's great that you see this as exciting. I see it as more of an unglamorous backend maintenance task that oceanspy users may not even notice...but will ultimately lead to better performance and maintainability.
I think they're all good suggestions! OceanSpy definitely needs maintenance/refactoring. I've another couple of suggestions:
- Make use of xoak to perform nearest neighbor interpolations and extract stations/moorings/floats. However, it might be that xoak will get superseded by xarray refactoring. I'm not up to speed with the ongoing xarray refactor, but xoak has been working great for me so far.
- OceanSpy naming convention is currently based on MITgcm conventions. I think using cf_xarray under the hood would make OceanSpy much more robust and easy to use (especially for users that are not familiar with MITgcm).
The xoak creator (@benbovy) is also the one leading the xarray index refactor. So I imagine they will converge in some way. Maybe xoak will just provide the index objects themselves, which xarray can then use? Benoit, we would love to hear about your plans for xoak (and get your general feedback on this issue).
Thanks for pinging me @rabernat.
> Maybe xoak will just provide the index objects themselves, which xarray can then use?
Yes, that's the plan with Xoak. I think it will still be useful to provide Dataset / DataArray accessors, for example to expose an Xarray-compatible low-level API like an .xoak.query() method to get the indices and distances of/to the nearest neighbors.
While I've not looked much into Oceanspy and this is a bit outside of my domain of expertise, the subsample functions seem like good use cases for experimenting with Xarray custom indexes, which at this stage would also be really helpful for the Xarray index refactor itself, as I'm sure there's still much room for improvement!
> Consider separating these indexes into a standalone package which provides Xarray entrypoints, such that the indexes can be used independently from oceanspy
>
> Make use of xoak to perform nearest neighbor interpolations and extract stations/moorings/floats.
Those are sensible points. I think that in the mid/long term it will be better for the ecosystem if we can avoid a jungle of Xarray indexes with lots of overlapping features.
In https://github.com/pydata/xarray/pull/5692 we require that matching indexes for alignment (merge, etc.) must have exactly the same type, which limits interoperability between indexes but makes the implementation much simpler. We might eventually support some kind of "duck" indexes, but it's a considerably harder problem.
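As a rough illustration of the trade-off, the following plain-Python sketch contrasts an exact type match with a looser "duck" check. It assumes that "exactly the same type" means an exact class comparison, so that even a subclass does not match; the classes and the `can_align` and `can_align_duck` helpers are hypothetical and do not reflect how the PR implements alignment.

```python
class ToyIndex:
    """Hypothetical index holding a flat list of labels."""

    def __init__(self, values):
        self.values = list(values)

    def equals(self, other):
        return self.values == other.values


class ToyTreeIndex(ToyIndex):
    """Hypothetical second index type with overlapping features."""


def can_align(a, b):
    # Strict rule (assumed): the two indexes must be of exactly the same
    # class. This is trivial to check and to reason about.
    return type(a) is type(b)


def can_align_duck(a, b):
    # Looser "duck" rule: anything exposing the needed methods qualifies.
    # Deciding which methods are needed, and what they must guarantee,
    # is the harder problem.
    return all(hasattr(x, "equals") for x in (a, b))


same = can_align(ToyIndex([1, 2]), ToyIndex([1, 2]))
mixed = can_align(ToyIndex([1, 2]), ToyTreeIndex([1, 2]))
mixed_duck = can_align_duck(ToyIndex([1, 2]), ToyTreeIndex([1, 2]))
```

Under the strict rule, two packages that each ship their own nearest-neighbor index produce datasets that cannot be aligned with each other, which is one more reason to avoid many overlapping index types.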