xoak icon indicating copy to clipboard operation
xoak copied to clipboard

Changes to optionally return distances with .sel

Open kthyng opened this issue 2 years ago • 8 comments

Changes include:

  • accessor optionally takes in a "distances_name" str with which to name the new variable in results to store calculated distances from query()
  • query() now returns distances
  • distances are returned (always?) in radians from the trees, but are converted to kilometers in query.
  • If a DataArray was input to .sel with "distances_name" input, a Dataset will be returned to contain the new variable
  • added example to intro notebook
  • added a test to test_accessor
  • also fixed an import in conftest

Notes:

  • I did not test this with the dask approach

kthyng avatar Jan 19 '23 18:01 kthyng

Sorry for all the additions. I have continued to use the code in the project I'm working on and have realized more about the existing code and better approaches as I go.

kthyng avatar Jan 19 '23 21:01 kthyng

I forgot to say this was meant to address https://github.com/xarray-contrib/xoak/issues/26.

kthyng avatar Jan 20 '23 15:01 kthyng

@benbovy or @willirath are either of you available to take a look at this? I'm hoping to be able to use these changes in a version update of xoak with a package of mine that uses xoak. Thanks for your time.

kthyng avatar Jan 23 '23 15:01 kthyng

@benbovy I'm completely open to the API and detailed choices in the code if you want to change anything. I just need access to the distances! I am aware of the new indices in xarray but haven't seen them in action to see how they work or how to use them yet.

kthyng avatar Jan 23 '23 16:01 kthyng

I am aware of the new indices in xarray but haven't seen them in action to see how they work or how to use them yet.

Yes this is pretty new and still largely undocumented. Xoak is a good place to try this feature, but we need a smooth transition so I don't expect it to be straightforward.

I'm completely open to the API and detailed choices in the code if you want to change anything.

Sorry I could have provided more details in my previous comment. I was thinking about this alternative:

  • Keep Dataset.xoak.sel(indexers=None, **indexers_kwargs) unchanged so that it will be easier to later switch to using Dataset.sel((indexers=None, **indexers_kwargs) with a custom Xarray index implemented in xoak. We might want eventually provide custom options to Dataset.sel but it is still not clear how would look like the API (https://github.com/pydata/xarray/issues/7099).

  • Add a Dataset.query(indexers=None, **indexers_kwargs) that returns both distances and indices as Xarray DataArray and Dataset objects (#32).

This would look like this (reusing the examples in the documentation):

>>> ds_mesh
# <xarray.Dataset>
# Dimensions:  (x: 100, y: 100)
# Coordinates:
#     lat      (x, y) float64 ...
#     lon      (x, y) float64 ...
# Dimensions without coordinates: x, y
# Data variables:
#     field    (x, y) float64 ...

>>> ds_trajectory
# <xarray.Dataset>
# Dimensions:    (trajectory: 30)
# Dimensions without coordinates: trajectory
# Data variables:
#     latitude   (trajectory) ...
#     longitude  (trajectory) ...

>>> ds_mesh.xoak.set_index(['lat', 'lon'], 'sklearn_geo_balltree')

>>> distances, indices = ds_mesh.xoak.query(
...    lat=ds_trajectory.latitude,
...    lon=ds_trajectory.longitude,
... )

>>> distances
# <xarray.DataArray (trajectory: 30)>
# ...
# Dimensions without coordinates: trajectory

>>> indices
# <xarray.Dataset>
# Dimensions:  (trajectory: 30)
# Dimensions without coordinates: trajectory
# Data variables:
#     x        (trajectory) int64 ...
#     y        (trajectory) int64 ...

Having a query method that returns a (distances, indices) tuple is consistent with scipy.spatial.KDTree.query(), sklearn.neighbors.KDTree.query(), sklearn.neighbors.BallTree.query(), etc.

The distances and indices Xarray objects returned here are a bit confusing as they don't have the same type and structure, but it allows doing:

>>> results = ds_mesh.isel(indices).assign(distances=distances)
>>> results
# <xarray.Dataset>
# Dimensions:  (trajectory: 30)
# Coordinates:
#     lat        (trajectory) float64 ...
#     lon        (trajectory) float64 ...
# Dimensions without coordinates: trajectory
# Data variables:
#     field      (trajectory) float64 ...
#     distances  (trajectory) float64 ...

Now, this is clearly more verbose than what is proposed in this PR:

>>> ds_mesh.xoak.sel(
...     lat=ds_trajectory.latitude,
...     lon=ds_trajectory.longitude,
...     distances_name="distances"
... )

Also, because Dataset.xoak.query() doesn't return a single Xarray object it wouldn't be easy to chain it with other Dataset / DataArray methods (it is possible using pipe -> merge but this is less straightforward than just using Dataset.xoak.sel()). So maybe we could support returning the distances via both .xoak.query() and xoak.sel() after all? What do you think @kthyng @willirath?

benbovy avatar Jan 24 '23 09:01 benbovy

So maybe we could support returning the distances via both .xoak.query() and xoak.sel() after all?

If we're going to support both, I think we can either merge this PR and then refactor #32 or the other way around.

benbovy avatar Jan 24 '23 09:01 benbovy

@benbovy I like having the indices and distances explicitly available with

distances, indices = ds_mesh.xoak.query( ...

as you suggest. One thing I didn't like about my change is that when a DataArray is input but distances are returned, a Dataset is returned to contain both variables. It's been messing up my code here and there.

Maybe it would be better to keep it simple with .sel (no changes) and if you want to have the distances and indices explicitly, you have to do the two step process you laid out of

distances, indices = ds_mesh.xoak.query(
...    lat=ds_trajectory.latitude,
...    lon=ds_trajectory.longitude,
... )
results = ds_mesh.isel(indices).assign(distances=distances)

But I do think it should be up to you @benbovy and @willirath so let me know what you prefer and I can try to implement, or you can go ahead with it, either way.

kthyng avatar Jan 24 '23 16:01 kthyng

I think leaving .xoak.sel() unchanged and as close to the standard xarray .sel() API as possible is what we should do. The proposal to have .query() returning, both, distances and indices sounds like the best way to implement this.

willirath avatar Jan 24 '23 16:01 willirath