
Support SQL-Style Joins between Xarray datasets and Dask/Pandas dataframes

Open alxmrs opened this issue 1 year ago • 6 comments

Here's an example workflow that I'd like to support once this feature exists. This is from Jake Wall of the Mara Elephant Project. Here, he would make use of raster and table data from Earth Engine.

Yeah, so one example, is to extract a NDVI value from an IC for every GPS point recorded by an elephant. We have millions of points that get translated into features. Then a reduce operation is run on the point to get the closest n values in time to when the GPS point occurred. We then spit this back out as an array and join it with the original geopandas dataframe.

I'm imagining this would look like a left join from a Dask DataFrame holding the elephant coordinates to an EE ImageCollection opened with Xee via Qarray. Some details are fuzzy, like how we'd work in a nearest-neighbor (NN) lookup (maybe this could be done via a SQL aggregation?).
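To make the shape of that join concrete, here is a minimal sketch in plain pandas. It is not the API of this library: it assumes the raster has already been flattened into a long table with one row per (time, cell), and it uses `pd.merge_asof` as a stand-in for the nearest-in-time lookup. All column names, values, and the 0.1 degree grid are made up for illustration.

```python
import pandas as pd

RES = 0.1  # assumed raster resolution in degrees (hypothetical)

# Hypothetical GPS fixes, one row per recorded point.
points = pd.DataFrame({
    "point_id": [1, 2, 3],
    "time": pd.to_datetime(["2024-01-03", "2024-01-10", "2024-01-18"]),
    "lat": [-1.42, -1.38, -1.41],
    "lon": [35.11, 35.18, 35.12],
})

# Hypothetical raster, already flattened to one row per (time, iy, ix) cell.
raster = pd.DataFrame({
    "time": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-17", "2024-01-17"]),
    "iy": pd.array([-14, -14, -14, -14], dtype="int64"),
    "ix": pd.array([351, 352, 351, 352], dtype="int64"),
    "ndvi": [0.30, 0.40, 0.50, 0.60],
})

# Snap each point to an integer cell index so the spatial part is an equi-join.
points["iy"] = (points["lat"] / RES).round().astype("int64")
points["ix"] = (points["lon"] / RES).round().astype("int64")

# Left join: exact match on the cell, nearest match on time.
# merge_asof requires both inputs to be sorted by the `on` key.
joined = pd.merge_asof(
    points.sort_values("time"),
    raster.sort_values("time"),
    on="time",
    by=["iy", "ix"],
    direction="nearest",
)
print(joined[["point_id", "time", "ndvi"]])
```

Every point is kept, as in a left join, and each picks up the NDVI from the closest raster timestamp in its cell. Getting the closest n values rather than one would need a different approach, for example a range join followed by a ranked aggregation.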

In general, I think there is broad demand for being able to join raster and tabular data with each other. Further down the line, I bet we could implement geo-aware joins that make use of geometry.

alxmrs avatar Jan 31 '24 09:01 alxmrs

This should be possible to demo once #8 is complete. If we figure this out, we should document it in the README.

alxmrs avatar Feb 17 '24 05:02 alxmrs

I’ve been reading more into how this is done today. The best example I can find for joining rasters with point data (and vectors) uses a hierarchical spatial index like h3 or s2.

https://github.com/uber/h3-py-notebooks/blob/master/notebooks/unified_data_layers.ipynb

I wonder if this is the technique that underpins Fused.io.
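A rough sketch of the shared-index idea, under heavy assumptions: a toy lat/lon quadtree key stands in for h3 or s2 (it is neither), and the tables, column names, and level are hypothetical. Raster pixels and points are both mapped to the same cell id, after which the join is an ordinary equi-join.

```python
import pandas as pd

def quadkey(lat: float, lon: float, level: int) -> str:
    """Toy hierarchical cell id: bisect the lat/lon box `level` times.

    A parent cell's key is always a prefix of its children's keys.
    """
    south, north, west, east = -90.0, 90.0, -180.0, 180.0
    digits = []
    for _ in range(level):
        mid_lat = (south + north) / 2
        mid_lon = (west + east) / 2
        digit = 0
        if lat >= mid_lat:
            digit += 2
            south = mid_lat
        else:
            north = mid_lat
        if lon >= mid_lon:
            digit += 1
            west = mid_lon
        else:
            east = mid_lon
        digits.append(str(digit))
    return "".join(digits)

LEVEL = 4  # very coarse cells, chosen only to keep the example readable

# Hypothetical raster pixels (flattened) and GPS points.
pixels = pd.DataFrame({
    "lat": [-1.5, -1.3, 2.0],
    "lon": [35.0, 35.2, 36.0],
    "ndvi": [0.25, 0.75, 0.125],
})
points = pd.DataFrame({
    "point_id": [1, 2, 3],
    "lat": [-1.42, 2.1, -20.0],
    "lon": [35.11, 36.2, 30.0],
})

# Index both sides with the same cell id.
pixels["cell"] = [quadkey(a, o, LEVEL) for a, o in zip(pixels["lat"], pixels["lon"])]
points["cell"] = [quadkey(a, o, LEVEL) for a, o in zip(points["lat"], points["lon"])]

# Aggregate the raster per cell, then left-join the points on the cell id.
cells = pixels.groupby("cell", as_index=False)["ndvi"].mean()
joined = points.merge(cells, on="cell", how="left")
print(joined[["point_id", "cell", "ndvi"]])
```

Points that fall in a cell with no raster data come back with a missing `ndvi`, which is the usual left-join behavior. A real index such as h3 or s2 would give more uniform cells than this lat/lon bisection.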

alxmrs avatar Aug 27 '24 01:08 alxmrs

For non-geospatial data, could we use a kdtree to create a hierarchical index? 🤔
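One way this might look, as a sketch rather than a design: build a k-d tree over the coordinate columns of one table and query it with the other, which gives a nearest-neighbor join. This uses SciPy's `KDTree`; the tables and column names are hypothetical.

```python
import pandas as pd
from scipy.spatial import KDTree

# Hypothetical reference table with arbitrary (non-geospatial) coordinates.
grid = pd.DataFrame({
    "x": [0.0, 1.0, 0.0, 1.0],
    "y": [0.0, 0.0, 1.0, 1.0],
    "value": [10, 20, 30, 40],
})

# Hypothetical query table to be joined against the grid.
queries = pd.DataFrame({
    "x": [0.1, 0.9, 0.2],
    "y": [0.2, 0.8, 0.9],
})

# Build the tree once over the reference coordinates.
tree = KDTree(grid[["x", "y"]].to_numpy())

# For each query row, find the distance to and position of its nearest grid row.
dist, idx = tree.query(queries[["x", "y"]].to_numpy(), k=1)

# Attach the matched value and the match distance to the query rows.
joined = queries.assign(nn_dist=dist, value=grid["value"].to_numpy()[idx])
print(joined)
```

The tree here only accelerates the lookup. Whether it could also serve as a reusable, pre-computed join key in the way h3 or s2 cell ids do is the open question.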

alxmrs avatar Aug 27 '24 01:08 alxmrs

This podcast episode is incredibly validating of the use case that this library (and issue) solves.

https://overcast.fm/+AAU1XJb7r0Y/6:55

alxmrs avatar Sep 04 '24 09:09 alxmrs

https://github.com/DahnJ/H3-Pandas

This gives me more confidence that an index system (geospatial via s2 and h3, or pre-computed via kdtrees) is a good integration. To me, this is proof of demand for such features.

alxmrs avatar Oct 11 '24 03:10 alxmrs

To close this, I'd like to see an explicit example (see #73).

alxmrs avatar Sep 27 '25 19:09 alxmrs