rioxarray icon indicating copy to clipboard operation
rioxarray copied to clipboard

Expose sample on the rioxarray accessor

Open martinfleis opened this issue 1 year ago • 9 comments

I was recently comparing the performance of point sampling from a raster in xvec (https://github.com/xarray-contrib/xvec/issues/81) and learned that when using xarray's sel, the sampling is about 40x slower than if I use rasterio's sample method (see the notebook in the linked issue).

I can possibly use sample from the rasterio object available within the rio._manager as list(dtm_da.rio._manager.acquire().sample(list(zip(x, y)))) but that is using a private API of rioxarray.

Would it be in scope to expose sample directly on the rio accessor? We could then consume it in xvec when dealing with rioxarray-backed DataArrays.

martinfleis avatar Aug 29 '24 08:08 martinfleis

rio._manager is something that is easily lost after performing operations on the DataArray|Dataset and is only available when the DataArray|Dataset is opened with rioxarray.open_rasterio or when using engine="rasterio". Unfortunately, due to these limitations, exposing sample on the rio accessor would likely cause confusion as it would not always work as expected.

I would be interested to know if this method is any faster.

If it is, that may enable us to add rio.sample:

def sample(self, *args, **kwargs):
    try:
       return self._manager.acquire().sample(*args, **kwargs)
    except AttributeError:
        pass
    with MemoryFile() as memfile:
        self._xds.rio.to_raster(memfile.name)
        with memfile.open() as dst:
             return dst.sample(*args, **kwargs)

snowman2 avatar Aug 29 '24 13:08 snowman2

Using the DTM from the notebook above, the trip via MemoryFile takes 36s, about 4x more than using the built-in sel method of xarray, so there's no point in doing that.

It seems that exposing rasterio's sample is not a feasible option. Thanks anyway!

martinfleis avatar Aug 29 '24 14:08 martinfleis

These observations are also useful to be aware of: https://github.com/xarray-contrib/xvec/issues/81#issuecomment-2318036713

Thanks for the proposal!

snowman2 avatar Aug 29 '24 19:08 snowman2

I just looked into this a little bit, and notice that selecting points via xarray first loads the all pixels a buffered envelope around all the points, rather than the rasterio.sample approach of serially loading points as single pixel windowed reads. This is a big contrast if you open a large remote VRT!

import logging
import rasterio
import xarray as xr

logging.basicConfig(level=logging.DEBUG,
                    handlers=[logging.StreamHandler()]
                    )

# Size is 1296001, 417601
url = 'https://opentopography.s3.sdsc.edu/raster/SRTM_GL1_Ellip/SRTM_GL1_Ellip_srtm.vrt'

ENV = rasterio.Env(GDAL_DISABLE_READDIR_ON_OPEN='EMPTY_DIR',
                   CPL_DEBUG=True,
                   )

# Sample a few points spread far apart 
# https://docs.xarray.dev/en/latest/user-guide/interpolation.html#advanced-interpolation
points = [(-122.33, 47.60), (-74.00, 40.71), (-95.37, 29.76)]
xi = xr.DataArray([p[0] for p in points], dims="z")
yi = xr.DataArray([p[1] for p in points], dims="z")

with rasterio.open(url) as src:
    samples_rasterio = list(rasterio.sample.sample_gen(src, points))
# IO window xoff=304668.0 yoff=108864.0 width=1.0 height=1.0
# IO window xoff=381600.0 yoff=69444.0 width=1.0 height=1.0
# IO window xoff=304668.0 yoff=108864.0 width=1.0 height=1.0

# NOTE: xarray tries to load a large buffered window arround all points
# This will likely crash, requiring too much memory
daOT = xr.open_dataarray(url, engine='rasterio').squeeze()
samples_xarray = daOT.sel(x=xi, y=yi, method='nearest').compute()
# IO window xoff=207612.0 yoff=44640.0 width=173989.0 height=64225.0

@benbovy, do you know if there is a way to make xarray only sample neighborhoods around points? Or is this behavior unique to rioxarray? I admit I haven't looked at the code close enough to know where that envelope window is coming from...

scottyhq avatar Jan 06 '25 13:01 scottyhq

I think it would be nice to have a built-in function to make this simple and efficient with rioxarray (or xvec) as it's a common use-case. The workaround I'm currently using is to compute windows ahead of time and use isel_window to extract values:

# Loop over custom windows 
# NOTE: could be various sizes to enable bilinear interpolation etc
# e.g. https://github.com/kylebarron/demquery/blob/5871ad22541c7645b456a8fb998dc2be57b5ef91/demquery/demquery.py#L43
from rasterio.windows import Window
values = []
with rasterio.open(url) as src:
    for point in points:
        row, col = src.index(point[0], point[1])
        window = Window(col, row, 1, 1)
        values.append(daOT.rio.isel_window(window).to_numpy())

Edit: actually, it's more straightforward to just loop & stick with da.sel (single points trigger IO windows with width=1.0 height=1.0):

pts_da = np.zeros_like(xi)
for i,p in enumerate(points):
    pts_da[i] = dtm_da.sel(x=p[0], y=p[1], method="nearest").squeeze().to_numpy()

scottyhq avatar Jan 06 '25 13:01 scottyhq

@benbovy, do you know if there is a way to make xarray only sample neighborhoods around points? Or is this behavior unique to rioxarray?

This seems to be handled by https://github.com/corteva/rioxarray/blob/4745aec6df1dccbece17dd23ee7ae4830bfb4c6a/rioxarray/_io.py#L291

More specifically RasterioArrayWrapper._get_indexer(key) returns a single window regardless of the input key. For vectorized indexing this is likely not optimal in most cases and yields very poor performance in cases like sampling a huge DTM with some arbitrarily located points spread far apart.

Maybe RasterioArrayWrapper could be refactored so it separately handles orthogonal vs. vectorized indexing? (see, e.g., the ZarrArrayWrapper, which is another example of a BackendArray subclass, although I don't how this should be properly implemented in 3rd-party IO Xarray backends in the context of the NamedArray refactor). Maybe vectorized indexing could be implemented there by relying on rasterio's sample? Not sure as I'm not familiar enough with rasterio / rioxarray. Computing a window for each input point could be an alternative although it may not be optimal (or even yield bad performance) in many cases too?

benbovy avatar Jan 06 '25 15:01 benbovy

https://github.com/corteva/rioxarray/blob/4745aec6df1dccbece17dd23ee7ae4830bfb4c6a/rioxarray/_io.py#L457

You'll have to chnage this to IndexingSupport.Vectorized to receive the right inputs to do what you need IIUC. Currently Xarray is rewriting the user query to only use OuterIndexer as requested by this line.

dcherian avatar Jan 06 '25 17:01 dcherian

Is it (already) recommended for 3rd-party backend array classes to migrate to .oindex and .vindex?

benbovy avatar Jan 06 '25 19:01 benbovy

no anderson and my work is pending a review from Stephan :(

dcherian avatar Jan 06 '25 21:01 dcherian