pyresample icon indicating copy to clipboard operation
pyresample copied to clipboard

Status of dask support in pyresample?

Open rabernat opened this issue 5 years ago • 9 comments

The latest pyresample docs state:

Interfaces to XArray objects (including dask array support) are provided in separate Resampler class interfaces and are in active development.

However, there is no further mention of xarray / dask support in the rest of the docs. There seem to be a few dask issues (e.g. #148), but I could not ascertain the status of xarray / dask support based on browsing the docs and the repo.

Could you clarify where things stand? My use case is that I would like to use pyresample lazily on dask arrays, where the data is chunked contiguously in space but has many samples in time. Here's what I have tried, representing each timestep as a different channel:

import numpy as np
import dask.array as da
from pyresample import image, geometry

area_def = geometry.AreaDefinition('areaD', 'Europe (3km, HRV, VTC)', 'areaD',
                               {'a': '6378144.0', 'b': '6356759.0',
                                'lat_0': '50.00', 'lat_ts': '50.00',
                                'lon_0': '8.00', 'proj': 'stere'},
                               800, 800,
                               [-1370912.72, -909968.64,
                                1029087.28, 1490031.36])
msg_area = geometry.AreaDefinition('msg_full', 'Full globe MSG image 0 degrees',
                               'msg_full',
                               {'a': '6378169.0', 'b': '6356584.0',
                                'h': '35785831.0', 'lon_0': '0',
                                'proj': 'geos'},
                               3712, 3712,
                               [-5568742.4, -5568742.4,
                                5568742.4, 5568742.4])

# Here I have 10 "timesteps" (pyresample calls these "channels")
# The channels are the final axis (axis=2) of the array
# works if I use numpy
# data = np.random.rand(3712, 3712, 10)
data = da.random.random((3712, 3712, 10), chunks=(3712, 3712, 1))

msg_con_nn = image.ImageContainerNearest(data, msg_area, radius_of_influence=50000)
msg_con_nn.resample(area_def)

I would expect this to lazily return a dask array with the same chunk structure as the input array, with resampling performed on demand as the chunks are loaded. Instead I hit an error when creating the ImageContainerNearest object:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-9-d3a77de18e5c> in <module>
      1 #data = np.random.rand(3712, 3712, 10)
      2 data = da.random.random((3712, 3712, 10), chunks=(3712, 3712, 1))
----> 3 msg_con_nn = image.ImageContainerNearest(data, msg_area, radius_of_influence=50000)

/srv/conda/envs/notebook/lib/python3.7/site-packages/pyresample/image.py in __init__(self, image_data, geo_def, radius_of_influence, epsilon, fill_value, reduce_data, nprocs, segments)
    255         super(ImageContainerNearest, self).__init__(image_data, geo_def,
    256                                                     fill_value=fill_value,
--> 257                                                     nprocs=nprocs)
    258         self.radius_of_influence = radius_of_influence
    259         self.epsilon = epsilon

/srv/conda/envs/notebook/lib/python3.7/site-packages/pyresample/image.py in __init__(self, image_data, geo_def, fill_value, nprocs)
     59             geo_def = geo_def.freeze()
     60         if not isinstance(image_data, (np.ndarray, np.ma.core.MaskedArray)):
---> 61             raise TypeError('image_data must be either an ndarray'
     62                             ' or a masked array')
     63         elif ((image_data.ndim > geo_def.ndim + 1) or

TypeError: image_data must be either an ndarray or a masked array

rabernat avatar Aug 13 '19 17:08 rabernat

Sadly, the old interfaces do not support dask. We've been playing around with support for Xarray DataArray objects backed by dask when using Satpy and have had a lot of success, but not everything we want lives in pyresample at the moment. That said I think you could do what you want if you are willing to wrap things in a DataArray first and are willing to deal with some less than perfect interfaces.

import numpy as np
import dask.array as da
import xarray as xr
from pyresample import geometry
from pyresample.kd_tree import XArrayResamplerNN

area_def = geometry.AreaDefinition('areaD', 'Europe (3km, HRV, VTC)', 'areaD',
                               {'a': '6378144.0', 'b': '6356759.0',
                                'lat_0': '50.00', 'lat_ts': '50.00',
                                'lon_0': '8.00', 'proj': 'stere'},
                               800, 800,
                               [-1370912.72, -909968.64,
                                1029087.28, 1490031.36])
msg_area = geometry.AreaDefinition('msg_full', 'Full globe MSG image 0 degrees',
                               'msg_full',
                               {'a': '6378169.0', 'b': '6356584.0',
                                'h': '35785831.0', 'lon_0': '0',
                                'proj': 'geos'},
                               3712, 3712,
                               [-5568742.4, -5568742.4,
                                5568742.4, 5568742.4])

# Here I have 10 "timesteps" (pyresample calls these "channels")
# The channels are the final axis (axis=2) of the array
# works if I use numpy
# data = np.random.rand(3712, 3712, 10)
data = da.random.random((3712, 3712, 10), chunks=(3712, 3712, 1))
data = xr.DataArray(data, dims=('y', 'x', 'time'))

resampler = XArrayResamplerNN(msg_area, area_def, 50000)
resampler.get_neighbour_info()
result = resampler.get_sample_from_neighbour_info(data)

Note that nearest neighbor in pyresample uses the KDTree from pykdtree which is faster than scipy and uses OpenMP, but is not dask or multi-process friendly otherwise. This means that we have to create a single KDTree from the lon/lats of the source area definition (3712, 3712). We then query it in parallel. Also, since it is using OpenMP you may want to limit the number of OpenMP threads being used export OMP_NUM_THREADS=1 (or 2). See https://satpy.readthedocs.io/en/latest/faq.html for other tips that we normally recommend when using dask and resampling.

Edit: The point of the last paragraph was to say: I've never tested these dask interfaces on a cluster or multiprocess scheduler. Threaded scheduler will probably perform best.

djhoese avatar Aug 13 '19 19:08 djhoese

And...because I have you here, keep an eye out for https://github.com/geoxarray/geoxarray which should simplify some of this in the future (taking a Dataset from anywhere and remap it with pyresample).

djhoese avatar Aug 13 '19 19:08 djhoese

@rabernat I don't know if you question is still relevant, but I just want to name the recent work on the resample_blocks function: https://pyresample.readthedocs.io/en/latest/api/pyresample.html#pyresample.resampler.resample_blocks It aims at being a counterpart to dask's map_blocks for transforming arrays between different projections.

mraspaud avatar Oct 03 '22 13:10 mraspaud

@djhoese @mraspaud are there any examples of using the resample_blocks function with Xarray objects? I'm interested in trying it out as an alternative to using map_blocks with rasterio's reprojection (xref https://github.com/carbonplan/ndpyramid/issues/10).

maxrjones avatar Mar 19 '24 20:03 maxrjones

@maxrjones I'm just coming back from paternity leave and also don't have much experience with customizing calls to resample_blocks, but if I remember correctly @mraspaud designed it only for dask array inputs and not DataArray/Dataset inputs. @mraspaud will have to provide any more details if he has them.

djhoese avatar Mar 20 '24 13:03 djhoese

resample_blocks was written with dask arrays in mind, but it should be quite straightforward to add a small wrapper to support xarray also. The docstring shows how it was thought to be used, but feels free to ask more specific questions if you need help.

mraspaud avatar Mar 20 '24 14:03 mraspaud

Sounds good, thanks!

maxrjones avatar Mar 20 '24 14:03 maxrjones