pyresample
pyresample copied to clipboard
Status of dask support in pyresample?
The latest pyresample docs state:
Interfaces to XArray objects (including dask array support) are provided in separate Resampler class interfaces and are in active development.
However, there is no further mention of xarray / dask support in the rest of the docs. There seem to be a few dask issues (e.g. #148), but I could not ascertain the status of xarray / dask support based on browsing the docs and the repo.
Could you clarify where things stand? My use case is that I would like to use pyresample lazily on dask arrays, where the data is chunked contiguously in space but has many samples in time. Here's what I have tried, representing each timestep as a different channel:
import numpy as np
import dask.array as da
from pyresample import image, geometry
area_def = geometry.AreaDefinition('areaD', 'Europe (3km, HRV, VTC)', 'areaD',
{'a': '6378144.0', 'b': '6356759.0',
'lat_0': '50.00', 'lat_ts': '50.00',
'lon_0': '8.00', 'proj': 'stere'},
800, 800,
[-1370912.72, -909968.64,
1029087.28, 1490031.36])
msg_area = geometry.AreaDefinition('msg_full', 'Full globe MSG image 0 degrees',
'msg_full',
{'a': '6378169.0', 'b': '6356584.0',
'h': '35785831.0', 'lon_0': '0',
'proj': 'geos'},
3712, 3712,
[-5568742.4, -5568742.4,
5568742.4, 5568742.4])
# Here I have 10 "timesteps" (pyresample calls these "channels")
# The channels are the final axis (axis=2) of the array
# works if I use numpy
# data = np.random.rand(3712, 3712, 10)
data = da.random.random((3712, 3712, 10), chunks=(3712, 3712, 1))
msg_con_nn = image.ImageContainerNearest(data, msg_area, radius_of_influence=50000)
msg_con_nn.resample(area_def)
I would expect this to lazily return a dask array with the same chunk structure as the input array, with resampling performed on demand as the chunks are loaded. Instead I hit an error when creating the ImageContainerNearest
object:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-9-d3a77de18e5c> in <module>
1 #data = np.random.rand(3712, 3712, 10)
2 data = da.random.random((3712, 3712, 10), chunks=(3712, 3712, 1))
----> 3 msg_con_nn = image.ImageContainerNearest(data, msg_area, radius_of_influence=50000)
/srv/conda/envs/notebook/lib/python3.7/site-packages/pyresample/image.py in __init__(self, image_data, geo_def, radius_of_influence, epsilon, fill_value, reduce_data, nprocs, segments)
255 super(ImageContainerNearest, self).__init__(image_data, geo_def,
256 fill_value=fill_value,
--> 257 nprocs=nprocs)
258 self.radius_of_influence = radius_of_influence
259 self.epsilon = epsilon
/srv/conda/envs/notebook/lib/python3.7/site-packages/pyresample/image.py in __init__(self, image_data, geo_def, fill_value, nprocs)
59 geo_def = geo_def.freeze()
60 if not isinstance(image_data, (np.ndarray, np.ma.core.MaskedArray)):
---> 61 raise TypeError('image_data must be either an ndarray'
62 ' or a masked array')
63 elif ((image_data.ndim > geo_def.ndim + 1) or
TypeError: image_data must be either an ndarray or a masked array
Sadly, the old interfaces do not support dask. We've been playing around with support for Xarray DataArray objects backed by dask when using Satpy and have had a lot of success, but not everything we want lives in pyresample at the moment. That said I think you could do what you want if you are willing to wrap things in a DataArray first and are willing to deal with some less than perfect interfaces.
import numpy as np
import dask.array as da
import xarray as xr
from pyresample import geometry
from pyresample.kd_tree import XArrayResamplerNN
area_def = geometry.AreaDefinition('areaD', 'Europe (3km, HRV, VTC)', 'areaD',
{'a': '6378144.0', 'b': '6356759.0',
'lat_0': '50.00', 'lat_ts': '50.00',
'lon_0': '8.00', 'proj': 'stere'},
800, 800,
[-1370912.72, -909968.64,
1029087.28, 1490031.36])
msg_area = geometry.AreaDefinition('msg_full', 'Full globe MSG image 0 degrees',
'msg_full',
{'a': '6378169.0', 'b': '6356584.0',
'h': '35785831.0', 'lon_0': '0',
'proj': 'geos'},
3712, 3712,
[-5568742.4, -5568742.4,
5568742.4, 5568742.4])
# Here I have 10 "timesteps" (pyresample calls these "channels")
# The channels are the final axis (axis=2) of the array
# works if I use numpy
# data = np.random.rand(3712, 3712, 10)
data = da.random.random((3712, 3712, 10), chunks=(3712, 3712, 1))
data = xr.DataArray(data, dims=('y', 'x', 'time'))
resampler = XArrayResamplerNN(msg_area, area_def, 50000)
resampler.get_neighbour_info()
result = resampler.get_sample_from_neighbour_info(data)
Note that nearest neighbor in pyresample uses the KDTree from pykdtree
which is faster than scipy and uses OpenMP, but is not dask or multi-process friendly otherwise. This means that we have to create a single KDTree from the lon/lats of the source area definition (3712, 3712). We then query it in parallel. Also, since it is using OpenMP you may want to limit the number of OpenMP threads being used export OMP_NUM_THREADS=1
(or 2). See https://satpy.readthedocs.io/en/latest/faq.html for other tips that we normally recommend when using dask and resampling.
Edit: The point of the last paragraph was to say: I've never tested these dask interfaces on a cluster or multiprocess scheduler. Threaded scheduler will probably perform best.
And...because I have you here, keep an eye out for https://github.com/geoxarray/geoxarray which should simplify some of this in the future (taking a Dataset from anywhere and remap it with pyresample).
@rabernat I don't know if you question is still relevant, but I just want to name the recent work on the resample_blocks
function: https://pyresample.readthedocs.io/en/latest/api/pyresample.html#pyresample.resampler.resample_blocks
It aims at being a counterpart to dask's map_blocks
for transforming arrays between different projections.
@djhoese @mraspaud are there any examples of using the resample_blocks
function with Xarray objects? I'm interested in trying it out as an alternative to using map_blocks
with rasterio's reprojection (xref https://github.com/carbonplan/ndpyramid/issues/10).
@maxrjones I'm just coming back from paternity leave and also don't have much experience with customizing calls to resample_blocks
, but if I remember correctly @mraspaud designed it only for dask array inputs and not DataArray/Dataset inputs. @mraspaud will have to provide any more details if he has them.
resample_blocks
was written with dask arrays in mind, but it should be quite straightforward to add a small wrapper to support xarray also. The docstring shows how it was thought to be used, but feels free to ask more specific questions if you need help.
Sounds good, thanks!