rioxarray
Using Dask in reprojection
I believe rioxarray loads the whole data array before reprojecting. Reprojecting a data array chunk by chunk would make it possible to handle much bigger data. It would also stay within the Dask lazy execution model.
It would definitely be a nice feature to have. One gotcha to watch out for is that reprojection considers points outside of the dask chunk due to resampling.
This is something we are constantly working on in the pyresample project (https://github.com/pytroll/pyresample/). We have a lot of dask-compatible solutions but no explicitly defined interfaces. We're working towards a "Resampler" class that would make it easier to control dask-based resampling algorithms. Note that we aren't calling into the GDAL resampling algorithms.
It should be noted that resampling is not the easiest operation to port to dask friendliness. One issue, as @snowman2 mentioned, is that you often have to deal with an overlap of source data chunks. Dask has some functions that can help with that, like map_overlap (https://docs.dask.org/en/latest/array-overlap.html). The other issue is just the nature of resampling and trying to do it per chunk. You typically have to build a custom dask graph if you want to do anything efficient. For example, I have an Elliptical Weighted Averaging (EWA) algorithm in pyresample that I just recently rewrote this way. You can see the progress here: https://github.com/pytroll/pyresample/pull/284. The hard part is efficiently saying "this input chunk should map to this output chunk". This type of check typically means you have to have M * N tasks, where M is the number of input chunks and N is the number of output chunks. I've had to do some ugly things in that PR to try to do things "intelligently".
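For reference, the dask overlap pattern mentioned above looks roughly like this; the kernel here is just a stand-in for a real per-chunk resampler, and the shapes and halo size are arbitrary:

import dask.array as da
import numpy as np

src = da.random.random((4096, 4096), chunks=(1024, 1024))

def kernel(block):
    # stand-in for a per-chunk operation that needs pixels beyond the chunk
    # edge; dask hands it the block padded with `depth` extra pixels per side
    return np.gradient(block)[0]

# dask adds a 16-pixel halo from neighbouring chunks and trims it again afterwards
out = src.map_overlap(kernel, depth=16, boundary="reflect")
result = out.compute()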
CC the other pyresample core devs @pnuu and @mraspaud
Thanks @djhoese, I'll check out pyresample.
Oh I should have mentioned that the interfaces that aren't fully defined in pyresample are being used in Satpy. That might be a better starting point if you're curious. Of course, we're always working towards better rioxarray, rasterio, and plain old xarray compatibility.
I wonder if WarpedVRT could be useful here? I bet there is a way to have each chunk lazily loaded from disk and reprojected using the data on disk at the same time, with dask in parallel. This would remove any/all buffer problems as it would have access to all of the data on disk.
- https://rasterio.readthedocs.io/en/latest/topics/virtual-warping.html
- https://rasterio.readthedocs.io/en/latest/topics/concurrency.html
Very interesting. I'm not sure whether virtual warping is doing something extremely magical or just doing the warping when you ask for it. Your concurrency link has been brought up before in a dask issue and I had some concerns that I discussed here: https://github.com/dask/dask/issues/3255#issuecomment-626220009
I'll quote it here:
I'm very confused by that rasterio example. I've never used a ThreadPoolExecutor, but it doesn't seem like any of the multi-threaded/parallel code in that example has to do with rasterio. The only thing spanning thread boundaries are numpy arrays that are read from a geotiff in the main thread and written to a geotiff in the main thread. The only thing that looks possibly multi-threaded for rasterio is the data_gen generator used in:
for window, result in zip(windows, executor.map(compute, data_gen)):
But that would mean that .map is passing generator context to each thread, which seems dangerous, rather than iterating over the generator and giving the resulting object (a numpy array) to the compute function in a separate thread.
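To make the pattern under discussion concrete, here is a simplified sketch (not the exact docs example; paths and the compute function are placeholders): all rasterio reads and writes stay in the main thread, and only numpy arrays cross the thread boundary via executor.map.

import concurrent.futures
import numpy as np
import rasterio

def compute(arr):
    # purely numpy work, safe to run in worker threads
    return np.sqrt(arr.astype("float64"))

with rasterio.open("input.tif") as src:
    profile = src.profile
    profile.update(dtype="float64")
    windows = [w for _, w in src.block_windows(1)]
    with rasterio.open("output.tif", "w", **profile) as dst:
        with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
            # the generator is consumed in the main thread when map() submits tasks
            data_gen = (src.read(window=w) for w in windows)
            for window, result in zip(windows, executor.map(compute, data_gen)):
                dst.write(result, window=window)  # write back in the main thread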
just doing the warping when you ask for it
It should just do it when you ask for it - lazy loading as I understand it.
The only thing spanning thread boundaries are numpy arrays that are read from a geotiff in the main thread
This is exactly the bit I am plugging for - read in and warp the data into a dask array in parallel.
https://rasterio.readthedocs.io/en/latest/api/rasterio.vrt.html
Abstracts the details of raster warping and allows access to data that is reprojected when read.
I read through the code, and it seems like this is already possible to do - you can pass in a rasterio.WarpedVRT: https://corteva.github.io/rioxarray/stable/rioxarray.html#rioxarray-open-rasterio
I haven't tested loading it in parallel with dask personally. Would be interesting to test sometime.
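Untested here (as noted), but the usage would look roughly like this, with the input path and target CRS as placeholders:

import rasterio
import rioxarray
from rasterio.vrt import WarpedVRT

with rasterio.open("input.tif") as src:
    with WarpedVRT(src, crs="EPSG:4326") as vrt:
        # each dask chunk is read through the VRT, i.e. warped on read
        xds = rioxarray.open_rasterio(vrt, chunks={"x": 512, "y": 512})
        # force the chunked reads while the VRT is still open
        xds.rio.to_raster("reprojected.tif")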
Hi,
I worked on the Pyresample gradient search resampler (originally implemented by @mraspaud) in https://github.com/pytroll/pyresample/pull/282 a while back. The resampler now maps the source chunks to those, and only those, target chunks they have any overlap with. All the chunks that do not overlap are discarded to reduce the amount of computation. As there are often multiple input chunks contributing to an output chunk, these contributions are stacked before the full output array is formed.
The source -> target chunk mapping needs some non-dask computations to check for the overlap, but these are cheap operations for cases where the source data has a proper projection. The actual resampling (implemented in Cython) is delayed and the delayed results are concatenated (and padded) to form the resulting 2/3D array.
There are certainly some optimizations I missed, but I'm pretty happy with the results so far :-)
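This is not pyresample's actual implementation, just a rough illustration of the kind of cheap, non-dask pre-check described above: project a source chunk's corner coordinates into the target CRS and keep only the target chunks whose bounding boxes it overlaps. The function name and bounds layout are made up for the example.

from pyproj import Transformer

def chunk_overlaps(src_bounds, dst_bounds, src_crs, dst_crs):
    """src_bounds/dst_bounds are (xmin, ymin, xmax, ymax) in their own CRS."""
    t = Transformer.from_crs(src_crs, dst_crs, always_xy=True)
    xs = [src_bounds[0], src_bounds[2], src_bounds[0], src_bounds[2]]
    ys = [src_bounds[1], src_bounds[1], src_bounds[3], src_bounds[3]]
    px, py = t.transform(xs, ys)
    xmin, xmax = min(px), max(px)
    ymin, ymax = min(py), max(py)
    # simple bounding-box intersection in the target CRS; real code would also
    # densify the chunk edges, since straight edges can curve after reprojection
    return not (xmax < dst_bounds[0] or xmin > dst_bounds[2]
                or ymax < dst_bounds[1] or ymin > dst_bounds[3])

# e.g. keep only the (source, target) chunk pairs where this returns True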
Looks like the ODC group has a nice solution for dask reprojection: https://github.com/corteva/rioxarray/issues/130#issuecomment-645766627
Nice! You think we can steal that code? :smile:
You think we can steal that code?
It would be a lot of code to take over. There are some bits from the datacube-core to pull in as well as from the odc-tools. If we did that, we would need to add the odc-tools license into the repo and make sure to credit them for the code taken.
My preference would be for them to put it into a separate package that we could import in and use. Might be a worthwhile discussion to have.
Opened an issue: https://github.com/opendatacube/odc-tools/issues/58 :crossed_fingers:
+1 on integrating dask/lazy arrays into the reprojection method. Suggested above was using WarpedVRT, which I have used on some projects. Here is a gist showing the workflow: https://gist.github.com/rmg55/875a2b79ee695007a78ae615f1c916b2...
I also noticed that gdal 3.1 has support for multidimensional rasters (including mem and vrt). Not sure exactly how that might be integrated into rasterio, but I wonder if that could provide some additional gdal functionality into xarray (rioxarray) objects.
Suggested above was using WarpedVRT, which I have used on some projects ...
That is pretty neat. Thanks for sharing :+1:. How much did it improve performance?
Side note: I am curious if the chunks argument to open_rasterio would work in your example or not?
I also noticed that gdal 3.1 has support for multidimensional rasters (including mem and vrt). Not sure exactly how that might be integrated into rasterio, but I wonder if that could provide some additional gdal functionality into xarray (rioxarray) objects.
It has been brought up in the past: https://rasterio.groups.io/g/dev/topic/33040759. If there are enough interested parties willing to develop/maintain that feature, there is a slim possibility of it being added to rasterio.
Here is a gist showing the workflow: https://gist.github.com/rmg55/875a2b79ee695007a78ae615f1c916b2...
Is this something you would be interested in adding to the examples section in the documentation of rioxarray?
Happy to share @snowman2, and yes, I would be happy to add the example to the docs.
Also, I just saw this PR in xarray that allows xarray to read the WarpedVRT object directly, so I have removed some unnecessary lines of code in the gist I posted above.
That is pretty neat. Thanks for sharing 👍. How much did it improve performance?
This is a pretty small file (26 MB), so I think the dask/scheduling overhead outweighs the parallel speedup in I/O.
Side note: I am curious if the chunks argument to open_rasterio would work in your example or not?
Yes, you can use the chunks argument in open_rasterio as well.
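For reference, that would look something like this (the path is a placeholder):

import rioxarray

# returns a dask-backed DataArray; nothing is read until you compute or write
xds = rioxarray.open_rasterio("input.tif", chunks={"x": 512, "y": 512})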
It has been brought up in the past: https://rasterio.groups.io/g/dev/topic/33040759. If there are enough interested parties willing to develop/maintain that feature, there is a slim possibility of it being added to rasterio.
Thanks for linking to the rasterio discussion on multi-dim rasters.
The thing I have not been able to figure out is how to build a WarpedVRT from an xarray object. I think it could be useful to represent xarray objects as a VRT that, rather than pointing to on-disk files, points to numpy or dask arrays. However, I am not sure that is possible with the VRT format...
Happy to share @snowman2, and yes, I would be happy to add the example to the docs.
That would be great :+1:
The thing I have not been able to figure out is how to build a WarpedVRT from an xarray object.
Interesting. Not sure if it makes sense to do so as the VRT is file based IIRC, but I haven't given it much thought. Feel free to open a new issue to look into and discuss this further.
Looks like the ODC group has a nice solution for dask reprojection: #130 (comment)
I'm working on something based on this and with generous advice from Kirill Kouzoubov (though I'm sure I've made errors of my own, and I haven't set up the error handling very well, yet). I added multiband chunking and support for applying separate kwargs to different band chunks.
https://github.com/lex-c/gdal_dask_reproject
It took a few hours with the rio command line to warp a 40 GB VRT dataset to set bounds with a 400 GB upscale - it would be interesting to see how the dask/rioxarray variety does - match the chunk size to the rasterio memory use 'defaults'?
Related:
- https://github.com/opendatacube/odc-geo/issues/26
- https://github.com/pytroll/pyresample/pull/341
Related:
Yes, looking forward to trying the odc-geo version out sometime.
Any progress on this? I am looking to follow the large reproject suggestions, but with a VRT we build, e.g.
import rasterio
import rioxarray
from rasterio.vrt import WarpedVRT

env = rasterio.Env(
    GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR",
    CPL_VSIL_CURL_USE_HEAD=False,
    CPL_VSIL_CURL_ALLOWED_EXTENSIONS="VRT",
)
with env:
    with rasterio.open('gs://some/path.vrt') as src:
        with WarpedVRT(src, crs="EPSG:3857") as vrt:
            darray = rioxarray.open_rasterio(
                vrt, chunks={"x": 512, "y": 512}
            )
but I get RasterioIOError: '/vsigs/some/path.vrt' does not exist in the file system, and is not recognized as a supported dataset name.
I am getting a sense that there is something more to it than this...
To report back here (I should have done it earlier, sorry):
We have implemented a resample_blocks function in pyresample: https://pyresample.readthedocs.io/en/latest/api/pyresample.html#pyresample.resampler.resample_blocks
This makes it possible to perform various operations (e.g. resampling) by blocks when the source and target arrays are in different CRSs. resample_blocks ensures in principle that the input data passed to the custom func covers the chunk at hand in the target domain, but not much more. So there could be some overlap, but not much.
Just wanted to let people know that odc-geo>=0.4.0 now supports Dask inputs/outputs. It uses GDAL via rasterio to perform the actual pixel work; pyproj and shapely are used for determining the minimal dependency graph from source to output chunks. Xarray inputs with an arbitrary number of dimensions are supported as long as the Y,X axes are next to each other and in that order.
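If I understand the odc-geo docs correctly, usage would be roughly along these lines; the path, target CRS, chunk sizes, and resampling choice are placeholders, and the exact API should be checked against the odc-geo documentation:

import rioxarray
import odc.geo.xr  # noqa: F401  (registers the .odc accessor)

xx = rioxarray.open_rasterio("input.tif", chunks={"x": 2048, "y": 2048})
# with a dask-backed input and odc-geo>=0.4.0 the result stays lazy
yy = xx.odc.reproject("EPSG:3857", resampling="bilinear")
yy = yy.compute()  # or write it out; pixels are warped chunk by chunk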
Thanks very much, Kirill.
How big a thing have you tested? :)