
Using Dask in reprojection

Open davidbrochart opened this issue 4 years ago • 25 comments

I believe rioxarray loads the whole data array before reprojecting. Reprojecting a data array chunk by chunk would make it possible to handle much bigger data. It would also stay within the Dask lazy execution model.

davidbrochart avatar Jun 11 '20 10:06 davidbrochart

It would definitely be a nice feature to have. One gotcha to watch out for is that reprojection considers points outside of the dask chunk due to resampling.

snowman2 avatar Jun 11 '20 13:06 snowman2

This is something we are constantly working on in the pyresample project (https://github.com/pytroll/pyresample/). We have a lot of dask-compatible solutions but no explicitly defined interfaces. We're working towards a "Resampler" class that would make it easier to control dask-based resampling algorithms. Note that we aren't calling into the GDAL resampling algorithms.

It should be noted that resampling is not the easiest operation to make dask-friendly. One issue, as @snowman2 mentioned, is that you often have to deal with an overlap of source data chunks. Dask has some functions that can help with that, like map_overlap (https://docs.dask.org/en/latest/array-overlap.html). The other issue is just the nature of resampling and trying to do it per chunk: you typically have to build a custom dask graph if you want to do anything efficient. For example, I have an Elliptical Weighted Averaging (EWA) algorithm in pyresample that I just recently rewrote this way. You can see the progress here: https://github.com/pytroll/pyresample/pull/284. The hard part is efficiently saying "this input chunk should map to this output chunk". This type of check typically means you have to have M * N tasks, where M is the number of input chunks and N is the number of output chunks. I've had to do some ugly things in that PR to try to do things "intelligently".
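To make that M * N bookkeeping concrete, here is a toy pure-Python sketch (hypothetical helper names, nothing from pyresample): deciding which input chunks can contribute to which output chunks by intersecting bounding boxes. A real resampler would first transform the input chunk corners into the output CRS.

```python
# Chunks are reduced to axis-aligned bounding boxes (xmin, ymin, xmax, ymax).

def boxes_overlap(a, b):
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1

def chunk_mapping(src_boxes, dst_boxes):
    # Naive M * N check: every (input, output) chunk pair is compared,
    # which is exactly the task explosion mentioned above.
    return {
        j: [i for i, src in enumerate(src_boxes) if boxes_overlap(src, dst)]
        for j, dst in enumerate(dst_boxes)
    }

src = [(0, 0, 5, 5), (5, 0, 10, 5), (0, 5, 5, 10), (5, 5, 10, 10)]
dst = [(4, 4, 6, 6)]  # straddles all four source chunks
print(chunk_mapping(src, dst))  # {0: [0, 1, 2, 3]}
```

Doing anything smarter than this all-pairs check is where the custom dask graph construction comes in.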

CC the other pyresample core devs @pnuu and @mraspaud

djhoese avatar Jun 11 '20 14:06 djhoese

Thanks @djhoese, I'll check out pyresample.

davidbrochart avatar Jun 11 '20 14:06 davidbrochart

Oh I should have mentioned that the interfaces that aren't fully defined in pyresample are being used in Satpy. That might be a better starting point if you're curious. Of course, we're always working towards better rioxarray, rasterio, and plain old xarray compatibility.

djhoese avatar Jun 11 '20 14:06 djhoese

I wonder if WarpedVRT could be useful here? I bet there is a way to have each chunk lazily loaded in from disk and reprojected using the data on disk at the same time, using dask in parallel. This would remove any/all buffer problems, as it would have access to all of the data on disk.

snowman2 avatar Jun 12 '20 13:06 snowman2

  • https://rasterio.readthedocs.io/en/latest/topics/virtual-warping.html
  • https://rasterio.readthedocs.io/en/latest/topics/concurrency.html

snowman2 avatar Jun 12 '20 13:06 snowman2

Very interesting. I'm not sure whether virtual warping is doing something extremely magical or just doing the warping when you ask for it. Your concurrency link has been brought up before in a dask issue and I had some concerns that I discussed here: https://github.com/dask/dask/issues/3255#issuecomment-626220009

I'll quote it here:

I'm very confused by that rasterio example. I've never used a ThreadPoolExecutor, but it doesn't seem like any of the multi-threaded/parallel code in that example has to do with rasterio. The only thing spanning thread boundaries are numpy arrays that are read from a geotiff in the main thread and written to a geotiff in the main thread. The only thing that looks possibly multi-threaded for rasterio is the data_gen generator used in:

                for window, result in zip(
                    windows, executor.map(compute, data_gen)
                ):

But that would mean that .map is passing generator context to each thread, which seems dangerous, rather than iterating over the generator and giving the resulting object (a numpy array) to the compute function in a separate thread.
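For what it's worth, `Executor.map` does not hand the generator itself to worker threads: it consumes the iterable in the calling thread when submitting tasks, and only the yielded objects (numpy arrays, in the rasterio example) cross the thread boundary. A quick stdlib experiment with toy values (no rasterio involved) illustrates this:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

gen_threads = []

def data_gen():
    # Record which thread actually advances the generator.
    for i in range(4):
        gen_threads.append(threading.get_ident())
        yield i

def compute(x):
    # Runs in a pool thread; it only ever sees the yielded value.
    return x * x

with ThreadPoolExecutor(max_workers=2) as executor:
    results = list(executor.map(compute, data_gen()))

print(results)  # [0, 1, 4, 9]
# The generator was advanced entirely in the calling thread:
print(set(gen_threads) == {threading.get_ident()})  # True
```

So the rasterio example is safe in that respect; the generator never runs concurrently.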

djhoese avatar Jun 12 '20 13:06 djhoese

just doing the warping when you ask for it

It should just do it when you ask for it - lazy loading as I understand it.

snowman2 avatar Jun 12 '20 13:06 snowman2

The only thing spanning thread boundaries are numpy arrays that are read from a geotiff in the main thread

This is exactly the bit I am plugging for - read in and warp the data into a dask array in parallel.

snowman2 avatar Jun 12 '20 13:06 snowman2

https://rasterio.readthedocs.io/en/latest/api/rasterio.vrt.html

Abstracts the details of raster warping and allows access to data that is reprojected when read.

snowman2 avatar Jun 12 '20 13:06 snowman2

I read through the code, and it seems like it is already possible to do: you can pass in a rasterio.WarpedVRT: https://corteva.github.io/rioxarray/stable/rioxarray.html#rioxarray-open-rasterio

I haven't tested loading it in parallel with dask personally. Would be interesting to test sometime.

snowman2 avatar Jun 12 '20 14:06 snowman2

Hi,

I worked on the Pyresample gradient search resampler (originally implemented by @mraspaud) in https://github.com/pytroll/pyresample/pull/282 a while back. The resampler now maps the source chunks to those, and only those, target chunks they have any overlap with. All the chunks that do not overlap are discarded to reduce the amount of computation. As there are often multiple input chunks contributing to an output chunk, these contributions are stacked before the full output array is formed.

The source -> target chunk mapping needs some non-dask computations to check for the overlap, but these are cheap operations for cases where the source data has a proper projection. The actual resampling (implemented in Cython) is delayed, and the delayed results are concatenated (and padded) to form the resulting 2/3D array.
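The stacking step can be pictured with a small numpy sketch (a hypothetical illustration, not pyresample's actual implementation): each contribution covers the whole output chunk but is NaN wherever its input chunk had no data, and the first valid value wins.

```python
import numpy as np

def combine_contributions(stack):
    # Stack of per-input-chunk contributions to one output chunk;
    # fill each output pixel from the first contribution that has data.
    out = np.full(stack[0].shape, np.nan)
    for layer in stack:
        mask = np.isnan(out) & ~np.isnan(layer)
        out[mask] = layer[mask]
    return out

a = np.array([[1.0, np.nan], [np.nan, np.nan]])
b = np.array([[9.0, 2.0], [3.0, np.nan]])
result = combine_contributions([a, b])
# result is [[1, 2], [3, nan]]: a wins where it has data, b fills the rest
```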

There are certainly some optimizations I missed, but I'm pretty happy with the results so far :-)

pnuu avatar Jun 12 '20 19:06 pnuu

Looks like the ODC group has a nice solution for dask reprojection: https://github.com/corteva/rioxarray/issues/130#issuecomment-645766627

snowman2 avatar Jun 18 '20 13:06 snowman2

Nice! You think we can steal that code? :smile:

davidbrochart avatar Jun 18 '20 13:06 davidbrochart

You think we can steal that code?

It would be a lot of code to take over. There are some bits from datacube-core to pull in, as well as from odc-tools. If we did that, we would need to add the odc-tools license into the repo and make sure to credit them for the code taken.

My preference would be for them to put it into a separate package that we could import in and use. Might be a worthwhile discussion to have.

snowman2 avatar Jun 18 '20 13:06 snowman2

Opened an issue: https://github.com/opendatacube/odc-tools/issues/58 :crossed_fingers:

snowman2 avatar Jun 18 '20 14:06 snowman2

+1 on integrating dask/lazy arrays into the reprojection method. Suggested above was using WarpedVRT, which I have used on some projects. Here is a gist showing the workflow: https://gist.github.com/rmg55/875a2b79ee695007a78ae615f1c916b2

I also noticed that gdal 3.1 has support for multidimensional rasters (including mem and vrt). Not sure exactly how that might be integrated into rasterio, but I wonder if that could provide some additional gdal functionality into xarray (rioxarray) objects.

rmg55 avatar Aug 15 '20 21:08 rmg55

Suggested above was using WarpedVRT, which I have used on some projects ...

That is pretty neat. Thanks for sharing :+1:. How much did it improve performance?

Side note: I am curious if the chunks argument to open_rasterio would work in your example or not?

I also noticed that gdal 3.1 has support for multidimensional rasters (including mem and vrt). Not sure exactly how that might be integrated into rasterio, but I wonder if that could provide some additional gdal functionality into xarray (rioxarray) objects.

It has been brought up in the past: https://rasterio.groups.io/g/dev/topic/33040759. If there are enough interested parties willing to develop/maintain that feature, there is a slim possibility of it being added to rasterio.

snowman2 avatar Aug 16 '20 00:08 snowman2

Here is a gist showing the workflow: https://gist.github.com/rmg55/875a2b79ee695007a78ae615f1c916b2

Is this something you would be interested in adding to the examples section in the documentation of rioxarray?

snowman2 avatar Aug 16 '20 00:08 snowman2

Happy to share @snowman2, and yes, I would be happy to add the example to the docs.

Also, I just saw this pr in xarray that allows xarray to read the warped_vrt object directly, so I have removed some unnecessary lines of code in the gist I posted above.

That is pretty neat. Thanks for sharing 👍. How much did it improve performance?

This is a pretty small file (26 MB), so I think the dask/scheduling overhead is greater than or equal to the parallel speedup in IO.

Side note: I am curious if the chunks argument to open_rasterio would work in your example or not?

Yes, you can use the chunks argument in open_rasterio as well.

It has been brought up in the past: https://rasterio.groups.io/g/dev/topic/33040759. If there are enough interested parties willing to develop/maintain that feature, there is a slim possibility of it being added to rasterio.

Thanks for linking to the rasterio discussion on multi-dim rasters.

The thing I have not been able to figure out is how to build a WarpedVRT from an xarray object. I think it could be useful to represent xarray objects as a VRT that, rather than pointing to on-disk files, points to a numpy or dask array. However, I am not sure that is possible with the VRT format...

rmg55 avatar Aug 16 '20 04:08 rmg55

Happy to share @snowman2, and yes, I would be happy to add the example to the docs.

That would be great :+1:

The thing I have not been able to figure out is how to build a WarpedVRT from an xarray object.

Interesting. Not sure if it makes sense to do so, as the VRT is file-based IIRC, but I haven't given it much thought. Feel free to open a new issue to look into and discuss this further.

snowman2 avatar Aug 17 '20 13:08 snowman2

Looks like the ODC group has a nice solution for dask reprojection: #130 (comment)

I'm working on something based on this and with generous advice from Kirill Kouzoubov (though I'm sure I've made errors of my own, and I haven't set up the error handling very well, yet). I added multiband chunking and support for applying separate kwargs to different band chunks.

https://github.com/lex-c/gdal_dask_reproject

calexat-123 avatar Jun 21 '21 14:06 calexat-123

It took a few hours with the rio command line to warp a 40 GB VRT dataset to set bounds with a 400 GB upscale. It will be interesting to see how the dask/rioxarray variety does - maybe match the chunk size to the rasterio memory use 'defaults'?

RichardScottOZ avatar Sep 01 '21 01:09 RichardScottOZ

Related:

  • https://github.com/opendatacube/odc-geo/issues/26
  • https://github.com/pytroll/pyresample/pull/341

snowman2 avatar Apr 04 '22 19:04 snowman2

Related:

Yes, looking forward to trying the odc-geo version out sometime.

RichardScottOZ avatar May 30 '22 07:05 RichardScottOZ

Any progress on this? I am looking to follow the large-reproject suggestions above, but with a VRT we build, e.g.

env = rasterio.Env(
    GDAL_DISABLE_READDIR_ON_OPEN="EMPTY_DIR",
    CPL_VSIL_CURL_USE_HEAD=False,
    CPL_VSIL_CURL_ALLOWED_EXTENSIONS="VRT",
)
with env:
    with rasterio.open('gs://some/path.vrt') as src:
        with WarpedVRT(src, crs="EPSG:3857") as vrt:
            darray = rioxarray.open_rasterio(
                vrt, chunks={"x": 512, "y": 512}
            )

but I get RasterioIOError: '/vsigs/some/path.vrt' does not exist in the file system, and is not recognized as a supported dataset name.

I am getting a sense that there is something more to it than this...

ljstrnadiii avatar Feb 01 '23 02:02 ljstrnadiii

To report back here (should have done it earlier sorry): We have implemented a resample_blocks function in pyresample: https://pyresample.readthedocs.io/en/latest/api/pyresample.html#pyresample.resampler.resample_blocks

This makes it possible to do various operations (e.g. resampling) block by block when the source and target arrays are in different CRSs. resample_blocks ensures in principle that the input data passed to the custom func covers the chunk at hand in the target domain, but not much more. So there could be some overlap, but not much.

mraspaud avatar Feb 03 '23 14:02 mraspaud

Just wanted to let people know that odc-geo>=0.4.0 now supports Dask inputs/outputs. It uses GDAL via rasterio to perform the actual pixel work; pyproj and shapely are used for determining the minimal dependency graph from source to output chunks. Xarray inputs with an arbitrary number of dimensions are supported as long as the Y,X axes are next to each other and in that order.

Kirill888 avatar May 22 '23 01:05 Kirill888

Thanks very much Kirill.

RichardScottOZ avatar May 22 '23 01:05 RichardScottOZ

How big a thing have you tested? :)

RichardScottOZ avatar May 22 '23 01:05 RichardScottOZ