pangeo-example-notebooks
Improving Performance with Cloud-Optimized GeoTIFFs (COGs) - xarray, rasterio, dask
As we mentioned in our landsat8 demo blog post (https://medium.com/pangeo/cloud-native-geoprocessing-of-earth-observation-satellite-data-with-pangeo-997692d91ca2), there is still much room for improvement.
Here is a nice benchmarking analysis of reading cloud-optimized-geotiffs (COGs) on AWS: https://github.com/opendatacube/benchmark-rio-s3/blob/master/report.md#rasterio-configuration
And discussion of the report here: http://osgeo-org.1560.x6.nabble.com/Re-Fwd-Cloud-optimized-GeoTIFF-configuration-specifics-SEC-UNOFFICIAL-tt5367948.html
It would be great to do similar benchmarking with our example, and see if there are simple ways to improve how COGs are read with the combination of xarray, dask, and rasterio.
Pinging some notebook authors on this one, @mrocklin, @jhamman, @rsignell-usgs, @darothen !
Thanks for posting this @scottyhq. That was a very interesting read. I'm now curious about how rasterio/GDAL handles reading tiles on remote stores like S3. From the report it sounds like they issue a separate synchronous request for each tile. That would indeed be unfortunate, assuming that our tiles are somewhat small.
Doing some informal exploration ourselves would probably be a good use of time. I'm also curious what other performance-oriented COG groups do in practice. Assuming that things are as bad as they say, some possibilities:
- See if we can improve GDAL/rasterio to at least read longer ranges when we're asking for several contiguous tiles at once (my hope is that they already do this)
- If not, ask people to store data with larger tiles
- Write our own lighter-weight COG reader that perhaps does some optimizations as above, and perhaps operates asynchronously (though this may quickly become a rabbit hole?)
It's worth noting that currently XArray uses a lock around rasterio calls, so this is possibly more extreme on our end.
Write our own lighter-weight COG reader
Not to beat a drum unnecessarily, but this sounds a lot like something that could live in an Intake driver.
Thanks for the input @mrocklin and @martindurant
Personally, I would not go the route of a new reader because GDAL and rasterio put significant effort into this. Once we know the best settings for a given rasterio/gdal version it seems that Intake would definitely simplify the process of loading data in the most efficient manner!
New configuration options for Cloud storage are becoming available with each GDAL release (many post version 2.3): https://trac.osgeo.org/gdal/wiki/ConfigOptions#GDAL_HTTP_MULTIRANGE https://www.gdal.org/gdal_virtual_file_systems.html#gdal_virtual_file_systems_vsis3
There are some issues in rasterio worth linking to here as well: https://github.com/mapbox/rasterio/issues/1513 https://github.com/mapbox/rasterio/issues/1507 https://github.com/mapbox/rasterio/issues/1362
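For concreteness, here's roughly how such GDAL configuration options can be passed through rasterio (a minimal sketch; the option values and URL are placeholders, not tuned recommendations):

```python
import rasterio
from rasterio.windows import Window

cog_url = 'https://somebucket.s3.amazonaws.com/path/to/cog.tif'  # placeholder URL

# Illustrative settings only -- not benchmarked recommendations.
env = rasterio.Env(
    GDAL_DISABLE_READDIR_ON_OPEN='EMPTY_DIR',  # skip listing the remote "directory" on open
    CPL_VSIL_CURL_ALLOWED_EXTENSIONS='.tif',   # restrict vsicurl probing to .tif files
    GDAL_HTTP_MULTIRANGE='YES',                # allow merged multi-range requests (GDAL >= 2.3)
    VSI_CACHE=True,                            # cache blocks fetched through /vsicurl/
)

with env:
    with rasterio.open(cog_url) as src:
        block = src.read(1, window=Window(0, 0, 512, 512))  # read a 512x512 corner
```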
@mrocklin can you elaborate on your last comment please?
It's worth noting that currently XArray uses a lock around rasterio calls, so this is possibly more extreme on our end.
Perhaps, but I think that you wouldn't want all of the complexity of such a thing within intake itself. I think that this is lower level. It could be a plugin or whatever, sure, but that's a conversation that would happen much much later if folks choose to take that route.
https://github.com/pydata/xarray/blob/671d936166414edc368cca8e33475369e2bb4d24/xarray/backends/rasterio_.py#L311-L314
I suspect that folks found that the rasterio library was not threadsafe, and so they protect calling multiple rasterio functions at the same time. This would affect multi-threaded dask workers. If one worker was pulling data using rasterio then any other worker asked to pull data at the same time would wait until the first one finished.
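Roughly, the pattern looks like this (a simplified sketch, not xarray's actual code):

```python
import threading

import rasterio

# One module-level lock shared by every file, so concurrent reads from
# dask's threads are serialized no matter which file they target.
RASTERIO_LOCK = threading.Lock()

def read_window(path, band, window):
    with RASTERIO_LOCK:  # any other thread reading any file waits here
        with rasterio.open(path) as src:
            return src.read(band, window=window)
```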
I saw this issue linked from the rasterio repo, so I thought I'd chime in if you don't mind. I'm an author of the benchmarking report linked by @scottyhq.
- GDAL doesn't read one tile at a time; adjacent tiles (in byte space) are read together in one HTTP fetch. Cloud-optimized GeoTIFFs have tiles stored in row-major order, so reads that span multiple tiles horizontally will be faster than reads that span multiple tiles vertically. GDAL's vsicurl IO driver will not attempt to read bytes you haven't requested, even if it would be faster to do so in the S3 case, but it does coalesce adjacent requests into one.
- About the lock: rasterio's default open behaviour is to re-use already-opened file handles (this happens inside the GDAL library). GDAL's concurrency model assumes one file handle per thread, so handle re-use breaks concurrency when two threads read from the same resource, even if each one opened "their own version". However, more recent versions of rasterio have a `sharing=False` parameter; using that, GeoTIFF access should be thread-safe. But if you are accessing netCDF/HDF5 files, `sharing=False` won't help, since the hdf5 library is not thread-safe at all: it assumes a single thread accesses the whole library at a time (it's really bad like that). So if you want to support any file type that GDAL supports in a thread-safe manner, the safest thing to do is to grab a global process lock (which is what the xarray devs are doing). I guess one could detect the file type and have a whitelist of "safe IO drivers" that can be read concurrently.
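For GeoTIFF-only access, that means something along these lines should be safe to call from multiple threads (a sketch assuming a rasterio version that accepts the `sharing` keyword):

```python
import rasterio

def read_band(url, band=1):
    # sharing=False asks GDAL not to re-use an existing handle for this
    # resource, so each thread works with its own dataset handle.
    with rasterio.open(url, sharing=False) as src:
        return src.read(band)
```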
Thanks for joining this conversation @Kirill888 . And thank you for the excellent benchmark.
Short term it sounds like there are two easy wins that we might want to do:
- Change XArray's rasterio-lock to be per file rather than for the full library, or (perhaps easier) use the sharing=False parameter and see the performance effect of that.
- Change our default chunking to favor row-major order
@Kirill888 regarding the point about layout: let's say I have a 1000x1000 array tiled into 100x100 chunks and I ask for a 500x500 piece in the upper left, i.e. `x[:500, :500]`. My understanding of what you say above is that this issues five separate web requests. Is that correct?
@mrocklin your understanding is correct. Assuming a typical TIFF layout and empty caches, this read will generate 5 requests, one for each `x[:100, :500]`-shaped slice. If however you were to read `x[:500, :]`, this would be done in one request, and will probably be faster when reading from an S3 bucket in the same region.
You can see it for yourself by injecting this into `rasterio.Env`:
- `CPL_CURL_VERBOSE=True`
This will print detailed info for everything `libcurl` is doing on behalf of GDAL to `stderr`. To extract just the byte ranges read, I use this snippet:
```sh
#!/bin/sh
exec awk -F'[=-]' '/^Range:/{print $2 "\t" $3}' $@
```
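To make the coalescing rule concrete, here's a toy model of the request count for reading the upper-left corner of a 1000x1000 image with 100x100 tiles (it assumes only byte-adjacent ranges are merged, and ignores caching and header reads):

```python
import math

def n_requests(read_rows, read_cols, img_cols=1000, tile=100):
    """Toy estimate: one coalesced request per partially-covered tile row;
    a full-width read is one contiguous byte range and hence one request."""
    if read_cols >= img_cols:
        return 1
    return math.ceil(read_rows / tile)

print(n_requests(500, 500))   # x[:500, :500] -> 5 requests
print(n_requests(500, 1000))  # x[:500, :]    -> 1 request
```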
OK, so we should prefer auto-chunking that assumes row-major order among tiles. We might even consider warning if the user asks for something else. Those are pretty easy to achieve.
Thank you @Kirill888 for chiming in. We knew that our access times were much worse than they should have been; this gives us a couple of good things to take on to make progress here.
Thanks so much for helping with this @mrocklin and @Kirill888.
I put together an example notebook to help guide a solution using the master branch of xarray, since I thought the recent refactor might affect things (https://github.com/pydata/xarray/pull/2261):
https://github.com/scottyhq/xarray/blob/rasterio-env/examples/cog-test-rasterioEnv.ipynb
The notebook captures the GDAL HTTP requests happening in the background and shows pretty clearly that xr.open_rasterio() is issuing more GET requests than required...
we should prefer auto-chunking that assumes row-major order among tiles
Is this possible currently? Is there a workaround, or does `xr.open_rasterio` need modifying?
In putting together the notebook I noticed a few other things. Not all of the options from `rasterio.open` are available from `xr.open_rasterio`. Relevant to this discussion in particular is `sharing=False`. The machinery of `xr.open_rasterio` is a little over my head, so I'm hoping someone else can help out here. I don't think it's as simple as allowing for `**kwargs`...
https://github.com/pydata/xarray/blob/671d936166414edc368cca8e33475369e2bb4d24/xarray/backends/rasterio_.py
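One thing that might work without any xarray changes is a per-file lock instead of the global one (untested, and assuming `xr.open_rasterio`'s `lock` keyword is passed through to the dask reads):

```python
import threading
import xarray as xr

# Placeholder URLs for two bands of the same scene.
url_b4 = 'https://somebucket.s3.amazonaws.com/scene/B4.TIF'
url_b5 = 'https://somebucket.s3.amazonaws.com/scene/B5.TIF'

chunks = {'band': 1, 'x': 1024, 'y': 1024}

# Each file gets its own lock, so reads from B4 and B5 can overlap while
# reads within a single file are still serialized.
b4 = xr.open_rasterio(url_b4, chunks=chunks, lock=threading.Lock())
b5 = xr.open_rasterio(url_b5, chunks=chunks, lock=threading.Lock())
```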
we should prefer auto-chunking that assumes row-major order among tiles
Is this possible currently? Is there a work around, or does xr.open_rasterio need modifying?
In this line:
```python
with xr.open_rasterio(httpURL, chunks={'band': 1, 'x': xchunk, 'y': ychunk}) as da:
```
You should experiment with the aspect ratio of chunk sizes. I suspect that you'll want chunks that are not square, but rather chunks that are as long as possible in one direction.
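Something like this, for example (the sizes and URL are placeholders; the best values depend on the file's internal tile layout):

```python
import xarray as xr

url = 'https://somebucket.s3.amazonaws.com/path/to/cog.tif'  # placeholder URL

# Square chunks: each chunk touches many tile rows -> many range requests per chunk.
da_square = xr.open_rasterio(url, chunks={'band': 1, 'x': 2048, 'y': 2048})

# Wide chunks: span the full width in x and stay thin in y, so each chunk maps
# onto a contiguous run of row-major tiles and fewer HTTP requests.
da_wide = xr.open_rasterio(url, chunks={'band': 1, 'x': 8192, 'y': 512})
```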
@scottyhq - thanks for putting this together. TLDR; xarray probably will need some modifications to make some of what you want happen.
I'm not particularly surprised xarray's rasterio backend isn't well optimized for this use case. I suspect you are actually the most qualified to take a few swings at modifying the rasterio backend in xarray. I would be happy to get you spun up on this.
Another thing to think about is @mrocklin's PR: https://github.com/pydata/xarray/pull/2255. I'm not sure exactly what stopped progress on this but you may want to take a look there too.
Our landsat example isn't working with the latest versions of xarray (0.11.0), rasterio (1.0.10), and dask (0.20.2, distributed 1.24, dask-kubernetes 0.6). This seems to be an issue with dask distributed having access to global environment variables, but I'm not exactly sure...
Note things work fine w/o the kubernetes cluster, but the same command on the cluster results in a CURL error. To me, it seems workers aren't seeing the os.environ configuration or don't have access to the certificate path? I'm wondering if the recent xarray refactor might be affecting things (https://github.com/pydata/xarray/pull/2261)?
Notebook here with full error traceback: https://github.com/scottyhq/cog-benchmarking/blob/master/notebooks/landsat8-cog-ndvi-mod.ipynb
For what it's worth, the error I was encountering:
```
/srv/conda/lib/python3.6/site-packages/rasterio/__init__.py in open()
    215   # None.
    216   if mode == 'r':
--> 217       s = DatasetReader(path, driver=driver, **kwargs)
    218   elif mode == 'r+':
    219       s = get_writer_for_path(path)(path, mode, driver=driver, **kwargs)

rasterio/_base.pyx in rasterio._base.DatasetBase.__init__()

RasterioIOError: CURL error: error setting certificate verify locations: CAfile: /etc/pki/tls/certs/ca-bundle.crt CApath: none
```
The error is due to rasterio 1.0.10 being installed via pip. Installing rasterio 1.0.10 via conda-forge resolves the problem.
This is a known issue (https://github.com/mapbox/rasterio/commit/b621d92c51f7c2021f89cd4487cecdd7c201f320), but for some reason `export CURL_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt` didn't resolve the error when running on a dask distributed cluster.
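One workaround that should make sure the variable is actually set in every worker process, rather than just in the notebook kernel, is something like this (a sketch using `client.run`; the bundle path is an example):

```python
import os
from dask.distributed import Client

client = Client()  # or the client attached to the dask-kubernetes cluster

def set_ca_bundle(path='/etc/ssl/certs/ca-certificates.crt'):
    # Runs inside each worker process, so rasterio/GDAL on the workers sees it.
    os.environ['CURL_CA_BUNDLE'] = path
    return os.environ.get('CURL_CA_BUNDLE')

client.run(set_ca_bundle)  # returns {worker_address: value} for verification
```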
nice @scottyhq! Anything left to do here?
Let's keep this open since improvements are showing up regularly:
Planning to do some additional benchmarking with xarray/dask/rasterio. In the meantime, here are some recent relevant benchmarks for GDAL, which is ultimately doing all the IO:
- Optimizing GDAL configuration: https://lists.osgeo.org/pipermail/gdal-dev/2018-December/049508.html
- COG creation and benchmarking for a few datasets on AWS: https://github.com/vincentsarago/awspds-benchmark and https://medium.com/@VincentS/do-you-really-want-people-using-your-data-ec94cd94dc3f
- On various compression settings for COGs: https://kokoalberti.com/articles/geotiff-compression-optimization-guide/