geopyspark
geopyspark copied to clipboard
Use geotrellis-contrib to perform windowed reads on remote rasters
A typical GPS workflow might start like this:
pats = ["s3://bucket/one.tiff"]
raster_layer = gps.RasterLayer.from_numpy_rdd(gps.LayerType.SPATIAL, gps.rasterio.get(paths))
tiled_layer = raster_layer.tile_to_layout(gps.GlobalLayout(256), target_crs=3857)
tiled_layer.count()
This is great for various reasons but has following cost:
- Conversion from
numpy
to GeoTrellis Tiles on read (low cost) - Iterate over full dataset to figure out the data resolution and extent
- Seen as
RasterSummary
spark job
- Seen as
- Iterate over full dataset again to read it
- Spark Shuffle to input tiles to some layout
- Spark Shuffle to generate
BufferedTiles
for reproject operation- reproject operation is implied by inclusion of
target_crs
- reproject operation is implied by inclusion of
- Spark Shuffle to compute result of reproject operation on
BufferedTiles
Prototyped here is an alternative workflow that avoids the high cost of spark shuffles on pixel tiles in this process: https://github.com/geotrellis/geotrellis-contrib/blob/demo/wsel/wse/src/main/scala/Main.scala
This can be encapsulated in a GeoPySpark operation that combines the read
and tile_to_layout
(with optional reprojection) steps to produce a TiledRasterLayer
.
Ideally API can be similar to tile_to_layout
perhaps:
def read_to_layout(paths,
layout=LocalLayout(),
target_crs=None,
resample_method=ResampleMethod.NEAREST_NEIGHBOR,
partition_strategy=None,
reader=ReadMethod.GEOTRELLIS):
There is an outstanding question of what should happen to merge order. I think currently any overlapping raster is going to get merged in arbitrary order as result of tile_to_layout
operation and that would be an easy default to start from here.
This gist contains some more mock Python API/logic for the read_to_layout
function. One issue that was raised when writing this code was with geotrellis-contrib
's inability to create SpaceTime RDDs; as the current GPS API allows users to specify the LayerType
when reading in data. @echeipesh do you think it'd be worth refactoring the geotrellis-contrib
code so that it can work on a generic K
? Or should we just implement separate logic to deal with that case?
With the read_to_layout
method, we'll have three different ways of reading in raw geospatial data into an RDD
:
Not Tiled:
- gps.geotiff.get
- gps.rasterio.get
Tiled:
- gps.raster_source.read_to_layout
I think it'd make sense then to combine gps.geotiff.get
and gps.rasterio.get
into one function. By having one function, we'll decrease the surface area of GPS' API and simplify workflows. This could also be called, get
and it can reside in gps.raster_source
. The user could switch between GT and GDAL reading via the read_method
parameter.
The one issue with this from an API perspective is the number of parameter's gps.geotiff.get
has. To get around this, we could place them in a namedtuple
:
class GeoTiffOptions(namedtuple("GeoTiffOptions", 'chunk_size partition_bytes time_tag time_format delimiter s3_client s3_credentials')):
__slots__ = []
And then pass in those options as a parameter to get
def get(layer_type,
paths,
tile_size=MAX_TILE_SIZE,
target_crs=None,
geotrellis_options=GeoTiffOptions(),
read_method=ReaderMethod.GEOTRELLIS)
@echeipesh do you think it'd be worth refactoring the geotrellis-contrib code so that it can work on a generic K?
Yes, that's one of the next steps in that project but its probably going to take a little bit to design that API. I don't think lack of that capability should block this feature. For now I think it would be acceptable tor throw NotImplementedError
if asked for temporal layer.
I think it'd make sense then to combine gps.geotiff.get and gps.rasterio.get into one function.
Agreed that it makes sense to group this functionality by tiled and not-tiled use case. But grouping rasterio and GeoTrellis read methods could potentially be more confusing. You're going to end up with one hell of a doc-string on the function trying to explain differences in behavior. Also the ways in which either method is going to fail and needs to be configured is going to be significantly different. rasterio
is also an optional dependency and at least having it in its own module gives people some way to reason when they've need to have installed it without being surprised by an module not found error. You may have better sense of expectations here but it doesn't feel like it would simplify the the API that much for all the questions it introduces.
Maybe another way is to work towards unifying gps.geotiff.get
and read_to_layout
? In the end gps.geotiff.get
will likely end up being implemented using RasterSources
as well and then it would be the difference between asking for tiled and untiled input.
Also another differences is that gps.geotiff.get
actually performs the listing where as read_to_layout
and rasterio.get
accept a list of URIs which must have been discovered beforehand.