geopyspark icon indicating copy to clipboard operation
geopyspark copied to clipboard

Use geotrellis-contrib to perform windowed reads on remote rasters

Open echeipesh opened this issue 6 years ago • 2 comments

A typical GPS workflow might start like this:

pats = ["s3://bucket/one.tiff"]
raster_layer = gps.RasterLayer.from_numpy_rdd(gps.LayerType.SPATIAL, gps.rasterio.get(paths))
tiled_layer = raster_layer.tile_to_layout(gps.GlobalLayout(256), target_crs=3857)
tiled_layer.count()

This is great for various reasons but has following cost:

  • Conversion from numpy to GeoTrellis Tiles on read (low cost)
  • Iterate over full dataset to figure out the data resolution and extent
    • Seen as RasterSummary spark job
  • Iterate over full dataset again to read it
  • Spark Shuffle to input tiles to some layout
  • Spark Shuffle to generate BufferedTiles for reproject operation
    • reproject operation is implied by inclusion of target_crs
  • Spark Shuffle to compute result of reproject operation on BufferedTiles

Prototyped here is an alternative workflow that avoids the high cost of spark shuffles on pixel tiles in this process: https://github.com/geotrellis/geotrellis-contrib/blob/demo/wsel/wse/src/main/scala/Main.scala

This can be encapsulated in a GeoPySpark operation that combines the read and tile_to_layout (with optional reprojection) steps to produce a TiledRasterLayer.

Ideally API can be similar to tile_to_layout perhaps:

    def read_to_layout(paths,
                       layout=LocalLayout(),
                       target_crs=None,
                       resample_method=ResampleMethod.NEAREST_NEIGHBOR,
                       partition_strategy=None,
                       reader=ReadMethod.GEOTRELLIS):

There is an outstanding question of what should happen to merge order. I think currently any overlapping raster is going to get merged in arbitrary order as result of tile_to_layout operation and that would be an easy default to start from here.

echeipesh avatar Oct 17 '18 17:10 echeipesh

This gist contains some more mock Python API/logic for the read_to_layout function. One issue that was raised when writing this code was with geotrellis-contrib's inability to create SpaceTime RDDs; as the current GPS API allows users to specify the LayerType when reading in data. @echeipesh do you think it'd be worth refactoring the geotrellis-contrib code so that it can work on a generic K? Or should we just implement separate logic to deal with that case?

With the read_to_layout method, we'll have three different ways of reading in raw geospatial data into an RDD:

Not Tiled:

  • gps.geotiff.get
  • gps.rasterio.get

Tiled:

  • gps.raster_source.read_to_layout

I think it'd make sense then to combine gps.geotiff.get and gps.rasterio.get into one function. By having one function, we'll decrease the surface area of GPS' API and simplify workflows. This could also be called, get and it can reside in gps.raster_source. The user could switch between GT and GDAL reading via the read_method parameter.

The one issue with this from an API perspective is the number of parameter's gps.geotiff.get has. To get around this, we could place them in a namedtuple:


class GeoTiffOptions(namedtuple("GeoTiffOptions", 'chunk_size partition_bytes time_tag time_format delimiter s3_client s3_credentials')):
    __slots__ = []

And then pass in those options as a parameter to get


def get(layer_type,
        paths,
        tile_size=MAX_TILE_SIZE,
        target_crs=None,
        geotrellis_options=GeoTiffOptions(),
        read_method=ReaderMethod.GEOTRELLIS)

jbouffard avatar Oct 24 '18 17:10 jbouffard

@echeipesh do you think it'd be worth refactoring the geotrellis-contrib code so that it can work on a generic K? 

Yes, that's one of the next steps in that project but its probably going to take a little bit to design that API. I don't think lack of that capability should block this feature. For now I think it would be acceptable tor throw NotImplementedError if asked for temporal layer.

I think it'd make sense then to combine gps.geotiff.get and gps.rasterio.get into one function.

Agreed that it makes sense to group this functionality by tiled and not-tiled use case. But grouping rasterio and GeoTrellis read methods could potentially be more confusing. You're going to end up with one hell of a doc-string on the function trying to explain differences in behavior. Also the ways in which either method is going to fail and needs to be configured is going to be significantly different. rasterio is also an optional dependency and at least having it in its own module gives people some way to reason when they've need to have installed it without being surprised by an module not found error. You may have better sense of expectations here but it doesn't feel like it would simplify the the API that much for all the questions it introduces.

Maybe another way is to work towards unifying gps.geotiff.get and read_to_layout? In the end gps.geotiff.get will likely end up being implemented using RasterSources as well and then it would be the difference between asking for tiled and untiled input.

Also another differences is that gps.geotiff.get actually performs the listing where as read_to_layout and rasterio.get accept a list of URIs which must have been discovered beforehand.

echeipesh avatar Oct 25 '18 14:10 echeipesh