Issue reading geotrellis (2.3.3) catalog with rasterframes 0.8.4 with s3 backend when using a custom S3 Producer

Open jdenisgiguere opened this issue 6 years ago • 8 comments

Current situation

I have a geotrellis catalog using the S3 backend. Catalog and data are stored on a minio server. I'm using Geotrellis v2.3.3

When I try to access the catalog with rasterframes v0.8.4, I get the following error messages:

scala> catalogUri
res14: java.net.URI = s3a://geoimagery/geotrellis_geoimagery/

scala> spark.read.geotrellisCatalog(catalogUri)
scala.MatchError: List(metadata__geoimagery_2002__0.json) (of class scala.collection.immutable.$colon$colon)
  at geotrellis.spark.io.hadoop.HadoopAttributeStore$$anonfun$layerIds$1.apply(HadoopAttributeStore.scala:148)
  at geotrellis.spark.io.hadoop.HadoopAttributeStore$$anonfun$layerIds$1.apply(HadoopAttributeStore.scala:147)
  at scala.collection.immutable.List.map(List.scala:284)
  at geotrellis.spark.io.hadoop.HadoopAttributeStore.layerIds(HadoopAttributeStore.scala:147)
  at org.locationtech.rasterframes.datasource.geotrellis.GeoTrellisCatalog$GeoTrellisCatalogRelation.layers$lzycompute(GeoTrellisCatalog.scala:76)
  at org.locationtech.rasterframes.datasource.geotrellis.GeoTrellisCatalog$GeoTrellisCatalogRelation.layers(GeoTrellisCatalog.scala:64)
  at org.locationtech.rasterframes.datasource.geotrellis.GeoTrellisCatalog$GeoTrellisCatalogRelation.schema(GeoTrellisCatalog.scala:103)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:403)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.locationtech.rasterframes.datasource.geotrellis.package$DataFrameReaderHasGeotrellisFormat.geotrellisCatalog(package.scala:52)
  ... 63 elided

Expected situation

I would expect to be able to read the catalog with this configuration.

Detailed environment

  • rasterframes built from tag 0.8.4-b using the circleci image
  • spark 2.4.4 without hadoop binary
  • hadoop 2.8.5 (for minio support)

jdenisgiguere avatar Jan 06 '20 20:01 jdenisgiguere

My only guess here is what version of GeoTrellis the catalog was created with?

Since the error is thrown in the geotrellis.spark.io.hadoop package, that's where I would go looking for changes. It looks like the packages have been reorganized in the GT 3.x series, but I'm not familiar with the backward-compatibility situation for catalogs and layers.

vpipkt avatar Jan 07 '20 14:01 vpipkt

Thank you @vpipkt for your quick answer.

We use Geotrellis 2.3.3 which is the version required for rasterframes 0.8.4 according to project/RFDependenciesPlugin.scala.

I would expect to see S3AttributeStore instead of HadoopAttributeStore for a URI with the prefix s3a://.
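
The MatchError in the stack trace is consistent with a file-naming mismatch between the two stores. The sketch below reproduces that failure mode in isolation. It assumes, without checking the GeoTrellis 2.3.3 sources, that the Hadoop store splits attribute file names on a triple underscore while the S3 store writes them with a double underscore. The separator values, the object name, and the `parseHadoopStyle` helper are all hypothetical.

```scala
import scala.util.Try

object SeparatorMismatch {
  // Assumed separators, for illustration only; not taken from the GeoTrellis sources.
  val HadoopStyleSep = "___"
  val S3StyleSep = "__"

  // Destructure a file name into exactly two leading parts.
  // Throws scala.MatchError when the split yields fewer than two parts.
  def parseHadoopStyle(fileName: String): (String, String) = {
    val List(name, zoom) = fileName.split(HadoopStyleSep).take(2).toList
    (name, zoom)
  }

  def main(args: Array[String]): Unit = {
    // File name taken from the error message in this issue.
    val s3StyleName = "metadata__geoimagery_2002__0.json"
    // No triple underscore is present, so the split returns a single element
    // and the two-element list pattern cannot match.
    println(Try(parseHadoopStyle(s3StyleName)))
  }
}
```

If that assumption holds, a catalog written by the S3 backend cannot be listed by the Hadoop attribute store, whatever the Hadoop version.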

jdenisgiguere avatar Jan 07 '20 15:01 jdenisgiguere

Just a hunch here that maybe the geotrellis.spark.io.s3.S3LayerProvider is not on the classpath? Or perhaps the META-INF/services/geotrellis.spark.io.AttributeStoreProvider is not listing geotrellis.spark.io.s3.S3LayerProvider?

vpipkt avatar Jan 07 '20 21:01 vpipkt

@jdenisgiguere do you happen to have a public version of s3a://geoimagery/geotrellis_geoimagery/ we could use to replicate the issue?

metasim avatar Jan 09 '20 14:01 metasim

I created a git repo with data to reproduce this issue: https://github.com/jdenisgiguere/rasterframes-minio-ZazJXB4U

The repo also contains code to read the Geotrellis Layer with Geotrellis v2.3.3 and a non-working attempt to read the same data with rasterframes 0.8.5. ~~I have an issue with the management of Hadoop versions in the latter.~~

Thanks in advance for your help.

jdenisgiguere avatar Jan 30 '20 13:01 jdenisgiguere

I pushed a new commit to the proof of concept with rasterframes 0.8.5. This is my latest stack trace: https://gist.github.com/jdenisgiguere/fe3d274d1baf2ba2730c920ff8abd128

jdenisgiguere avatar Jan 30 '20 14:01 jdenisgiguere

@vpipkt, you gave me a precious hint 3 weeks ago, but I did not have enough background to understand it well. So, with the s3a:// protocol, the data is expected to come from a Hadoop data store, while s3:// uses the plain AWS Java SDK.
The Geotrellis documentation explains how to configure an S3Provider to use minio, but I don't know how to do this with rasterframes.
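
The scheme-based routing described above can be pictured with a small sketch. This is not GeoTrellis or rasterframes code. The `Backend` type and the `backendFor` function are hypothetical names that only mirror the behaviour reported in this thread: s3a:// is handled by the Hadoop store, s3:// by the AWS SDK store.

```scala
import java.net.URI

object SchemeDispatch {
  // Hypothetical stand-ins for the two families of attribute stores.
  sealed trait Backend
  case object HadoopBackend extends Backend
  case object AwsSdkBackend extends Backend

  // Pick a backend from the URI scheme alone; the bucket and key play no role.
  def backendFor(uri: URI): Option[Backend] =
    Option(uri.getScheme).map(_.toLowerCase) match {
      case Some("s3")                 => Some(AwsSdkBackend)
      case Some("s3a") | Some("hdfs") => Some(HadoopBackend)
      case _                          => None
    }

  def main(args: Array[String]): Unit = {
    // The catalog URI from this issue is routed to the Hadoop side.
    println(backendFor(new URI("s3a://geoimagery/geotrellis_geoimagery/")))
    println(backendFor(new URI("s3://geoimagery/geotrellis_geoimagery/")))
  }
}
```

Under this picture, changing the prefix to s3:// selects the AWS SDK path, but a minio endpoint would still need a custom client.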

I could also modify my backend to save data in Geotrellis with a HadoopLayerWriter. Since we cannot use Minio as an s3a storage source with the default Hadoop version bundled with spark 2.4.4 (Hadoop v2.7), there is more to learn before pyrasterframes can be used this way.

jdenisgiguere avatar Jan 31 '20 20:01 jdenisgiguere

To use the Geotrellis S3 backend with Minio, providing only the layer URI is not enough. You also need to provide the s3Client. https://github.com/locationtech/geotrellis/blob/master/s3/src/main/scala/geotrellis/store/s3/S3AttributeStore.scala#L43

If I understand correctly, we cannot currently provide this parameter when we want to read a geotrellis layer or a geotrellis catalog with rasterframes. https://github.com/locationtech/rasterframes/blob/develop/datasource/src/main/scala/org/locationtech/rasterframes/datasource/geotrellis/GeoTrellisRelation.scala#L62-L68

@vpipkt, if you think this is appropriate, this could be tagged as an enhancement, or closed since it is working as expected.

jdenisgiguere avatar Feb 05 '20 12:02 jdenisgiguere