RS_FromGeoTiff error when reading GeoTiff file

Open fmendezlopez opened this issue 11 months ago • 7 comments

Hello,

I get an error when reading a GeoTIFF file and calling the "RS_FromGeoTiff" function. The code:

` val sedona = SedonaContext.create(datioSparkSession.getSparkSession)

SedonaVizRegistrator.registerAll(sedona)

val filePath = DatioFileSystem.get().qualify("/in/staging/kris/custom/Aqueduct_FL100_2030_RCP45.tif").string()

sedona.read
  .format("binaryFile")
  .load(filePath)
  .selectExpr("RS_FromGeoTiff(content) as raster", "path")
  .selectExpr("raster", "RS_Metadata(raster) as metadata")
  .show(false)`

The error thrown: 2025-01-29T09:44:40,061 [task-result-getter-1/134] [WARN] org.apache.spark.scheduler.TaskSetManager - Lost task 0.1 in stage 0.0 (TID 1) (ip-10-60-253-200.eu-south-2.compute.internal executor 13): org.apache.spark.sql.sedona_sql.expressions.InferredExpressionException: Exception occurred while evaluating expression RS_FromGeoTiff - inputs: [[B@44d7c680], cause: null at org.apache.spark.sql.sedona_sql.expressions.InferredExpression$.throwExpressionInferenceException(InferredExpression.scala:149) at org.apache.spark.sql.sedona_sql.expressions.InferredExpression.eval(InferredExpression.scala:113) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:408) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:141) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:750) Caused by: java.lang.IllegalArgumentException at sun.misc.Unsafe.copyMemory(Native Method) at com.esotericsoftware.kryo.io.UnsafeOutput.writeBytes(UnsafeOutput.java:378) at com.esotericsoftware.kryo.io.UnsafeOutput.writeFloats(UnsafeOutput.java:348) at org.apache.sedona.common.raster.serde.KryoUtil.writeFloatArrays(KryoUtil.java:234) at org.apache.sedona.common.raster.serde.DataBufferSerializer.write(DataBufferSerializer.java:58) at org.apache.sedona.common.raster.serde.AWTRasterSerializer.write(AWTRasterSerializer.java:48) at org.apache.sedona.common.raster.DeepCopiedRenderedImage.write(DeepCopiedRenderedImage.java:453) at org.apache.sedona.common.raster.serde.Serde$SerializableState.write(Serde.java:125) at org.apache.sedona.common.raster.serde.Serde.serialize(Serde.java:173) at org.apache.spark.sql.sedona_sql.expressions.raster.implicits$RasterEnhancer.serialize(implicits.scala:46) at org.apache.spark.sql.sedona_sql.expressions.InferrableRasterTypes$.rasterSerializer(InferrableRasterTypes.scala:47) at org.apache.spark.sql.sedona_sql.expressions.InferredRasterExpression$.$anonfun$rasterSerializer$1(InferredRasterExpression.scala:54) at org.apache.spark.sql.sedona_sql.expressions.InferredExpression.eval(InferredExpression.scala:107) ... 19 more

I have tried the following:

  • Same code with another file --> no error thrown
  • Opening the file with QGIS --> the layer loads successfully
  • Running on a cluster with more memory --> same error
  • Same code in Python --> a different error is thrown:

`2025-01-29T11:28:27,041 [Thread-42/107] [DEBUG] com.amazonaws.emr.recordserver.connector.spark.sql.SparkPlanValidator - plan is Project [metadata#31, raster#27, point#32, org.apache.spark.sql.sedona_sql.expressions.raster.RS_Contains AS rs_contains(raster, point)#36]+- Project [raster#27, rs_metadata(raster#27) AS metadata#31, org.apache.spark.sql.sedona_sql.expressions.ST_Point AS point#32] +- Project [ org.apache.spark.sql.sedona_sql.expressions.raster.RS_FromGeoTiff AS raster#27, path#19] +- Relation [path#19,modificationTime#20,length#21L,content#22] binaryFile

2025-01-29T11:28:27,051 [Thread-11/37] [ERROR] dataproc.Main - Exception: [NOT_INT] Argument n should be an int, got bool. `

Could you please help me address this issue? Thank you in advance.

fmendezlopez avatar Jan 29 '25 11:01 fmendezlopez

Thank you for your interest in Apache Sedona! We appreciate you opening your first issue. Contributions like yours help make Apache Sedona better.

github-actions[bot] avatar Jan 29 '25 11:01 github-actions[bot]

I see java.lang.IllegalArgumentException being thrown by sun.misc.Unsafe.copyMemory(Native Method). The only reason I can think of is that the uncompressed pixel data of the raster is larger than 4 GB. Sedona cannot serialize and transfer rasters that large. You can try using RS_TileExplode to subdivide the raster into smaller tiles and perform tile-wise operations. This may get rid of the error, but it will still be quite memory- and time-consuming.
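
As a rough sanity check on the size hypothesis, the uncompressed pixel footprint can be estimated from the raster's dimensions. This is a minimal sketch: the function name and the example dimensions are hypothetical and not taken from the file in this issue. The 4-byte sample size is an assumption based on the writeFloatArrays frame in the stack trace.

```python
def uncompressed_pixel_bytes(width, height, bands, bytes_per_sample):
    """Estimate the size of a raster's uncompressed pixel data, in bytes."""
    return width * height * bands * bytes_per_sample


FOUR_GB = 4 * 1024 ** 3

# Hypothetical example: a single-band float32 raster of 50000 x 30000 pixels.
# Replace these numbers with the real ones (e.g. as reported by gdalinfo).
size = uncompressed_pixel_bytes(50000, 30000, bands=1, bytes_per_sample=4)
print(f"{size} bytes, larger than 4 GB: {size > FOUR_GB}")
```

If the estimate for the real file comes out well below 4 GB, the size explanation is probably not the right one.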

Kontinuation avatar Feb 04 '25 09:02 Kontinuation

Hello,

We have tried the following code:

df_floods_tile = sedona.sql(f"SELECT RS_TileExplode(content, 2, 2) FROM floods_tif")
df_floods_tile = sedona.sql(f"SELECT RS_TileExplode(content, 100, 100) FROM floods_tif")
df_floods_tile = sedona.sql(f"SELECT RS_TileExplode(content, 10, 10) FROM floods_tif")

and now the error thrown is this:

`An error was encountered: An error occurred while calling o232.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 10) (ip-10-60-253-102.eu-south-2.compute.internal executor 2): java.lang.IllegalArgumentException: Unsupported raster type: 73 at org.apache.sedona.common.raster.serde.Serde.deserialize(Serde.java:184) at org.apache.spark.sql.sedona_sql.expressions.raster.implicits$RasterInputExpressionEnhancer.toRaster(implicits.scala:38) at org.apache.spark.sql.sedona_sql.expressions.raster.RS_TileExplode.eval(RasterConstructors.scala:107) at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:224) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:959) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:407) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92) at 
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:141) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750)

Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2974) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2910) at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2909) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.fore`

Is there any way to read the file correctly by tuning the arguments passed to RS_TileExplode? If not, is there some other way to handle this file with Sedona?

Thank you.

fmendezlopez avatar Feb 10 '25 11:02 fmendezlopez

The binary content needs to be loaded by RS_FromGeoTiff before being processed by RS_TileExplode. The intermediate raster object loaded by RS_FromGeoTiff will be tiled directly without being serialized/deserialized:

df_floods_tile = sedona.sql(f"SELECT RS_TileExplode(RS_FromGeoTiff(content), 100, 100) FROM floods_tif")

If this still does not work, consider subdividing the GeoTIFF file with gdal_retile before loading it into Sedona.
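
A minimal sketch of how gdal_retile could be driven from Python, assuming GDAL's gdal_retile.py script is installed and on the PATH. The helper names, file names, and the 1024-pixel tile size are placeholders chosen for illustration, not recommendations.

```python
import subprocess


def gdal_retile_command(src, target_dir, tile_width=1024, tile_height=1024):
    """Build the argument list for a gdal_retile.py invocation."""
    return [
        "gdal_retile.py",
        "-ps", str(tile_width), str(tile_height),  # tile size in pixels
        "-targetDir", target_dir,                  # output directory for the tiles
        src,
    ]


def retile(src, target_dir, **kwargs):
    """Run gdal_retile.py; raises CalledProcessError if the tool fails."""
    subprocess.run(gdal_retile_command(src, target_dir, **kwargs), check=True)


# Placeholder paths; this only prints the command and does not run it.
print(" ".join(gdal_retile_command("input.tif", "tiles")))
```

The resulting tiles can then be read with the same binaryFile + RS_FromGeoTiff pattern, one row per tile file. To be safe, create the target directory before running the command.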

Kontinuation avatar Feb 11 '25 02:02 Kontinuation

How am I supposed to call RS_FromGeoTiff first if that call throws the error I reported at the beginning?

fmendezlopez avatar Feb 11 '25 08:02 fmendezlopez

Actually, nesting RS_FromGeoTiff inside another RS function call changes its behavior. Passing raster objects between Sedona function calls does not require serializing the entire raster value. This is handled by SerdeAware.evalWithoutSerialization.

For instance, the following code could produce a DataFrame of small tiles from a large raster that cannot be serialized as a whole:

(sedona.read
  .format("binaryFile")
  .load(filePath)
  .selectExpr("RS_TileExplode(RS_FromGeoTiff(content), 100, 100) as (x, y, tile)", "path")
  .selectExpr("tile", "RS_Metadata(tile) as metadata")
  .show(10))
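
When picking the tile size, it helps to estimate how many rows the explode will produce. The sketch below assumes the raster is split on a regular grid and that partial tiles at the right and bottom edges count as tiles. The function name and the raster dimensions are hypothetical.

```python
def tile_count(raster_width, raster_height, tile_width, tile_height):
    """Number of tiles on a regular grid, counting partial edge tiles."""
    cols = -(-raster_width // tile_width)    # integer ceiling division
    rows = -(-raster_height // tile_height)
    return cols * rows


# Hypothetical 50000 x 30000 raster with a few candidate tile sizes.
for tile_size in (2, 100, 1024):
    print(tile_size, tile_count(50000, 30000, tile_size, tile_size))
```

Under these assumptions, very small tiles such as 2 x 2 produce hundreds of millions of rows, so a tile size in the hundreds of pixels or larger is usually easier to work with.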

Kontinuation avatar Feb 15 '25 08:02 Kontinuation

With exactly that code I still get the following error (which is slightly different from the previous one):

org.apache.spark.sql.sedona_sql.expressions.InferredExpressionException: Exception occurred while evaluating expression RS_FromGeoTiff - inputs: [[B@1d45d5d0], cause: I/O error reading image metadata! at org.apache.spark.sql.sedona_sql.expressions.InferredExpression$.throwExpressionInferenceException(InferredExpression.scala:149) at org.apache.spark.sql.sedona_sql.expressions.InferredExpression.evalWithoutSerialization(InferredExpression.scala:127) at org.apache.spark.sql.sedona_sql.expressions.raster.implicits$RasterInputExpressionEnhancer.toRaster(implicits.scala:34) at org.apache.spark.sql.sedona_sql.expressions.raster.RS_TileExplode.eval(RasterConstructors.scala:107) at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108) at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:222) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:407) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:888) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:888) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364) at org.apache.spark.rdd.RDD.iterator(RDD.scala:328) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92) at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161) at org.apache.spark.scheduler.Task.run(Task.scala:141) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: org.geotools.data.DataSourceException: I/O error reading image metadata! at org.geotools.gce.geotiff.GeoTiffReader.<init>(GeoTiffReader.java:288) at org.apache.sedona.common.raster.RasterConstructors.fromGeoTiff(RasterConstructors.java:74) at org.apache.spark.sql.sedona_sql.expressions.raster.RS_FromGeoTiff$$anonfun$$lessinit$greater$6.apply(RasterConstructors.scala:55) at org.apache.spark.sql.sedona_sql.expressions.raster.RS_FromGeoTiff$$anonfun$$lessinit$greater$6.apply(RasterConstructors.scala:55) at org.apache.spark.sql.sedona_sql.expressions.InferrableFunctionConverter$.$anonfun$inferrableFunction1$2(InferrableFunctionConverter.scala:39) at org.apache.spark.sql.sedona_sql.expressions.InferredExpression.evalWithoutSerialization(InferredExpression.scala:121) ... 23 more Caused by: org.geotools.data.DataSourceException: I/O error reading image metadata! at org.geotools.gce.geotiff.GeoTiffReader.getHRInfo(GeoTiffReader.java:584) at org.geotools.gce.geotiff.GeoTiffReader.<init>(GeoTiffReader.java:274) ... 28 more Caused by: javax.imageio.IIOException: I/O error reading image metadata! at it.geosolutions.imageioimpl.plugins.tiff.TIFFImageReader.readMetadata(TIFFImageReader.java:887) at it.geosolutions.imageioimpl.plugins.tiff.TIFFImageReader.seekToImage(TIFFImageReader.java:834) at it.geosolutions.imageioimpl.plugins.tiff.TIFFImageReader.getImageMetadata(TIFFImageReader.java:1446) at org.geotools.gce.geotiff.GeoTiffReader.getHRInfo(GeoTiffReader.java:340) ... 
29 more Caused by: java.io.EOFException at javax.imageio.stream.ImageInputStreamImpl.readShort(ImageInputStreamImpl.java:229) at javax.imageio.stream.ImageInputStreamImpl.readUnsignedShort(ImageInputStreamImpl.java:242) at it.geosolutions.imageioimpl.plugins.tiff.TIFFIFD.initialize(TIFFIFD.java:237) at it.geosolutions.imageioimpl.plugins.tiff.TIFFImageMetadata.initializeFromStream(TIFFImageMetadata.java:148) at it.geosolutions.imageioimpl.plugins.tiff.TIFFImageReader.readMetadata(TIFFImageReader.java:881) ... 32 more

fmendezlopez avatar Feb 21 '25 08:02 fmendezlopez