`raster_to_grid` not working when using `retile` with `GeoTIFF` file

Open carlosg-m opened this issue 9 months ago • 2 comments

  • DBR 13.3 LTS ML (includes Apache Spark 3.4.1, Scala 2.12)
  • Standard_DS13_v2 (driver and 2 workers)
  • Photon disabled
  • databricks-mosaic 0.4.1
  • GDAL init script installed in cluster
  • The dataset is a relatively large GeoTIFF raster file with a single layer. Each pixel is a uint16 value encoding a land-use or land-cover category (forest, industrial, agricultural, and so on). Out-of-bounds values are masked using the largest uint16 value (2^16 - 1 = 65535).
  • Dataset source: https://geo2.dgterritorio.gov.pt/cosc/COSc2023.zip
  • Use case: I need to efficiently read and process a raster file stored in a projected coordinate system and convert it to a grid index, so that it can be intersected with points stored in a geographic coordinate system (WGS84).
  • As far as I know the GeoTIFF file is fine; I've tested it with Rasterio and NumPy.
  • I'm trying to automate the process with Mosaic.

This example is very slow and seems to have a lot of data skew (it gets stuck on the last task):

import mosaic as mos
mos.enable_mosaic(spark, dbutils)
mos.enable_gdal(spark)

df = mos.read().format("raster_to_grid") \
        .option("resolution", "2") \
        .load("dbfs:/mnt/a4dprdisdl/COSc2023_N3_v0_TM06.tif")
df.show()

When I enable the "retile" option, it throws an error:

import mosaic as mos
mos.enable_mosaic(spark, dbutils)
mos.enable_gdal(spark)

df = mos.read().format("raster_to_grid") \
        .option("resolution", "2") \
        .option("retile", "true") \
        .option("tileSize", "1000") \
        .load("dbfs:/mnt/a4dprdisdl/COSc2023_N3_v0_TM06.tif")
df.show()

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 81.0 failed 4 times, most recent failure: Lost task 0.3 in stage 81.0 (TID 2434) (10.208.237.16 executor 4): com.databricks.sql.io.FileReadException: Error while reading file dbfs:/mnt/a4dprdisdl/COSc2023_N3_v0_TM06.tif.

Don't take this the wrong way: it is a pleasure to work with Shapely/Pygeos/GeoPandas, and even Rasterio, together with Spark and Pandas UDFs. However, navigating Databricks-Mosaic has been an absolute pain (the same happened with Sedona and GeoSpark).

carlosg-m avatar May 03 '24 15:05 carlosg-m

Hi @carlosg-m, we are tracking the NetCDF issue with "raster_to_grid" (which also gets into separate bands for any multi-band file). A fix is coming with [WIP] PR #556, hopefully in a week or so.

mjohns-databricks avatar May 04 '24 15:05 mjohns-databricks

Thank you for the response, @mjohns-databricks. Are there any best practices to make the operation I described efficient, using the current version of Mosaic?

The workaround is to generate a table of reading windows (bounding boxes) and process them in parallel with a UDF, loading each window with Rasterio and converting it to a grid index.
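
A minimal sketch of the window-generation part of that workaround, in case it helps anyone reading along. The names `ReadWindow`, `make_windows` and `read_window` are hypothetical, not part of Mosaic or Rasterio, and the sketch assumes the file path is one Rasterio can open from the executors (for example a local or FUSE-mounted path rather than a `dbfs:/` URI). The conversion of each window to a grid index is left out.

```python
from typing import Iterator, NamedTuple


class ReadWindow(NamedTuple):
    """Pixel-space window: offsets and size, all in pixels (hypothetical helper type)."""
    col_off: int
    row_off: int
    width: int
    height: int


def make_windows(raster_width: int, raster_height: int, tile_size: int) -> Iterator[ReadWindow]:
    """Yield non-overlapping windows covering the raster; edge windows are clipped."""
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    for row_off in range(0, raster_height, tile_size):
        for col_off in range(0, raster_width, tile_size):
            yield ReadWindow(
                col_off,
                row_off,
                min(tile_size, raster_width - col_off),
                min(tile_size, raster_height - row_off),
            )


def read_window(path: str, win: ReadWindow):
    """Read band 1 of one window; intended to be called from a UDF on an executor."""
    # Imported lazily so only the code path that reads pixels needs Rasterio.
    import rasterio
    from rasterio.windows import Window

    window = Window(win.col_off, win.row_off, win.width, win.height)
    with rasterio.open(path) as src:
        # masked=True returns a masked array that hides the nodata pixels.
        pixels = src.read(1, window=window, masked=True)
        # Affine transform mapping this window's pixel indices to CRS coordinates.
        transform = src.window_transform(window)
    return pixels, transform
```

The list produced by `make_windows` could then be turned into a DataFrame and repartitioned so that each task handles a handful of windows, which should at least spread the work more evenly than one task per file.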

carlosg-m avatar May 06 '24 00:05 carlosg-m