mosaic
mosaic copied to clipboard
`raster_to_grid` not working when using `retile` with `GeoTIFF` file
- DBR 13.3 LTS ML (includes Apache Spark 3.4.1, Scala 2.12)
- Standard_DS13_v2 (driver and 2 workers)
- Photon disabled
- databricks-mosaic 0.4.1
- GDAL init script installed in cluster
- Dataset is a GeoTIFF relatively large raster file, it has one layer. Each pixel in uint16 dtype represents a category that describes the land use or land cover (forest, industrial, agricultural, and so on). Values out of bounds are masked (represented by the last integer 2^16 - 1)
- Dataset source: https://geo2.dgterritorio.gov.pt/cosc/COSc2023.zip
- Use case: need to efficiently read and process raster file represented in a projected coordinate system, convert to grid index to intersect with points represented in a geographic coordinate system (WGS84).
- The GeoTiff file seems ok as far as I know, I've tested it with Rasterio and NumPy.
- I'm trying to automate the process with Mosaic.
This example is very slow and seems to have a lot of data skew (it gets stuck on the last task):
import mosaic as mos
mos.enable_mosaic(spark, dbutils)
mos.enable_gdal(spark)
df = mos.read().format("raster_to_grid") \
.option("resolution", "2") \
.load("dbfs:/mnt/a4dprdisdl/COSc2023_N3_v0_TM06.tif")
df.show()
When trying to use "retile" option it throws an error:
import mosaic as mos
mos.enable_mosaic(spark, dbutils)
mos.enable_gdal(spark)
df = mos.read().format("raster_to_grid") \
.option("resolution", "2") \
.option("retile", "true")\
.option("tileSize", "1000")\
.load("dbfs:/mnt/a4dprdisdl/COSc2023_N3_v0_TM06.tif")
df.show()
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 81.0 failed 4 times, most recent failure: Lost task 0.3 in stage 81.0 (TID 2434) (10.208.237.16 executor 4): com.databricks.sql.io.FileReadException: Error while reading file dbfs:/mnt/a4dprdisdl/COSc2023_N3_v0_TM06.tif.
Don't take this the wrong way, it is a pleasure to work with Shapely/Pygeos/GeoPandas and even Rasterio together with Spark and Pandas UDFs, however it is being an absolute pain navigating through Databricks-Mosaic (the same happened with Sedona and GeoSpark).
Hi @carlosg-m we are tracking the netcdf issue with "raster_to_grid" (which also gets into separate bands for any multi-band file), fix coming with [WIP] PR #556 hopefully in a week or so.
Hi @carlosg-m we are tracking the netcdf issue with "raster_to_grid" (which also gets into separate bands for any multi-band file), fix coming with [WIP] PR #556 hopefully in a week or so.
Thank you for the response, @mjohns-databricks. Are there any best practices to make the operation I described efficient, using the current version of Mosaic?
The workaround is to generate a table of "reading windows or bounding boxes" and with a UDF go through each one in parallel, loading each window with Rasterio and converting it to a "grid index".