
Error reading some raster files using mos.read. Size issue?

Open JimShady opened this issue 10 months ago • 9 comments

Hello.

The second file (87 MB) reads fine. The first (7.9 GB) does not.

I recall there was an issue with reading files larger than 2GB, but I thought that this had been resolved with Mosaic 0.4. So is it something else?

image

JimShady avatar Apr 05 '24 12:04 JimShady

Actually, looking at the release notes, maybe the change did not make it into 0.4?

https://github.com/databrickslabs/mosaic/releases/tag/v_0.4.1

JimShady avatar Apr 05 '24 12:04 JimShady

@milos-colic will have an authoritative answer here, but I think you'll need to use the 'retile_on_read' strategy for reading large rasters since there's no way around the 2GB limit on each row object in Spark.

raster_df = (
  spark.read
  .format("gdal")
  .option("raster.read.strategy", "retile_on_read") # sets the reader strategy
  .option("sizeInMB", "42") # sets the upper bound for size of raster in each row in the output dataframe
  .load("/path/to/file")
)
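
To get a feel for what a given "sizeInMB" implies, here is a rough back-of-the-envelope sketch. It is not Mosaic's actual retiling logic, just arithmetic on the uncompressed pixel size. The function name and the assumption of an even square split are hypothetical and only for illustration.

```python
import math

def estimate_tile_grid(width, height, bands, bytes_per_pixel, max_tile_mb):
    """Find the smallest n such that an n x n split keeps every tile under
    max_tile_mb of uncompressed pixel data.

    Returns (n, tile_width, tile_height). Hypothetical helper, not a Mosaic API.
    """
    max_bytes = max_tile_mb * 1024 * 1024
    if bands * bytes_per_pixel > max_bytes:
        raise ValueError("a single pixel already exceeds the size bound")
    splits = 1
    while True:
        # Edge tiles may be smaller, so the ceiling gives the largest tile.
        tile_w = math.ceil(width / splits)
        tile_h = math.ceil(height / splits)
        if tile_w * tile_h * bands * bytes_per_pixel <= max_bytes:
            return splits, tile_w, tile_h
        splits += 1

# Example with made-up dimensions: a single-band float64 raster.
n, tile_w, tile_h = estimate_tile_grid(40000, 20000, 1, 8, 42)
print(f"{n} x {n} tiles of at most {tile_w} x {tile_h} pixels")
```

A compressed GeoTIFF on disk can be far smaller than its uncompressed pixel data, so the number of tiles may be much larger than the file size alone suggests.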

sllynn avatar Apr 10 '24 05:04 sllynn

I didn't realise this was available in the options. I'll try it out and get back to you. By the way, I think it would be good to call this out explicitly in the documentation. Thanks.

JimShady avatar Apr 10 '24 06:04 JimShady

Agreed. Hope it helps you make progress.

sllynn avatar Apr 10 '24 07:04 sllynn

Hi @sllynn. No luck, unfortunately. I'm just trying to turn a raster into an H3 table. This is my code:

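import mosaic as mos
from pyspark.sql import functions as F

# Assumes Mosaic and its GDAL support are already enabled on the cluster.
# No variable is assigned here because .write.parquet() returns None.
(
  spark.read
  .format("gdal")
  .option("raster.read.strategy", "retile_on_read")  # split the raster into tiles while reading
  .option("sizeInMB", "42")  # upper bound on the raster size held in each row
  .load("dbfs:/ghsl/GHS_POP_E2020_GLOBE_R2023A_54009_100_V1_0.tif")
  # average the pixel values per H3 cell at resolution 9
  .select(mos.rst_rastertogridavg('tile', F.lit(9)).alias("result"))
  .select(F.explode('result'))  # one row per band
  .select(F.explode('col').alias('my_array'))  # one row per H3 cell
  .select(F.col("my_array.cellID").alias("cellID"), F.col("my_array.measure").alias("measure"))
  .write.parquet("dbfs:/ghsl/h3/")
)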

Error is below:

image

Could it be because my raster is in CRS 54009 rather than WGS84?

The file is available here if you/anyone wants to try to debug:

https://jeodpp.jrc.ec.europa.eu/ftp/jrc-opendata/GHSL/GHS_POP_GLOBE_R2023A/GHS_POP_E2020_GLOBE_R2023A_54009_100/V1-0/GHS_POP_E2020_GLOBE_R2023A_54009_100_V1_0.zip

I will try to complete the process using a WGS84 version of the file in the meantime ...

JimShady avatar Apr 11 '24 19:04 JimShady

Failed again on the WGS84 version of the file.

image

JimShady avatar Apr 12 '24 08:04 JimShady

Just wanted to add that the 'retile_on_read' option does work. It is the next stage of my code (converting to H3) that causes the crash.

I should add that the retile on read is very slow. I find myself wondering why it physically re-writes the raster out as smaller files. Why not just leverage VRTs?
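
To illustrate what I mean by VRTs: a VRT is just a small XML file that points at a window of the source raster, so no pixels are copied. Below is a minimal sketch that builds one by hand with the Python standard library. The helper name and its parameters are made up for illustration, the SRS element is left out for brevity, and this says nothing about how Mosaic works internally. In practice I would expect gdal.Translate with format="VRT" and srcWin to produce something similar.

```python
import xml.etree.ElementTree as ET

def window_vrt(src_path, geotransform, x_off, y_off, x_size, y_size,
               data_type="Float64", band=1):
    """Return VRT XML describing one window of a single-band source raster.

    geotransform is the source's six-value GDAL geotransform.
    Hypothetical helper; the SRS element is omitted for brevity.
    """
    gt = geotransform
    # Shift the origin to the top-left corner of the window.
    origin_x = gt[0] + x_off * gt[1] + y_off * gt[2]
    origin_y = gt[3] + x_off * gt[4] + y_off * gt[5]
    win_gt = (origin_x, gt[1], gt[2], origin_y, gt[4], gt[5])

    root = ET.Element("VRTDataset", rasterXSize=str(x_size), rasterYSize=str(y_size))
    ET.SubElement(root, "GeoTransform").text = ", ".join(repr(float(v)) for v in win_gt)
    vrt_band = ET.SubElement(root, "VRTRasterBand", dataType=data_type, band=str(band))
    source = ET.SubElement(vrt_band, "SimpleSource")
    ET.SubElement(source, "SourceFilename", relativeToVRT="0").text = src_path
    ET.SubElement(source, "SourceBand").text = str(band)
    size = {"xSize": str(x_size), "ySize": str(y_size)}
    # SrcRect selects the window in the source; DstRect places it in the VRT.
    ET.SubElement(source, "SrcRect", xOff=str(x_off), yOff=str(y_off), **size)
    ET.SubElement(source, "DstRect", xOff="0", yOff="0", **size)
    return ET.tostring(root, encoding="unicode")

# Example: a 512 x 512 window of a 100 m grid (placeholder path and geotransform).
print(window_vrt("/path/to/file.tif", (0.0, 100.0, 0.0, 0.0, 0.0, -100.0), 10, 20, 512, 512))
```

Each tile would then be a few hundred bytes of XML instead of a rewritten GeoTIFF, at the cost of every reader needing access to the original file.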

JimShady avatar Apr 22 '24 13:04 JimShady

@JimShady we are working on "raster_to_grid" in #556, which also touches retiling. It will come with 0.4.2 in about a week.

mjohns-databricks avatar May 04 '24 15:05 mjohns-databricks

We got 0.4.2 out, but it didn't include the raster_to_grid and similar work involving tessellate performance. We had to streamline the release because of a dependency issue that arose from the latest geopandas (see docs). So, 0.4.3 is coming soon with more of the "in-flight" work.

mjohns-databricks avatar May 15 '24 18:05 mjohns-databricks