Error reading some raster files using mos.read. Size issue?
Hello.
The second file (87 MB) works. The first (7.9 GB) does not.
I recall there was an issue with reading files larger than 2GB, but I thought that this had been resolved with Mosaic 0.4. So is it something else?
Actually, looking at the release notes, maybe the change did not make it into 0.4?
https://github.com/databrickslabs/mosaic/releases/tag/v_0.4.1
@milos-colic will have an authoritative answer here, but I think you'll need to use the 'retile_on_read' strategy for reading large rasters since there's no way around the 2GB limit on each row object in Spark.
raster_df = (
    spark.read
    .format("gdal")
    .option("raster.read.strategy", "retile_on_read")  # sets the reader strategy
    .option("sizeInMB", "42")  # sets the upper bound for size of raster in each row in the output dataframe
    .load("/path/to/file")
)
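To get a feel for what a per-row size bound implies, here is a rough back-of-the-envelope sketch. It is not Mosaic's actual retiling logic; it simply assumes square, uncompressed tiles, and the function name `tile_grid` and the raster dimensions in the usage lines are hypothetical.

```python
import math


def tile_grid(width_px, height_px, bands, bytes_per_pixel, size_in_mb):
    """Estimate square tile side and tile counts that fit a per-row size budget.

    Assumption: tiles are square and stored uncompressed, which is a
    simplification and not necessarily how Mosaic retiles.
    """
    budget_bytes = size_in_mb * 1024 * 1024
    # Largest number of pixels whose raw payload fits in the budget.
    max_pixels = budget_bytes // (bands * bytes_per_pixel)
    side = max(1, math.isqrt(max_pixels))
    # No point in a tile larger than the raster itself.
    side = min(side, max(width_px, height_px))
    tiles_across = math.ceil(width_px / side)
    tiles_down = math.ceil(height_px / side)
    return side, tiles_across, tiles_down


# Hypothetical single-band float32 raster of 100000 x 50000 pixels, 42 MB budget.
side, across, down = tile_grid(100_000, 50_000, bands=1, bytes_per_pixel=4, size_in_mb=42)
print(f"tile side: {side} px, grid: {across} x {down} = {across * down} tiles")
```

The point is only that a multi-gigabyte raster becomes a few hundred rows at this budget, each comfortably under the 2GB row limit.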
I didn't realise this was available in the options. I'll try it out and get back to you. By the way, I think it would be good to call this out explicitly in the documentation. Thanks.
Agreed. Hope it helps you make progress.
Hi @sllynn. No luck unfortunately. I'm just trying to turn a raster into an H3 table. This is my code:
raster_df = (
    spark.read
    .format("gdal")
    .option("raster.read.strategy", "retile_on_read")
    .option("sizeInMB", "42")
    .load("dbfs:/ghsl/GHS_POP_E2020_GLOBE_R2023A_54009_100_V1_0.tif")
    .select(mos.rst_rastertogridavg('tile', F.lit(9)).alias("result"))
    .select(F.explode('result'))
    .select(F.explode('col').alias('my_array'))
    .select(F.col("my_array.cellID").alias("cellID"), F.col("my_array.measure").alias("measure"))
    .write.parquet("dbfs:/ghsl/h3/")
)
Error is below:
Could it be because my raster is in CRS 54009 rather than WGS84?
The file is available here if you/anyone wants to try to debug:
https://jeodpp.jrc.ec.europa.eu/ftp/jrc-opendata/GHSL/GHS_POP_GLOBE_R2023A/GHS_POP_E2020_GLOBE_R2023A_54009_100/V1-0/GHS_POP_E2020_GLOBE_R2023A_54009_100_V1_0.zip
I will try to complete the process using a WGS84 version of the file in the meantime ...
Failed again on the WGS84 version of the file.
Just wanted to add that the 'retile_on_read' option does work. It is the next stage of my code (converting to H3) that causes the crash.
I should add that retile on read is very slow. I find myself wondering why it physically writes out smaller files. Why not just leverage VRTs?
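To illustrate what I mean by VRTs: a VRT is a small XML file that points at a window of the source raster, so no pixels are copied. Below is a minimal sketch that builds such a file with the standard library only. Whether Mosaic's reader could consume these is not something I have checked; the helper name `window_vrt`, the source path, the geotransform values and the `Float32` data type are all assumptions for illustration, and the SRS element is left out, so a real tile would also need the CRS carried over.

```python
import xml.etree.ElementTree as ET


def window_vrt(source_path, geotransform, x_off, y_off, x_size, y_size,
               data_type="Float32", band=1):
    """Build VRT XML describing one window of a source raster (no pixel copy).

    geotransform is the usual 6-tuple in GDAL order:
    (origin_x, pixel_width, row_rotation, origin_y, col_rotation, pixel_height).
    """
    gt = geotransform
    # Shift the origin to the window's top-left corner; pixel size is unchanged.
    origin_x = gt[0] + x_off * gt[1] + y_off * gt[2]
    origin_y = gt[3] + x_off * gt[4] + y_off * gt[5]
    window_gt = (origin_x, gt[1], gt[2], origin_y, gt[4], gt[5])

    root = ET.Element("VRTDataset", rasterXSize=str(x_size), rasterYSize=str(y_size))
    ET.SubElement(root, "GeoTransform").text = ", ".join(repr(float(v)) for v in window_gt)
    vrt_band = ET.SubElement(root, "VRTRasterBand", dataType=data_type, band="1")
    src = ET.SubElement(vrt_band, "SimpleSource")
    ET.SubElement(src, "SourceFilename", relativeToVRT="0").text = source_path
    ET.SubElement(src, "SourceBand").text = str(band)
    # SrcRect selects the window in the source; DstRect places it at the tile origin.
    ET.SubElement(src, "SrcRect", xOff=str(x_off), yOff=str(y_off),
                  xSize=str(x_size), ySize=str(y_size))
    ET.SubElement(src, "DstRect", xOff="0", yOff="0",
                  xSize=str(x_size), ySize=str(y_size))
    return ET.tostring(root, encoding="unicode")


# Hypothetical path and geotransform (100 m pixels, north-up).
xml_text = window_vrt("/path/to/big.tif", (0.0, 100.0, 0.0, 9000000.0, 0.0, -100.0),
                      x_off=10, y_off=20, x_size=256, y_size=256)
print(xml_text)
```

Each such file is a few hundred bytes, so tiling a large raster this way would mostly be a metadata operation rather than a rewrite of the data.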
@JimShady we are giving attention to "raster_to_grid" in #556 which gets into retile, will come with 0.4.2 in about a week.
We got 0.4.2 out, but it didn't include the raster_to_grid and similar work involving tessellate performance. We had to streamline the release because of a dependency issue that arose from the latest geopandas (see the docs). So 0.4.3 is coming soon with more of the "in-flight" work.