
Error reading file greater than 2GB

chitra-psg opened this issue 2 years ago · 1 comment

Expected behavior

Expecting GeoTIFF files of all sizes to be readable:

sample_raster = (
    sedona.read.format("binaryFile")
    .load(vFilePath)
    .withColumn("raster", expr("RS_FromGeoTiff(content)"))
)

sample_raster.createOrReplaceTempView("sample_raster")

Actual behavior

Error while reading file dbfs:/mnt/XYZ/sample.tif. Caused by: SparkException: The length of dbfs:/mnt/XYZ/sample.tif is 11894624583, which exceeds the max length allowed: 2147483647.
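The max length in the error message, 2147483647, is the JVM's Integer.MAX_VALUE: Spark's binaryFile source materializes each file's `content` column as a single JVM byte array, so any file larger than 2^31 − 1 bytes cannot be read this way. A quick sanity check of the numbers in the error (the file size below is taken from the message above):

```python
# Spark's binaryFile source loads each file into one JVM byte array,
# whose length is capped at Integer.MAX_VALUE (2**31 - 1) bytes.
JVM_MAX_ARRAY_LEN = 2**31 - 1  # 2147483647

file_size = 11_894_624_583  # size reported in the error, ~11 GB
print(file_size > JVM_MAX_ARRAY_LEN)                     # True: cannot fit
print(f"{file_size / JVM_MAX_ARRAY_LEN:.1f}x the limit") # roughly 5.5x over
```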

Steps to reproduce the problem

Use files of size greater than 2 GB as the source.

Settings

Sedona version = 1.5.0

chitra-psg · Dec 17 '23

@chitra-psg

There is no direct way to fix this if you use Databricks. Sedona's in-memory raster computation engine is not intended to load large GeoTIFFs into memory; it is designed to handle a massive number of small GeoTIFF images.

The correct way to handle this is to split the huge image into small GeoTIFF tiles on S3, then load those into Sedona.
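In practice a tool such as GDAL's gdal_retile.py (or rasterio windowed reads) would perform the actual pixel I/O for that splitting step. As a minimal sketch of just the tile-grid arithmetic, not of the I/O itself, the windows covering a large raster can be computed like this (tile size and raster dimensions below are illustrative assumptions):

```python
# Sketch: compute the pixel windows for splitting a large raster into
# fixed-size tiles. A tool like GDAL's gdal_retile.py would do the real
# reading/writing; this only illustrates the tile grid it walks.
from math import ceil

def tile_windows(width, height, tile_size=4096):
    """Yield (col_off, row_off, tile_w, tile_h) windows covering the raster."""
    for row in range(ceil(height / tile_size)):
        for col in range(ceil(width / tile_size)):
            col_off = col * tile_size
            row_off = row * tile_size
            yield (col_off, row_off,
                   min(tile_size, width - col_off),   # edge tiles are narrower
                   min(tile_size, height - row_off))  # and shorter

# Example: a 10000 x 6000 raster in 4096-pixel tiles -> 3 x 2 = 6 windows
windows = list(tile_windows(10000, 6000))
print(len(windows))   # 6
print(windows[-1])    # (8192, 4096, 1808, 1904)
```

Each resulting tile stays far below the 2 GB per-file limit, so the binaryFile source can load them all in parallel.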

SedonaDB from Wherobots (https://wherobots.com/) offers a raster processing mode called out-db mode (https://docs.wherobots.services/latest/references/havasu/raster/out-db-rasters/), which solves this exact problem.

df = (
    sedona.read.format("binaryFile")
    .load("s3a://XXX/*.tif")
    .drop("content")
    .withColumn("rast", expr("RS_FromPath(path)"))
)
df.selectExpr("RS_TileExplode(rast) as (x, y, rast)").show()

If you are interested, please try it on Wherobots Cloud (https://www.wherobots.services/).

jiayuasu · Dec 18 '23