mosaic
Unable to use vector format reader for Azure Blob Storage path (abfss)
Describe the bug
When following these instructions to read vector formats, I can only read data from dbfs locations such as tmp. It's not possible to load directly from abfss:// paths. I'm sure the connection between the cluster and Azure Blob Storage works, since I'm already able to load other data formats, such as JSON or XML.
To Reproduce
Try to load .gpkg or another vector format using these instructions:
df = mos.read().format("multi_read_ogr")\
.option("driverName", "gpkg")\
.option("layerName", "buurten")\
.option("asWKB", "false")\
.load(<abfss_path>)
Expected behavior
I would expect to be able to load from abfss://, as is possible with other data sources such as JSON, XML, etc.
Screenshots
Error when loading from abfss
No error after copying to dbfs/tmp and loading from this location
Additional context
The same issue persists when using the spark.read.format("ogr") option instead of multi_read_ogr.
I also experience this with S3, where I am unable to read with ogr.
I am wondering whether the problem is that cloud object storage does not support random access, which GeoPackage relies on [from here]?
Does GeoPackage replace Shapefile? A It could but it doesn’t have to. If all you need is simple exchange and display then GeoPackage may be overkill and something like GeoJSON may be more appropriate. If you are looking for database capabilities like random access and querying then GeoPackage is a platform-independent, vendor-independent choice. GeoPackage was carefully designed this way to facilitate widespread adoption and use of a single simple file format by both commercial and open-source software applications — on enterprise production platforms as well as mobile hand-held devices.
If random access is the hangup with using GDAL + DBFS, then you might want to use a UDF pattern that copies the file to local SSD and then performs operations on it, e.g. using geopandas. Reading gpkg files in parallel with a UDF is a reasonable choice, unless the gpkg files themselves are excessively large. We are still looking into this, so if the issue turns out to be something other than random access, we will comment further.
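A minimal sketch of that copy-to-local pattern, in case it helps. It assumes the source file is reachable through a filesystem path on the workers (for example a /dbfs/... FUSE path) and that geopandas is installed on the cluster; `stage_locally` and `read_layer_as_wkb` are hypothetical helper names, not part of Mosaic.

```python
import os
import shutil
import tempfile
from typing import List, Optional, Tuple


def stage_locally(src_path: str, local_dir: Optional[str] = None) -> str:
    """Copy a file from a mounted path (assumed, e.g. /dbfs/...) to local disk."""
    local_dir = local_dir or tempfile.mkdtemp(prefix="gpkg_")
    dst = os.path.join(local_dir, os.path.basename(src_path))
    shutil.copyfile(src_path, dst)
    return dst


def read_layer_as_wkb(src_path: str, layer: str) -> List[Tuple[str, bytes]]:
    """Read one layer from a locally staged copy and return (path, WKB) pairs."""
    import geopandas as gpd  # lazy import: only needed where the read runs

    local_path = stage_locally(src_path)
    try:
        # The local copy supports the random access that GeoPackage (SQLite) needs.
        gdf = gpd.read_file(local_path, layer=layer)
    finally:
        # Remove the temporary directory created by stage_locally.
        shutil.rmtree(os.path.dirname(local_path), ignore_errors=True)
    # Skip features without a geometry; .wkb is the shapely WKB encoding.
    return [(src_path, geom.wkb) for geom in gdf.geometry if geom is not None]
```

One way to parallelize this would be to distribute a list of file paths, for instance with `spark.sparkContext.parallelize(paths, len(paths)).flatMap(lambda p: read_layer_as_wkb(p, "buurten"))`, so that each task stages and reads one file. Each file is read whole on a single worker, which is why very large gpkg files are a poor fit for this approach.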
Mosaic's GDAL integration implements the VSI interface for DBFS (and in-memory). This will carry over to Volumes as well once Mosaic support for DBR 13.3 is delivered. Implementing VSI for S3, ADLS Gen2, and GCS would be custom work, and we are not convinced that is the right approach, since it would bypass Unity Catalog. Regardless, the next step is to deliver Mosaic support for DBR 13.3.