
Unable to use vector format reader for Azure Blob Storage path (abfss)

casperdamen123 opened this issue 1 year ago • 3 comments

Describe the bug
When following these instructions to read vector formats, I can only read data from DBFS locations such as /tmp. It is not possible to load directly from abfss:// paths. I'm sure the connection between the cluster and Azure Blob Storage works, since I'm already able to load other data formats such as JSON or XML.

To Reproduce
Try to load a .gpkg or other vector format using these instructions:

import mosaic as mos  # after running enable_mosaic(spark, dbutils)

df = mos.read().format("multi_read_ogr")\
    .option("driverName", "gpkg")\
    .option("layerName", "buurten")\
    .option("asWKB", "false")\
    .load(<abfss_path>)

Expected behavior
I would expect to be able to load from abfss:// paths, as is possible with other data sources such as JSON, XML, etc.

Screenshots
Error when loading from abfss (screenshot).

No error after copying to dbfs/tmp and loading from this location (screenshot).

Additional context
The same issue persists when using the spark.read.format("ogr") option instead of multi_read_ogr.

casperdamen123 · Aug 16 '23

I also experience this with S3, where I am unable to read with ogr.

dwsmith1983 · Sep 05 '23

I am wondering if the problem is that cloud object storage does not support the random access that GeoPackage relies on [from here]?

Q: Does GeoPackage replace Shapefile? A: It could, but it doesn't have to. If all you need is simple exchange and display, then GeoPackage may be overkill and something like GeoJSON may be more appropriate. If you are looking for database capabilities like random access and querying, then GeoPackage is a platform-independent, vendor-independent choice. GeoPackage was carefully designed this way to facilitate widespread adoption and use of a single simple file format by both commercial and open-source software applications — on enterprise production platforms as well as mobile hand-held devices.

If random access is the hangup with using GDAL + DBFS, then you might want to use a UDF pattern that copies the file to local SSD and then performs operations on it, e.g. using geopandas. Parallel gpkg reading with a UDF is not a bad choice, unless the gpkg files themselves are excessively large. There is still some more looking to do on this, so if the issue turns out to be something other than random access, we will comment further.
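To illustrate the UDF pattern above, here is a minimal sketch of the "stage to local disk, then read" step. The helper names (`stage_locally`, `read_layer`) are hypothetical, and it assumes geopandas is installed on the cluster and that the source path is reachable via a mounted/FUSE path:

```python
import os
import shutil
import tempfile


def stage_locally(src_path):
    """Copy a file to node-local scratch space and return the local path.

    GDAL's GeoPackage driver needs random access (seeks into the SQLite
    container), which object stores generally do not provide, so the file
    is staged on local disk first.
    """
    local_dir = tempfile.mkdtemp(prefix="gpkg_stage_")
    local_path = os.path.join(local_dir, os.path.basename(src_path))
    shutil.copy(src_path, local_path)
    return local_path


def read_layer(src_path, layer):
    """Stage the file locally, then read one layer with geopandas."""
    import geopandas as gpd  # assumed available on the cluster

    return gpd.read_file(stage_locally(src_path), layer=layer)
```

In practice you would call `read_layer` from inside a pandas UDF or `mapInPandas` over a DataFrame of file paths, so each executor stages and reads its own files in parallel.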

mjohns-databricks · Sep 05 '23

Mosaic's GDAL integration has a VSI interface implemented for DBFS (and in-memory). This will transfer to Volumes as well once Mosaic support for DBR 13.3 is delivered. It is custom work to implement VSI for S3, ADLS Gen2, and GCS. We are not convinced that is the right approach, as it then skips Unity Catalog. Regardless, the next step is to deliver Mosaic support for DBR 13.3.

mjohns-databricks · Sep 28 '23