Unexpected performance querying geoparquet vs parquet
Hi
I'm trying to set up Sedona to run spatial intersection queries on a multi-file GeoParquet 1.1.0 dataset that I generated with GeoPandas. The total dataset size is approximately 2.5 GB, split across 4 files. The files were written with:
dataframe.to_parquet(outfile, write_covering_bbox=True, schema_version='1.1.0')
I'm seeing some unexpected behaviour: reading the files as plain Parquet performs much better than reading them as GeoParquet. When I load the data as Parquet and then construct the geometry column myself, my query completes in ~3-5s. When I read the same files natively as GeoParquet, it takes ~15s. In both cases this is the query I run:
SELECT * FROM water_with_geom_(g)pq where ST_Intersects(ST_GeomFromWKT('POLYGON ((411908 128831, 411927 133556, 416895 134004, 417044 128326, 411908 128831))', 27700), geometry)
My GeoParquet workflow is:
val geo_df = sedona.read.format("geoparquet").load("dbfs:/FileStore/tables/geoparquet_investigation/water")
geo_df.createOrReplaceTempView("water_with_geom_gpq")
And for Parquet:
val df = sedona.read.format("parquet").load("dbfs:/FileStore/tables/geoparquet_investigation/water")
df.createOrReplaceTempView("temp_water")
val geom_df = sedona.sql("SELECT *, ST_GeomFromWKB(geometry) as geom from temp_water")
val columnsToDrop = Seq("geometry")
val geom_df_dropped = geom_df.drop(columnsToDrop: _*)
val geometry_df = geom_df_dropped.withColumnRenamed("geom", "geometry")
geometry_df.createOrReplaceTempView("water_with_geom_pq")
My environment is a Databricks cluster (DBR 15.4, Spark 3.5, Scala 2.12) running Sedona 1.7.0.
Would greatly appreciate if someone could point out where I'm going wrong - am very new to Sedona and fairly new to all things Spark!
@Kontinuation is the difference caused by the vectorized parquet reader?
Thanks @jiayuasu / @Kontinuation - would non-vectorized read be expected behaviour for the geoparquet reader? If not, are there plans to support reading in that way?
@joe-easley To confirm this, can you set the Spark configuration spark.sql.parquet.enableVectorizedReader to false on Databricks and measure the performance of both reads again?
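For anyone repeating this check, here is a minimal sketch of how the comparison could be run from a notebook. It assumes `sedona` is the SparkSession used earlier in this thread and that both temp views are already registered. The `timeQuery` helper is hypothetical, and writing to the `noop` sink is just one way to force the whole result to be computed without collecting it to the driver.

```scala
// Assumption: `sedona` is the active SparkSession from the workflow above,
// and the views water_with_geom_gpq / water_with_geom_pq already exist.

// Turn off Spark's vectorized Parquet reader for this session.
sedona.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

// Hypothetical helper: runs the intersection query against one view
// and prints the elapsed time via SparkSession.time.
def timeQuery(view: String): Unit = {
  val query =
    s"""SELECT * FROM $view
       |WHERE ST_Intersects(
       |  ST_GeomFromWKT('POLYGON ((411908 128831, 411927 133556, 416895 134004, 417044 128326, 411908 128831))', 27700),
       |  geometry)""".stripMargin
  sedona.time {
    // The noop sink materializes every row without writing it anywhere.
    sedona.sql(query).write.format("noop").mode("overwrite").save()
  }
}

timeQuery("water_with_geom_gpq") // GeoParquet reader
timeQuery("water_with_geom_pq")  // plain Parquet reader + ST_GeomFromWKB

// Restore the default afterwards.
sedona.conf.set("spark.sql.parquet.enableVectorizedReader", "true")
```

If the two timings converge with the setting disabled, that points to the vectorized reader as the source of the gap rather than anything in the GeoParquet files themselves.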
@jiayuasu - I've just set that, and the read performance is now pretty much identical: ~15s for both GeoParquet and Parquet.
@joe-easley Thanks for testing this out. We will be implementing a vectorized reader for GeoParquet soon!