
Unexpected performance querying geoparquet vs parquet

Open joe-easley opened this issue 10 months ago • 6 comments

Hi

I'm trying to set up Sedona to run spatial intersection queries on a multi-file GeoParquet 1.1.0 dataset that I've generated with GeoPandas. The total dataset size is approximately 2.5 GB, split across 4 files.

dataframe.to_parquet(outfile, write_covering_bbox=True, schema_version='1.1.0')

I'm seeing some unexpected behaviour: reading the files as plain Parquet gives much better performance than reading them as GeoParquet. When I load the dataframe as Parquet and then create the geometry, my query completes in ~3-5s. When I read it natively as GeoParquet, it takes ~15s. In both cases this is the query I've run (against water_with_geom_gpq or water_with_geom_pq respectively):

SELECT * FROM water_with_geom_(g)pq where ST_Intersects(ST_GeomFromWKT('POLYGON ((411908 128831, 411927 133556, 416895 134004, 417044 128326, 411908 128831))', 27700), geometry)

My GeoParquet workflow is:

val geo_df = sedona.read.format("geoparquet").load("dbfs:/FileStore/tables/geoparquet_investigation/water")

geo_df.createOrReplaceTempView("water_with_geom_gpq")

And for Parquet:

val df = sedona.read.format("parquet").load("dbfs:/FileStore/tables/geoparquet_investigation/water")
df.createOrReplaceTempView("temp_water")
val geom_df = sedona.sql("SELECT *, ST_GeomFromWKB(geometry) as geom from temp_water")
val columnsToDrop = Seq("geometry")
val geom_df_dropped = geom_df.drop(columnsToDrop: _*)
val geometry_df = geom_df_dropped.withColumnRenamed("geom", "geometry")
geometry_df.createOrReplaceTempView("water_with_geom_pq")

My environment is a Databricks cluster (DBR 15.4, Spark 3.5, Scala 2.12) running Sedona 1.7.0.

I'd greatly appreciate it if someone could point out where I'm going wrong. I'm very new to Sedona and fairly new to all things Spark!

joe-easley avatar Mar 07 '25 09:03 joe-easley

Thank you for your interest in Apache Sedona! We appreciate you opening your first issue. Contributions like yours help make Apache Sedona better.

github-actions[bot] avatar Mar 07 '25 09:03 github-actions[bot]

@Kontinuation is the difference caused by the vectorized parquet reader?

jiayuasu avatar Mar 10 '25 16:03 jiayuasu

Thanks @jiayuasu / @Kontinuation - is a non-vectorized read the expected behaviour for the geoparquet reader? If so, are there plans to support vectorized reads?

joe-easley avatar Mar 12 '25 09:03 joe-easley

@joe-easley To further confirm this, can you try to set SparkConfig spark.sql.parquet.enableVectorizedReader to false on Databricks and measure the performance again?
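For anyone following along, a minimal sketch of that check in a notebook cell could look like the following. It assumes the `sedona` session and the `water_with_geom_gpq` / `water_with_geom_pq` views from the original post already exist; the `query` helper is only illustrative, and `count()` is used just to force the query to execute.

```scala
// Turn off Spark's vectorized Parquet reader for this session
// (assumption: the setting is changed at runtime rather than in the cluster config).
sedona.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

// Hypothetical helper: the same intersection query as in the original post,
// parameterised by the temp view to read from.
def query(view: String): String =
  s"""SELECT * FROM $view
     |WHERE ST_Intersects(
     |  ST_GeomFromWKT('POLYGON ((411908 128831, 411927 133556, 416895 134004, 417044 128326, 411908 128831))', 27700),
     |  geometry)""".stripMargin

// SparkSession.time prints the wall-clock time taken by the block.
sedona.time { sedona.sql(query("water_with_geom_gpq")).count() }
sedona.time { sedona.sql(query("water_with_geom_pq")).count() }
```

Setting the value back to "true" afterwards restores the default behaviour for the plain parquet reader.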

jiayuasu avatar Mar 12 '25 23:03 jiayuasu

@jiayuasu - I've just set that, and the read performance is now pretty much identical: ~15s for both geoparquet and parquet.

joe-easley avatar Mar 13 '25 09:03 joe-easley

@joe-easley Thanks for testing this out. We will be implementing a vectorized reader for GeoParquet soon!

jiayuasu avatar Mar 14 '25 19:03 jiayuasu