Writing to multiple GeoParquet files will not output _metadata
Expected behavior
When writing out a GeoParquet DataFrame that results in multiple files, the _metadata summary file is not created, even though Spark is configured to write it.
import sedona
from sedona.spark import *

# `spark` is the existing SparkSession (predefined in Databricks notebooks).
sedona = SedonaContext.create(spark)
print("spark version: {}".format(spark.version))
print("sedona version: {}".format(sedona.version))

# Request the _metadata and _common_metadata summary files.
spark.conf.set("parquet.summary.metadata.level", "ALL")

def write_geoparquet(df, path):
    df.write.format("geoparquet") \
        .option("geoparquet.version", "1.0.0") \
        .option("geoparquet.crs", "") \
        .option("compression", "zstd") \
        .option("parquet.block.size", 16 * 1024 * 1024) \
        .option("maxRecordsPerFile", 10000000) \
        .mode("overwrite").save(path)

# input_path and output_path are defined elsewhere.
df = sedona.read.format("geoparquet").option("mergeSchema", "true").load(input_path)
write_geoparquet(df, output_path)
If the number of records exceeds maxRecordsPerFile so that more than one file is written, the _metadata and _common_metadata files are not written. When there are few enough records that only one file is written, _metadata and _common_metadata are created.
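To confirm which summary files a given write produced, a small check along these lines may help. It is only a sketch: `summary_file_status` is a hypothetical helper name, and it assumes the output directory is reachable as a local path (for example through a mounted filesystem), which may not hold for every Databricks setup.

```python
from pathlib import Path

# The two Parquet summary files discussed above.
SUMMARY_FILES = ("_metadata", "_common_metadata")

def summary_file_status(output_dir):
    """Report whether each summary file exists in a local output directory."""
    root = Path(output_dir)
    return {name: (root / name).is_file() for name in SUMMARY_FILES}
```

Calling `summary_file_status(output_path)` after each write returns a dict keyed by file name, which makes the single-file and multi-file cases easy to compare.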
However, if I change the above to write parquet instead of geoparquet:
def write_parquet(df, path):
    df.write.format("parquet") \
        .option("compression", "zstd") \
        .option("parquet.block.size", 16 * 1024 * 1024) \
        .option("maxRecordsPerFile", 10000000) \
        .mode("overwrite").save(path)

write_parquet(df, output_path)
Then _metadata and _common_metadata will be written even with multiple files. Is there a setting or other way to enable writing the common metadata files?
I'd like to write these files so that readers such as pyarrow can load the full dataset without having to scan every file, which can be time-consuming for large datasets.
Settings
Sedona version = 3.4.1
Apache Spark version = 3.4.1
Environment = Databricks