duckdb_spatial

GeoParquet Support?

Open marklit opened this issue 2 years ago • 18 comments

This extension was compiled with GDAL 3.6.3, which has support for GeoParquet (it was added in 3.5.1). Any idea why it states that the format is unsupported?

$ /Volumes/Seagate/duckdb_spatial/build/debug/duckdb -unsigned test.duckdb
LOAD '/Volumes/Seagate/duckdb_spatial/build/debug/extension/spatial/spatial.duckdb_extension';
select * from st_read('/Volumes/Seagate/open5g_data/microsoft_roads_ai/Oceania_AUS.gpq') limit 1;
ERROR 4: `/Volumes/Seagate/open5g_data/microsoft_roads_ai/Oceania_AUS.gpq' not recognized as a supported file format.
Error: IO Error: Could not open file: /Volumes/Seagate/open5g_data/microsoft_roads_ai/Oceania_AUS.gpq (`/Volumes/Seagate/open5g_data/microsoft_roads_ai/Oceania_AUS.gpq' not recognized as a supported file format.)

The file itself looks fine.

$ ogrinfo microsoft_roads_ai/Oceania_AUS.gpq
INFO: Open of `microsoft_roads_ai/Oceania_AUS.gpq'
      using driver `Parquet' successful.
1: Oceania_AUS (Line String)

marklit avatar Apr 06 '23 09:04 marklit

Yeah, it is not supported yet. We already have our own Parquet reader extension in DuckDB, and we are looking into how to integrate it with this extension in a natural way. I haven't tested it, but you should maybe be able to use the parquet extension to load the gpq and simply convert the WKB binary columns into geometries using ST_GeomFromWKB.

You can see the supported drivers using SELECT * FROM st_list_drivers()

Maxxen avatar Apr 06 '23 09:04 Maxxen
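
For context on what the workaround above operates on: GeoParquet 1.0 stores each geometry as a WKB blob in an ordinary binary column, which is why a plain Parquet reader can load the file. The following is a minimal Python sketch (standard library only) of decoding one 2D WKB point. It is an illustration of the byte layout, not how ST_GeomFromWKB is implemented; the function name and the sample coordinates are hypothetical.

```python
import struct
from typing import Tuple


def decode_wkb_point(blob: bytes) -> Tuple[float, float]:
    """Decode a 2D WKB Point into (x, y)."""
    # Byte 0 is the byte-order flag: 0 = big-endian, 1 = little-endian.
    order = "<" if blob[0] == 1 else ">"
    # Bytes 1-4 hold a uint32 geometry type code; 1 means Point.
    (geom_type,) = struct.unpack_from(order + "I", blob, 1)
    if geom_type != 1:
        raise ValueError(f"expected WKB Point (type 1), got type {geom_type}")
    # A 2D point is then two float64 values: x followed by y.
    x, y = struct.unpack_from(order + "dd", blob, 5)
    return x, y


# Hypothetical sample value: a little-endian WKB point.
sample = struct.pack("<BIdd", 1, 1, 151.2, -33.9)
print(decode_wkb_point(sample))
```

Other geometry types (line strings, polygons) follow the same header and then add counts and coordinate sequences, which this sketch does not handle.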

I should explain: It is not supported because we don't bundle the Arrow library (which provides the parquet driver)

Maxxen avatar Apr 06 '23 09:04 Maxxen

I haven't tested it, but you should maybe be able to use the parquet extension to load the gpq and simply convert the WKB binary columns into geometries using ST_GeomFromWKB.

Just confirming that this works, and in fact works really well. The issue, I guess, is that while you can process your GeoParquet file, you can't save it back to GeoParquet. The good news is that some external libraries might treat it as GeoParquet anyway. You might lose some information, such as the CRS, though. I tested this in the R package geoarrow, and read_geoparquet_sf will read the file exported from DuckDB fine; it just doesn't keep the CRS.

rdenham avatar Jul 08 '23 05:07 rdenham

Just wanted to echo @rdenham's comment that it's a real shame to lose all the metadata, especially the CRS, this way.

This is actually very awkward for designing applications where we care about coordinate reference systems and can't anticipate them ahead of time. It's also very confusing to users who can't easily figure out why duckdb spatial cannot use st_read_meta on the one spatial vector format that seems 'most native' to duckdb.

Would it be possible to somehow modify the behavior of st_read_meta so that it could use GDAL for that purpose when reading a geoparquet file?

cboettig avatar Mar 22 '24 22:03 cboettig

I'll just share that native GeoParquet support is planned to be the next big feature I work on for spatial. I'm just going to wrap up some refactoring and documentation work first!

Maxxen avatar Mar 23 '24 05:03 Maxxen

@Maxxen Just to get an idea, do you have an estimated time for when GeoParquet/GeoArrow would be supported in the spatial extension?

ncclementi avatar May 07 '24 14:05 ncclementi

I currently have basic writing and reading working, with the "bbox" and "geometry_types" fields in the metadata being properly populated. CRS handling is blocked, since we can't store projection information in the geometry type itself yet, and that's a more involved feature for the future, as it is going to require a lot of changes in the DuckDB core. That said, you can (even today) access the GeoParquet metadata using DuckDB's existing parquet_kv_metadata('path') function.

However, we are currently busy preparing for the next version of DuckDB, scheduled to be released in two weeks, and I don't think my changes so far are going to make it in by then, as there are more pressing PRs and bugfixes to get in. I'll post an update in this thread once I have initial GeoParquet support available on nightly.

Maxxen avatar May 08 '24 14:05 Maxxen
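
As a rough illustration of what that metadata looks like once retrieved: the GeoParquet specification stores a JSON document under the Parquet key-value metadata key "geo", with per-column "encoding", "geometry_types", and optional "bbox" and "crs" fields. The Python sketch below parses such a value. It assumes the value arrives as UTF-8 bytes; the sample document, its bbox numbers, and the function name are hypothetical.

```python
import json

# Hypothetical stand-in for the value stored under the "geo" key.
geo_value = json.dumps({
    "version": "1.0.0",
    "primary_column": "geometry",
    "columns": {
        "geometry": {
            "encoding": "WKB",
            "geometry_types": ["LineString"],
            "bbox": [112.9, -43.7, 153.6, -10.7],
            # No "crs" key in this sample.
        }
    },
}).encode("utf-8")


def summarize_geo(raw: bytes) -> dict:
    """Summarize each geometry column described in a "geo" metadata value."""
    meta = json.loads(raw.decode("utf-8"))
    summary = {}
    for name, col in meta["columns"].items():
        if "crs" not in col:
            # Per the GeoParquet spec, an absent "crs" means OGC:CRS84.
            crs = "OGC:CRS84 (spec default)"
        elif col["crs"] is None:
            # An explicit null means the CRS is unknown.
            crs = "unknown"
        else:
            crs = col["crs"]  # a PROJJSON object
        summary[name] = {
            "geometry_types": col["geometry_types"],
            "bbox": col.get("bbox"),
            "crs": crs,
        }
    return summary


print(summarize_geo(geo_value))
```

This only shows how an application could recover the CRS on its own while the GEOMETRY type cannot carry it.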

Hey- any update on this?

ppasquet avatar Jun 04 '24 13:06 ppasquet

Also interested! We are using geoparquet pervasively at @onefact with our campaigns: https://www.payless.health/payless.health-linknyc-campaign.jpg and geospatial work (https://onefact.github.io/new-york-real-estate/ is one example).

jaanli avatar Jun 04 '24 14:06 jaanli

I don't know what Max's plans are but last month I saw a lot of activity around projects trying to add native geometry, spatial indices, more spatial-centric storage to Parquet, ORC, etc... and GPU-friendliness.

  • https://docs.google.com/document/u/0/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/mobilebasic
  • https://github.com/apache/parquet-format/pull/240
  • https://docs.google.com/document/d/1iVFbrRNEzZl8tDcZC81GFt01QJkLJsI9E2NBOt21IRI/edit#heading=h.rt0cvesdzsj7

If any of the above turns into code and formal specifications at some point, there could be a big upgrade on GeoParquet. Especially since it never got spatial-centric indices.

marklit avatar Jun 04 '24 17:06 marklit

there could be a big upgrade on GeoParquet. Especially since it never got spatial-centric indices.

To be clear, the upcoming GeoParquet 1.1 includes native support for spatial partitioning

kylebarron avatar Jun 04 '24 18:06 kylebarron

there could be a big upgrade on GeoParquet. Especially since it never got spatial-centric indices.

To be clear, the upcoming GeoParquet 1.1 includes native support for spatial partitioning

So, in essence GeoParquet 1.1 would provide a native spatial index?

ppasquet avatar Jun 04 '24 19:06 ppasquet

I think that's overselling it a bit, but in essence you get bounding box statistics per row group, which potentially allow you to skip scanning entire groups if the Parquet file is created in such a way that the rows are spatially correlated.

For DuckDB that means you would have to sort, and provide the expected bounds up front, or do another pass over all the input data to calculate the extent first.

Maxxen avatar Jun 04 '24 19:06 Maxxen
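
The point about spatially correlated rows can be sketched in a few lines of Python. This is a toy model under stated assumptions, not DuckDB's or any Parquet reader's implementation: "row groups" are fixed-size slices of a list of points, and the data, group size, and query box are all hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
BBox = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def bbox_of(points: List[Point]) -> BBox:
    """Bounding box statistics for one row group."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def intersects(a: BBox, b: BBox) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def groups_to_scan(points: List[Point], group_size: int, query: BBox) -> int:
    """Count row groups whose bounding box overlaps the query box."""
    groups = [points[i:i + group_size] for i in range(0, len(points), group_size)]
    return sum(1 for g in groups if intersects(bbox_of(g), query))


# Hypothetical data: 100 points on a line, in a scattered (non-spatial) order.
scattered = [(float((i * 37) % 100), 0.0) for i in range(100)]
# The same points sorted so that neighbouring rows are spatially close.
sorted_points = sorted(scattered)

query = (0.0, -1.0, 9.0, 1.0)
sorted_count = groups_to_scan(sorted_points, 10, query)
scattered_count = groups_to_scan(scattered, 10, query)
print(sorted_count, scattered_count)
```

With the sorted input only the group covering the query range has to be read, while the scattered input gives most groups a wide bounding box, so the statistics rule out very little.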

Right, I'd argue the difference lies in spatial "indexing" vs "partitioning", where I consider indexing to mean that the bounding box of every row is known, whereas partitioning means that the bounding box of each chunk is known.

kylebarron avatar Jun 04 '24 20:06 kylebarron
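
One way to picture that distinction, as a hedged Python sketch with hypothetical data: with partitioning, only each chunk's bounding box is available, so a query gets back every row of every overlapping chunk and still has to test rows individually; with per-row boxes, the exact matches can be picked directly.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def intersects(a: BBox, b: BBox) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def union(boxes: List[BBox]) -> BBox:
    """Bounding box of a whole chunk, derived from its rows' boxes."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def partition_candidates(chunks: List[List[BBox]], query: BBox) -> List[BBox]:
    """Partitioning: keep all rows of any chunk whose box overlaps the query."""
    return [row for chunk in chunks if intersects(union(chunk), query) for row in chunk]


def exact_matches(rows: List[BBox], query: BBox) -> List[BBox]:
    """Row-level check: keep only rows whose own box overlaps the query."""
    return [row for row in rows if intersects(row, query)]


# Hypothetical chunks, each holding the bounding boxes of its rows.
chunks = [
    [(0, 0, 1, 1), (8, 8, 9, 9)],
    [(20, 20, 21, 21), (22, 22, 23, 23)],
]
query = (0, 0, 2, 2)

candidates = partition_candidates(chunks, query)  # second chunk is skipped entirely
matches = exact_matches(candidates, query)        # refinement over the survivors
print(len(candidates), matches)
```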

And for the record, @Maxxen, support for GeoParquet 1.1 is coming to DuckDB soon, right?

jatorre avatar Jun 06 '24 17:06 jatorre

Here's the PR for part 1: Minimal GeoParquet 1.0 support.

When the spatial extension is installed and loaded, reading from a geoparquet file through DuckDB's normal parquet functionality will now automatically convert to GEOMETRY. There's also a new GeoParquet copy format that will WKB-encode GEOMETRY columns automatically and write the 2D bbox and geometry_types column-level geoparquet metadata.

https://github.com/duckdb/duckdb/pull/12503

There are a bunch of design decisions for handling the cross-extension dependencies here that I expect to receive a lot of feedback on, but once that gets resolved, moving on to supporting 1.1 should be relatively straightforward.

Maxxen avatar Jun 12 '24 15:06 Maxxen
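
To make the "2D bbox and geometry_types" part concrete, here is a small Python sketch of how such column-level metadata can be derived from a set of geometries and placed in a GeoParquet-style "geo" document. It is not the code from the PR: the input representation (type name plus coordinate list), the column name "geom", and the function name are assumptions made for illustration.

```python
import json
from typing import Dict, List, Tuple

# Hypothetical in-memory representation: (geometry type name, list of (x, y) vertices).
Geometry = Tuple[str, List[Tuple[float, float]]]


def column_metadata(geometries: List[Geometry]) -> Dict[str, object]:
    """Build GeoParquet-style column metadata from a list of geometries."""
    xs = [x for _type, coords in geometries for x, _y in coords]
    ys = [y for _type, coords in geometries for _x, y in coords]
    return {
        "encoding": "WKB",
        # Distinct geometry type names present in the column.
        "geometry_types": sorted({type_name for type_name, _coords in geometries}),
        # 2D extent as [xmin, ymin, xmax, ymax].
        "bbox": [min(xs), min(ys), max(xs), max(ys)],
    }


rows: List[Geometry] = [
    ("Point", [(1.0, 2.0)]),
    ("LineString", [(0.0, 0.0), (3.0, 5.0)]),
]

geo = {
    "version": "1.0.0",
    "primary_column": "geom",
    "columns": {"geom": column_metadata(rows)},
}
print(json.dumps(geo, indent=2))
```

Computing the bbox requires seeing every row before the file footer is written, which fits naturally with Parquet storing its key-value metadata in the footer.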

How are things looking for supporting geoparquet 1.1, in particular arrow encoded geometries? Are the required design-decisions mentioned above resolved? Can we do anything to help push this along?

cfis avatar Jan 31 '25 23:01 cfis

Parquet specification now includes geometries: https://cloudnativegeo.org/blog/2025/02/geoparquet-2.0-going-native/.

frafra avatar Feb 24 '25 16:02 frafra