POC: GeoParquet reader refactor to avoid making duplicate wrappers of upstream structs
Right now, we expose a bunch of wrapper structs (`GeoParquetRecordBatchReaderBuilder`, `GeoParquetRecordBatchReader`, `GeoParquetRecordBatchStreamBuilder`, `GeoParquetRecordBatchStream`) that just wrap their upstream counterparts. That's a lot of duplication and maintenance overhead for us, and it doesn't give users the full control over reading Parquet that the upstream implementation provides: with the wrappers, you can't manually choose row group indices, nor apply multiple row filters.
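For a sense of what dropping the wrappers buys us, here's a minimal sketch of driving the upstream builder directly (the file name and option values are placeholders), including manual row group selection:

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // With the wrappers gone, users configure the upstream builder directly.
    let file = File::open("example.parquet")?; // placeholder path
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?
        .with_row_groups(vec![0, 2]) // manually choose row group indices
        .with_batch_size(1024)
        .build()?;
    for batch in reader {
        let _batch = batch?;
    }
    Ok(())
}
```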
For example, @gadomski might want to push other predicates, such as datetime, down into STAC-GeoParquet. But right now, we'd have to add more code to our wrappers to pipe all those predicates through to the underlying readers ourselves.
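A sketch of what that pushdown could look like using only the upstream `RowFilter` API (the file name, column index, timestamp type, and cutoff value are all assumptions; `spatial_pred` stands in for the spatial predicate this refactor would expose):

```rust
use std::fs::File;

use arrow_array::{Scalar, TimestampMillisecondArray};
use parquet::arrow::arrow_reader::{ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter};
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("items.parquet")?; // placeholder path
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
    let schema_descr = builder.metadata().file_metadata().schema_descr_ptr();

    // Assumes the `datetime` column is leaf 0 and stored as timestamp(ms).
    let cutoff = Scalar::new(TimestampMillisecondArray::from(vec![1_700_000_000_000]));
    let datetime_pred = ArrowPredicateFn::new(
        ProjectionMask::leaves(&schema_descr, [0]),
        // A predicate only sees its projected columns, so column 0 here
        // is the `datetime` column.
        move |batch| arrow::compute::kernels::cmp::gt(batch.column(0), &cutoff),
    );

    // A spatial predicate built the same way would just be another entry:
    // RowFilter::new(vec![Box::new(spatial_pred), Box::new(datetime_pred)])
    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(datetime_pred)]))
        .build()?;
    for batch in reader {
        let _batch = batch?;
    }
    Ok(())
}
```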
@paleolimbot got me started thinking about this in https://github.com/geoarrow/geoarrow-rs/pull/1050#pullrequestreview-2770538752, where, for example, in https://github.com/geoarrow/geoarrow-rs/pull/1040 a user might want to write GeoJSON with UUID types, and the end user might want more control over how different types are encoded. Likewise here: if we remove our own wrapper structs and instead insert our functionality into the upstream readers via traits, that gives users more control and greatly simplifies maintenance on our side.
Hopefully this will also make it easier to integrate GeoParquet into DataFusion without duplicating the entire upstream Parquet `TableProvider`.
## Change list
- Removes the wrapper structs. This means significantly lower maintenance overhead here, and we don't need to do anything to support new upstream functionality.
- New `GeoParquetReaderBuilder` trait that extends the upstream `ArrowReaderBuilder`. This allows users to apply a spatial `RowFilter` directly on an upstream builder instance.
- The `GeoParquetReaderBuilder` also helps with accessing the parsed `GeoParquetMetadata` from the Parquet file metadata.
- No schema translations happen by default. This has pros and cons:
  - Pros: it's unclear when the performance benefits of using native arrays of coordinates outweigh accessing WKB coordinates directly. @paleolimbot argues that WKB encoding is faster than I give it credit for. And in some cases, such as when rewriting files or working with non-spatial data in GeoParquet files, parsing the WKB is just a waste of time.
  - Cons: output data won't have GeoArrow metadata by default. If the source data has GeoParquet metadata but no embedded Arrow schema, or no Arrow metadata in the embedded Arrow schema, then the inferred Arrow schema returned by the `ParquetRecordBatchReader` won't have any geo metadata. In theory, we could put a trait on the `ParquetRecordBatchReader`, but I don't think we could change upstream's `Iterator` impl. So this means a user will have to call `apply_geoarrow_metadata` (or a similar name; see the sketch after this list) on the record batches exposed from the iterator:

    ```rust
    let batches = reader
        .map(|batch| apply_geoarrow_metadata(batch?, geoarrow_schema.clone()))
        .collect::<Result<Vec<_>, _>>()
        .unwrap();
    ```

    Or maybe there's some way to override how the upstream builder infers its Arrow schema? There is a top-level `parquet_to_arrow_schema` we could wrap.
- Removed `GeoParquetReaderOptions`. This was just a copy of options that the user can apply directly to `ArrowReaderBuilder`.
- Updated `example` and `example_crs` tests.
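For the cons above, here's one possible shape for `apply_geoarrow_metadata` (hypothetical; the name and signature are not settled). It assumes the caller already has a GeoArrow-annotated schema, e.g. derived by wrapping the top-level `parquet_to_arrow_schema`, and that it differs from the batch's schema only in metadata:

```rust
use arrow_array::RecordBatch;
use arrow_schema::{ArrowError, SchemaRef};

/// Hypothetical helper: re-attach GeoArrow field/schema metadata to a batch
/// coming out of the plain upstream reader. `try_new` validates that the
/// columns still match the schema's data types.
fn apply_geoarrow_metadata(
    batch: RecordBatch,
    geoarrow_schema: SchemaRef,
) -> Result<RecordBatch, ArrowError> {
    RecordBatch::try_new(geoarrow_schema, batch.columns().to_vec())
}
```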
## Todo
- [ ] Expose a public function for an end user to create a spatial `ArrowPredicate`, so if the user wants to apply multiple predicates in a single `RowFilter`, they can do that (see the sketch below).
- [ ] Restore docs and existing tests.
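For the first todo item, the exposed function could look something like this (a stub sketch; the name, signature, and bbox representation are all assumptions, and the body passes every row through where the real version would evaluate the spatial test):

```rust
use arrow_array::BooleanArray;
use parquet::arrow::arrow_reader::{ArrowPredicate, ArrowPredicateFn};
use parquet::arrow::ProjectionMask;
use parquet::schema::types::SchemaDescriptor;

/// Hypothetical public constructor for a spatial predicate. Returning a
/// boxed `ArrowPredicate` lets callers combine it with their own predicates
/// in a single upstream `RowFilter`.
fn spatial_predicate(
    schema_descr: &SchemaDescriptor,
    geometry_leaf: usize,
    bbox: [f64; 4], // (minx, miny, maxx, maxy); placeholder for a real bbox type
) -> Box<dyn ArrowPredicate> {
    let _ = bbox;
    Box::new(ArrowPredicateFn::new(
        ProjectionMask::leaves(schema_descr, [geometry_leaf]),
        // Stub: keep every row. The real version would decode the geometry
        // column and test intersection with `bbox`.
        |batch| Ok(BooleanArray::from(vec![true; batch.num_rows()])),
    ))
}
```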
This is awesome, and thank you for considering it!
I think you have the right idea here... I think of a GeoParquet scan as a normal Parquet scan + a projection (i.e., maybe remove the bbox column by default, maybe transform the geometry column into another type) + a filter (apply a query bbox). In DataFusion, you can express all three as components in a logical plan (slightly easier) or a physical plan (slightly more verbose) and let DataFusion take care of the details. Here, you have to invent those yourself, but it might be worth keeping the syntax similar.
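To make that decomposition concrete, a minimal sketch in DataFusion's DataFrame API (the path, column names, and filter expression are placeholders, with a plain comparison standing in for a real spatial test):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Scan + filter + projection as separate plan nodes; DataFusion decides
    // which pieces get pushed down into the Parquet scan itself.
    let df = ctx
        .read_parquet("example.parquet", ParquetReadOptions::default())
        .await?
        .filter(col("xmin").lt_eq(lit(10.0)))? // stand-in for a real bbox test
        .select_columns(&["geometry", "datetime"])?;
    df.show().await?;
    Ok(())
}
```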
I updated this PR and wrote new documentation. @gadomski @paleolimbot, if you'd be interested in giving a review, I'd appreciate it!
This PR could additionally use tests with bounding box covering columns, but that data doesn't yet exist in geoarrow-data, and the write side of the geoparquet crate doesn't yet support generating those columns, IIRC.