POC: GeoParquet reader refactor to avoid making duplicate wrappers of upstream structs
Right now, we expose a bunch of wrapper structs (`GeoParquetRecordBatchReaderBuilder`, `GeoParquetRecordBatchReader`, `GeoParquetRecordBatchStreamBuilder`, `GeoParquetRecordBatchStream`) that just wrap their upstream counterparts. That's a lot of duplication and maintenance overhead for us, and it doesn't give users the full control over reading Parquet that the upstream implementation provides: with the wrappers, you can't manually choose row group indices, nor apply multiple row filters.
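For a sense of what dropping the wrappers buys us, here's a minimal sketch of driving the upstream builder directly (the file name and option values are placeholders), including manual row group selection:

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // With the wrappers gone, users configure the upstream builder directly.
    let file = File::open("example.parquet")?; // placeholder path
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?
        .with_row_groups(vec![0, 2]) // manually choose row group indices
        .with_batch_size(1024)
        .build()?;
    for batch in reader {
        let _batch = batch?;
    }
    Ok(())
}
```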
For example, @gadomski might want to push other predicates, such as datetime, down into STAC-GeoParquet. But right now, we'd have to add more code to our wrappers to pipe all those predicates through to the underlying readers ourselves.
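A sketch of what that pushdown could look like using only the upstream `RowFilter` API (the file name, column index, timestamp type, and cutoff value are all assumptions; `spatial_pred` stands in for the spatial predicate this refactor would expose):

```rust
use std::fs::File;

use arrow_array::{Scalar, TimestampMillisecondArray};
use parquet::arrow::arrow_reader::{ArrowPredicateFn, ParquetRecordBatchReaderBuilder, RowFilter};
use parquet::arrow::ProjectionMask;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("items.parquet")?; // placeholder path
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
    let schema_descr = builder.metadata().file_metadata().schema_descr_ptr();

    // Assumes the `datetime` column is leaf 0 and stored as timestamp(ms).
    let cutoff = Scalar::new(TimestampMillisecondArray::from(vec![1_700_000_000_000]));
    let datetime_pred = ArrowPredicateFn::new(
        ProjectionMask::leaves(&schema_descr, [0]),
        // A predicate only sees its projected columns, so column 0 here
        // is the `datetime` column.
        move |batch| arrow::compute::kernels::cmp::gt(batch.column(0), &cutoff),
    );

    // A spatial predicate built the same way would just be another entry:
    // RowFilter::new(vec![Box::new(spatial_pred), Box::new(datetime_pred)])
    let reader = builder
        .with_row_filter(RowFilter::new(vec![Box::new(datetime_pred)]))
        .build()?;
    for batch in reader {
        let _batch = batch?;
    }
    Ok(())
}
```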
@paleolimbot got me started thinking about this in https://github.com/geoarrow/geoarrow-rs/pull/1050#pullrequestreview-2770538752, where, for example, in https://github.com/geoarrow/geoarrow-rs/pull/1040 a user might want to write GeoJSON with UUID types, and the end user might want more control over how different types are encoded. Likewise here: if we remove our own wrapper structs and instead insert our functionality into the upstream readers via traits, that gives users more control and greatly simplifies maintenance on our side.
Hopefully this will also make it easier to integrate GeoParquet into DataFusion without duplicating the entire upstream Parquet `TableProvider`.
## Change list
- Removes the wrapper structs. This means significantly lower maintenance overhead here, and we don't need to do anything to support new upstream functionality.
- New `GeoParquetReaderBuilder` trait that extends the upstream `ArrowReaderBuilder`. This allows users to apply a spatial `RowFilter` directly on an upstream builder instance.
- The `GeoParquetReaderBuilder` also helps with accessing the parsed `GeoParquetMetadata` from the Parquet file metadata.
- No schema translations happen by default. This has pros and cons:
  - Pros: it's unclear when the performance benefits of using native arrays of coordinates outweigh accessing WKB coordinates directly. @paleolimbot argues that WKB encoding is faster than I give it credit for. And in some cases, such as when rewriting files or working with non-spatial data in GeoParquet files, parsing the WKB is just a waste of time.
  - Cons: output data won't have GeoArrow metadata by default. If the source data has GeoParquet metadata but no embedded Arrow schema, or no Arrow metadata in the embedded Arrow schema, then the inferred Arrow schema returned by the `ParquetRecordBatchReader` won't have any geo metadata. In theory, we could put a trait on the `ParquetRecordBatchReader`, but I don't think we could change upstream's `Iterator` impl. So this means a user will have to call `apply_geoarrow_metadata` (or a similar name; see the sketch after this list) on the record batches exposed from the iterator:

    ```rust
    let batches = reader
        .map(|batch| apply_geoarrow_metadata(batch?, geoarrow_schema.clone()))
        .collect::<Result<Vec<_>, _>>()
        .unwrap();
    ```

    Or maybe there's some way to override how the upstream builder infers its Arrow schema? There is a top-level `parquet_to_arrow_schema` we could wrap.
- Removed `GeoParquetReaderOptions`. This was just a copy of options that the user can apply directly to `ArrowReaderBuilder`.
- Updated `example` and `example_crs` tests.
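For the cons above, here's one possible shape for `apply_geoarrow_metadata` (hypothetical; the name and signature are not settled). It assumes the caller already has a GeoArrow-annotated schema, e.g. derived by wrapping the top-level `parquet_to_arrow_schema`, and that it differs from the batch's schema only in metadata:

```rust
use arrow_array::RecordBatch;
use arrow_schema::{ArrowError, SchemaRef};

/// Hypothetical helper: re-attach GeoArrow field/schema metadata to a batch
/// coming out of the plain upstream reader. `try_new` validates that the
/// columns still match the schema's data types.
fn apply_geoarrow_metadata(
    batch: RecordBatch,
    geoarrow_schema: SchemaRef,
) -> Result<RecordBatch, ArrowError> {
    RecordBatch::try_new(geoarrow_schema, batch.columns().to_vec())
}
```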
## Todo
- [ ] Expose a public function for an end user to create a spatial `ArrowPredicate`, so if the user wants to apply multiple predicates in a single `RowFilter`, they can do that (see the sketch below).
- [ ] Restore docs and existing tests.
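For the first todo item, the exposed function could look something like this (a stub sketch; the name, signature, and bbox representation are all assumptions, and the body passes every row through where the real version would evaluate the spatial test):

```rust
use arrow_array::BooleanArray;
use parquet::arrow::arrow_reader::{ArrowPredicate, ArrowPredicateFn};
use parquet::arrow::ProjectionMask;
use parquet::schema::types::SchemaDescriptor;

/// Hypothetical public constructor for a spatial predicate. Returning a
/// boxed `ArrowPredicate` lets callers combine it with their own predicates
/// in a single upstream `RowFilter`.
fn spatial_predicate(
    schema_descr: &SchemaDescriptor,
    geometry_leaf: usize,
    bbox: [f64; 4], // (minx, miny, maxx, maxy); placeholder for a real bbox type
) -> Box<dyn ArrowPredicate> {
    let _ = bbox;
    Box::new(ArrowPredicateFn::new(
        ProjectionMask::leaves(schema_descr, [geometry_leaf]),
        // Stub: keep every row. The real version would decode the geometry
        // column and test intersection with `bbox`.
        |batch| Ok(BooleanArray::from(vec![true; batch.num_rows()])),
    ))
}
```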
This is awesome, and thank you for considering it!
I think you have the right idea here... I think of a GeoParquet scan as a normal Parquet scan + a projection (i.e., maybe remove the bbox column by default, maybe transform the geometry column into another type) + a filter (apply a query bbox). In DataFusion, you can express all three as components in a logical plan (slightly easier) or a physical plan (slightly more verbose) and let DataFusion take care of the details. Here, you have to invent those yourself, but it might be worth keeping the syntax similar.
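To make that decomposition concrete, a minimal sketch in DataFusion's DataFrame API (the path, column names, and filter expression are placeholders, with a plain comparison standing in for a real spatial test):

```rust
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Scan + filter + projection as separate plan nodes; DataFusion decides
    // which pieces get pushed down into the Parquet scan itself.
    let df = ctx
        .read_parquet("example.parquet", ParquetReadOptions::default())
        .await?
        .filter(col("xmin").lt_eq(lit(10.0)))? // stand-in for a real bbox test
        .select_columns(&["geometry", "datetime"])?;
    df.show().await?;
    Ok(())
}
```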
I updated this PR and wrote new documentation. @gadomski @paleolimbot, if you'd be interested in giving a review, I'd appreciate it!
This PR could additionally use tests with bounding box covering columns, but that data doesn't yet exist in geoarrow-data, and the write side of the geoparquet crate doesn't yet support generating those columns, IIRC.