More user control when writing GeoParquet
The current GeoParquet writer seems to lack some user controls. Unless I've missed something, there's no way to specify a primary geometry column (one will be chosen for you), and coverage columns appear to be entirely unimplemented.
I would be willing to take on this development task. My current use case doesn't require multiple geometry columns; however, supporting a user-specified primary geometry column seems like very low-hanging fruit. I already have covering columns from previous processing, so being able to associate a covering column with an existing geometry column is functionality I would like to see implemented. It also seems reasonable to include (optional) covering column generation when that data is desired but does not yet exist. Based on the existing code, enhancing `GeoParquetWriterOptions` seems like the correct path here, but any input is welcome.
Yes, this would be great!
I think the simplest approach would be to generate the covering column when writing the files, since then we can be sure that the designated column actually represents a covering of the specified geometry column.
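For reference, the GeoParquet 1.1 covering is a struct column of per-row bounding boxes (xmin/ymin/xmax/ymax). Here's a minimal sketch of generating one from `geo` geometries with the `arrow` and `geo` crates; the function name is illustrative, and a real implementation would emit nulls rather than NaNs for empty geometries:

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, Float64Array, StructArray};
use arrow_schema::{DataType, Field};
use geo::{BoundingRect, Geometry};

/// Build a GeoParquet 1.1 style "bbox" covering column (a struct of
/// xmin/ymin/xmax/ymax) from a slice of geometries. Illustrative only.
fn build_bbox_covering(geoms: &[Geometry<f64>]) -> StructArray {
    let (mut xmin, mut ymin, mut xmax, mut ymax) =
        (Vec::new(), Vec::new(), Vec::new(), Vec::new());
    for geom in geoms {
        // `bounding_rect` returns None for empty geometries; NaN is used
        // here for simplicity (a real implementation would emit nulls).
        let rect = geom.bounding_rect();
        xmin.push(rect.map_or(f64::NAN, |r| r.min().x));
        ymin.push(rect.map_or(f64::NAN, |r| r.min().y));
        xmax.push(rect.map_or(f64::NAN, |r| r.max().x));
        ymax.push(rect.map_or(f64::NAN, |r| r.max().y));
    }
    let field = |name: &str| Arc::new(Field::new(name, DataType::Float64, false));
    StructArray::from(vec![
        (field("xmin"), Arc::new(Float64Array::from(xmin)) as ArrayRef),
        (field("ymin"), Arc::new(Float64Array::from(ymin)) as ArrayRef),
        (field("xmax"), Arc::new(Float64Array::from(xmax)) as ArrayRef),
        (field("ymax"), Arc::new(Float64Array::from(ymax)) as ArrayRef),
    ])
}
```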
fwiw I'm doing some Parquet prototyping for general non-geo files in https://github.com/kylebarron/arro3/pull/313, and then hopefully using some learnings from that to decide on the right API for GeoParquet here.
That sounds great. I do have some in-progress work on this issue, but unfortunately some high priority items got to the top of my list and I haven't had much time to sit down and really churn this out. As you learn more about how you'd want the API to look I would be happy to start working towards that as I have time. I suspect I'll be able to put a more concerted effort towards this in the late April or early May timeframe.
@kylebarron I'm back to having some additional bandwidth to put towards this effort. As I was reorienting myself with the changes in the broader library, as well as the changes I had already started making a couple of months ago, I realized that some of my new code is unnecessarily awkward, in part due to how the current `GeoParquetWriterOptions` is structured. It currently mixes file-level options, like `writer_properties`, alongside options that could be column-specific, like `encoding` and `crs_transform`. This was further exacerbated by my in-progress work, where I added an optional `primary_geometry_column` (file level) as well as `covering_columns` (column level).
I'm currently thinking it will make both the user API and the resulting code clearer if we separate file-level and column-level writer options. Structuring the options similarly to the GeoParquet metadata specification would give users a familiar shape. Something like the following:
```rust
pub struct GeoParquetWriterOptions {
    pub primary_geometry_column: Option<String>,
    pub writer_properties: Option<WriterProperties>,
    pub column_writer_options: Option<HashMap<String, ColumnWriterOptions>>,
}

pub struct ColumnWriterOptions {
    pub encoding: GeoParquetWriterEncoding,
    pub crs_transform: Option<Box<dyn CrsTransform>>,
    pub covering_column: Option<CoveringInfo>,
}
```
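To make the shape concrete, here's a hypothetical way a caller might fill these in. The `GeoParquetWriterEncoding::WKB` variant and the `CoveringInfo::existing` constructor are illustrative assumptions from the proposal above, not an existing API:

```rust
use std::collections::HashMap;

// Hypothetical usage of the proposed structs: WKB-encode one geometry
// column and associate a pre-existing covering column with it.
fn example_options() -> GeoParquetWriterOptions {
    let mut column_writer_options = HashMap::new();
    column_writer_options.insert(
        "geometry".to_string(),
        ColumnWriterOptions {
            // Encode this column to WKB on write.
            encoding: GeoParquetWriterEncoding::WKB,
            // No CRS transform: keep the column's CRS as-is.
            crs_transform: None,
            // Associate an existing covering column with this geometry.
            covering_column: Some(CoveringInfo::existing("geometry_bbox")),
        },
    );

    GeoParquetWriterOptions {
        primary_geometry_column: Some("geometry".to_string()),
        writer_properties: None,
        column_writer_options: Some(column_writer_options),
    }
}
```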
What are your thoughts on this as a path forward?
I hope the refactor has been easy to follow so far. I think it's unequivocally positive in making a simpler, more robust library.
I'm at a conference this week and don't have a lot of bandwidth, but I'd encourage you to read up on #1089. In particular, I'm thinking that maybe instead of developing full wrapper structs, it may be better to expose only functional/trait helpers, but where users use the upstream parquet APIs directly.
I agree the refactor is moving the library in positive directions. Admittedly, I was mainly focused on the areas that affect this body of work, which were very easy to follow since the files mostly just moved around. I think the revised structure should make the library easier to use (and, importantly, easier to navigate and discover in general).
I reviewed #1089 and I like the direction it's moving. For better or worse, I don't think the current writer implementation has the same limitations the reader does in terms of users being able to utilize all the options from upstream. Since the current `GeoParquetWriterOptions` accepts the upstream `WriterProperties`, and `GeoParquetWriter` has `writer() -> &ArrowWriter<W>`, all controls and methods from upstream are available.
That being said, I do see value in maintaining a similar paradigm between the reader and writer. Fundamentally, writing GeoParquet is mostly a metadata generation exercise, with the (encouraged) option to encode geometries to WKB. It's also worth noting that any mixed geometry column (including a column of geometry collections) MUST be encoded to WKB, because native GeoArrow types are only supported when the entire column is of a single type.

Given those requirements, if we want GeoArrow to act in a more functional manner and then pass off the actual writing to the upstream writer, it seems like we'd probably need a method that takes in a user's `RecordBatch` and some options, and it returns a record batch and metadata that are encoded and ready to use with the upstream writer. I don't think there's any getting away from most of the user options I've already listed above, since they can't really be inferred if a user wants to deviate from the defaults. Perhaps something like the following:
```rust
pub struct GeoParquetMetadataOptions {
    pub primary_geometry_column: Option<String>,
    pub column_writer_options: Option<HashMap<String, ColumnMetadataOptions>>,
}

pub struct ColumnMetadataOptions {
    pub encoding: GeoParquetWriterEncoding,
    pub crs_transform: Option<Box<dyn CrsTransform>>,
    pub covering_column: Option<ArrayRef>,
}

/// Builds a GeoParquet v1.1 compliant covering column for the input geometry array
fn build_covering_column(array: NativeArrayRef) -> Result<ArrayRef>;

/// Converts the input RecordBatch according to `options` and returns a RecordBatch
/// and KeyValue metadata that can be used with an instance of ArrowWriter
fn prepare_recordbatch(
    batch: RecordBatch,
    options: GeoParquetMetadataOptions,
) -> Result<(RecordBatch, GeoParquetMetadata)>;
```
Here, the onus would be on the user to add and associate their generated covering column with the existing record batch.
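To make the handoff concrete, here's a sketch of how the functional API above might compose with the upstream writer. `prepare_recordbatch` and `GeoParquetMetadataOptions` are the proposals above, and this assumes `GeoParquetMetadata` serializes to the JSON for the "geo" key; a single batch is shown for brevity:

```rust
use parquet::arrow::ArrowWriter;

// Sketch of composing the proposed `prepare_recordbatch` with the upstream
// `ArrowWriter`. With multiple batches, the metadata would need to be
// merged across calls rather than taken from a single batch.
fn write_one_batch(
    file: std::fs::File,
    batch: arrow_array::RecordBatch,
    options: GeoParquetMetadataOptions,
) -> Result<(), Box<dyn std::error::Error>> {
    let (encoded, geo_metadata) = prepare_recordbatch(batch, options)?;
    let mut writer = ArrowWriter::try_new(file, encoded.schema(), None)?;
    writer.write(&encoded)?;
    // Attach the "geo" key/value metadata before closing the file.
    writer.append_key_value_metadata(parquet::format::KeyValue::new(
        "geo".to_string(),
        serde_json::to_string(&geo_metadata)?,
    ));
    writer.close()?;
    Ok(())
}
```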
In #1089 I merged a refactor of reading GeoParquet, and now we can consider better APIs for writing GeoParquet.

https://github.com/geoarrow/geoarrow-rs/pull/1214 refactors the write side.
> it seems like we'd probably need a method that takes in a user's `RecordBatch` and some options, and it returns a record batch and metadata that are encoded and ready to use with the upstream writer.
I think https://github.com/geoarrow/geoarrow-rs/pull/1214 is pretty close to this ideal.
A follow up PR will tackle adding covering columns.
> /// Converts the input RecordBatch according to `options` and returns a RecordBatch and KeyValue metadata that can be used with an instance of ArrowWriter
We don't want a `KeyValue` per batch; we want one per file. I think the `GeoParquetRecordBatchEncoder` nicely encapsulates that state.
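For illustration, here's roughly how such an encoder composes with the upstream `ArrowWriter` across many batches. The method names (`target_schema`, `encode_record_batch`, `into_keyvalue`) are assumptions about the encoder's shape, not a confirmed API:

```rust
use parquet::arrow::ArrowWriter;

// Sketch of per-file state: encode every batch, then emit one KeyValue
// for the whole file rather than one per batch.
fn write_file<W: std::io::Write + Send>(
    sink: W,
    batches: &[arrow_array::RecordBatch],
    mut encoder: GeoParquetRecordBatchEncoder,
) -> Result<(), Box<dyn std::error::Error>> {
    let mut writer = ArrowWriter::try_new(sink, encoder.target_schema(), None)?;
    for batch in batches {
        // Per-batch encoding (e.g. to WKB) while the encoder accumulates
        // file-level state such as overall bounds per geometry column.
        writer.write(&encoder.encode_record_batch(batch)?)?;
    }
    // One "geo" KeyValue for the entire file.
    writer.append_key_value_metadata(encoder.into_keyvalue()?);
    writer.close()?;
    Ok(())
}
```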
Also, we don't need a CRS transform per column; each column has a CRS, but we only need one global implementation of how to convert a CRS to PROJJSON.
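In other words, the transform can live at the file level. A minimal hypothetical sketch of that shape (the crate's actual `CrsTransform` trait may look different):

```rust
use serde_json::Value;

// Hypothetical file-level trait: one implementation converts any column's
// CRS to PROJJSON, rather than configuring a transform per column.
pub trait CrsToProjjson {
    /// Convert a CRS identifier such as "EPSG:4326" into a PROJJSON object.
    fn to_projjson(&self, crs: &str) -> Result<Value, Box<dyn std::error::Error>>;
}
```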
> Here, the onus would be on the user to add and associate their generated covering column with the existing record batch.
I think it would be better to generate that as part of the encoding process, so that we can be assured the covering data is accurate for the given record batch. Perhaps in the future we can add another lower-level API for users to pass pre-existing bounding box data if they have it.
https://github.com/geoarrow/geoarrow-rs/pull/1216 adds support for writing a covering column.