spec: remove the longitude wraparound of the geometry type bbox
Rationale for this change
Sync the recent Iceberg spec change to Parquet: https://github.com/apache/iceberg/pull/14250
Geometry type calculation should always be cartesian. A reader / writer should be able to read / write geometry data without understanding CRS.
What changes are included in this PR?
- Remove the longitude wraparound of the geometry type bbox
- Add clarification in the geometry type.
Do these changes have PoC implementations?
No
@wgtmac please review when you have time. Thank you!
@paleolimbot Iceberg uses the underlying Parquet statistics for its own statistics. So if we don't sync the Iceberg change to Parquet, how would Iceberg side handle the bbox wraparound if the underlying Parquet file has one?
I think it's fair that Parquet files with wraparound bounds can't be used in Iceberg V3 tables. Iceberg is one of many, many uses of a Parquet file.
Ok, I think what Dewey said makes sense. I don’t believe we necessarily need to update the Parquet spec, since the Iceberg side can simply avoid using wraparound longitude for geometry bounding boxes. Of course, we’re not Parquet PMC members, so if the Parquet PMC feels strongly about it, we can start a vote or continue iterating on the PR.
I agree with @paleolimbot that Parquet does not have to be bound by Iceberg, especially when there are Parquet implementations with wraparound support. However my concern is that Iceberg may come up with a different solution later which diverges from the Parquet spec.
However my concern is that Iceberg may come up with a different solution later which diverges from the Parquet spec.
I think it's fine to come up with another solution...I don't think there's any solution would change the existing meaning of xmin and xmax for geometry. We currently have a situation where the same math can be used to interpret GeospatialStatistics regardless of the logical type and am just not sure what it will accomplish to diverge from that.
It seems that the two proposals from https://lists.apache.org/thread/x9ll3rhg26mngm10cjn74w66ov23grmm may add additional attributes to the geospatial logical types. I'm not sure it may break the assumption of current Parquet implementations.
BTW, do you want to port the wraparound logic from arrow-rs to arrow-cpp? @paleolimbot
BTW, do you want to port the wraparound logic from arrow-rs to arrow-cpp?
Yes! I had hoped for Arrow 22 but given the feature freeze tomorrow I think that's unlikely (I can definitely ensure this and the null count are updated for Arrow 23).
It seems that the two proposals may add additional attributes to the geospatial logical types.
I don't think either of those options intends to make changes to the Parquet geometry type that would break the assumptions of existing readers (I suppose the second one adds an option to the geography enum, which would require readers to update their enums). From the Parquet end of things the first option is just a way to signal to Parquet via writer options that wraparound_hint should be set.
In any case, I don't think these are planning to assign alternative meaning to xmin and xmax that would make it necessary to force existing implementations to reject xmax > xmin?
Yes! I had hoped for Arrow 22 but given the feature freeze tomorrow I think that's unlikely (I can definitely ensure this and the null count are updated for Arrow 23).
nit: I did the feature freeze for Arrow v22 ~1 hour ago (was scheduled yesterday but gave a little more time). We are still fixing blockers though.