iceberg icon indicating copy to clipboard operation
iceberg copied to clipboard

spec: add clarification about the Geometry type calculation

Open jiayuasu opened this issue 6 months ago • 12 comments

Background

Several engines including Spark/Sedona and Presto explicitly mention the calculations on geometries are done via Cartesian mathematics. It would be good if we add clarification to the Iceberg spec

Proposed change

Added the following sentence

and geometric calculations are performed using Cartesian mathematics along straight lines

jiayuasu avatar Jun 03 '25 22:06 jiayuasu

@rdblue @szehon-ho please review :-)

jiayuasu avatar Jun 03 '25 22:06 jiayuasu

@jiayuasu Thank you for driving this!

I am not able to comment directly at appropriate place since there are no changes there, which is why I writing this as a high level comment in the PR.

In line 684 of the existing spec (prior to this PR) we state:

For geometry and geography types, lower_bounds and upper_bounds are both points of the following coordinates X, Y, Z, and M (see Appendix G) which are the lower / upper bound of all objects in the file. For the X values only, xmin may be greater than xmax, in which case an object in this bounding box may match if it contains an X such that x >= xmin ORx <= xmax. In geographic terminology, the concepts of xmin, xmax, ymin, and ymax are also known as westernmost, easternmost, southernmost and northernmost, respectively. For geography types, these points are further restricted to the canonical ranges of [-180 180] for X and [-90 90] for Y.

I believe that in order to make the spec self-consistent it would make sense to rephrase this and specify that xmin can be greater than xmax for geographies only.

How about the following slight modification?

For geometry and geography types, lower_bounds and upper_bounds are both points of the following coordinates X, Y, Z, and M (see Appendix G) which are the lower / upper bound of all objects in the file.

For geography, for the X values only, xmin may be greater than xmax, in which case an object in this bounding box may match if it contains an X such that x >= xmin OR x <= xmax. In geographic terminology, the concepts of xmin, xmax, ymin, and ymax are also known as westernmost, easternmost, southernmost and northernmost, respectively. These points are further restricted to the canonical ranges of [-180..180] for X and [-90..90] for Y.

mkaravel avatar Jun 05 '25 00:06 mkaravel

@mkaravel I think this is relevant to our discussion in Slack. After giving some thoughts, I think it makes sense to consider that together with the linear option in Geography type. I guess we can make another PR to discuss that specific issue? This PR is just to add 1 sentence clarification about the geometric calculation

jiayuasu avatar Jun 05 '25 19:06 jiayuasu

@jiayuasu, can you help me understand why we wouldn't make the clarifications @mkaravel suggests in this PR?

If we don't expect the CRS to an effect on calculations for geometry and only geography is restricted to [-180, 180] then it makes sense to me that geometry bboxes should never have xmin > xmax. Clarifying that here where we are already noting that the calculations are purely Cartesian makes sense to me!

rdblue avatar Jun 06 '25 01:06 rdblue

@rdblue I think it makes sense to drop xmin > xmax wraparound in geometry if we don't expect CRS to have an effect.

On the other hand, I'd like to also add linear interpolation to geography if we want to drop this. This is an idea proposed by @mkaravel before. Because there will be use cases that needs linear interpolation (cartesian mathematics) on antimeridian crossing geometries on geographic CRS.

I am fine adding the sentence to disallow wraparound here and create another PR to add linear interpolation although I think the 2 issues are interconnected. Or, I can put everything in this PR instead so we can solve these issues together

Please let me know what you think

jiayuasu avatar Jun 06 '25 05:06 jiayuasu

@rdblue I think it makes sense to drop xmin > xmax wraparound in geometry if we don't expect CRS to have an effect.

On the other hand, I'd like to also add linear interpolation to geography if we want to drop this. This is an idea proposed by @mkaravel before. Because there will be use cases that needs linear interpolation (cartesian mathematics) on antimeridian crossing geometries on geographic CRS.

@jiayuasu I am obviously (since I have suggested it in the past) totally on board with this plan. This will make the semantics of the types very clean (geometry always Cartesian, calculations independent of the CRS) and allow geography to have an easy, practical, and meaningful way to deal with the antimeridian geometries (also very clean IMHO).

I am fine adding the sentence to disallow wraparound here and create another PR to add linear interpolation although I think the 2 issues are interconnected. Or, I can put everything in this PR instead so we can solve these issues together

My thinking when I proposed the changes above in this PR was to make it self-consistent and self-contained regarding the geometry type. We can follow up with a separate PR for geography or add the changes here. Either way is fine with me. Maybe a single PR is just simpler.

@rdblue Are you okay with this? Thoughts?

mkaravel avatar Jun 06 '25 14:06 mkaravel

That's great there's agreement from @jiayuasu on adding the linear algorithm to Geography . Im also ok either way to add the change here or in another pr. If we have agreement we can do it here to minimize disruptions, as the changes seem related.

The only thing to check I had is, is it safe to add a new Edge Interpolation algorithm after V3 is adopted. I am thinking yes, as implementations would just not be able to read if its a CRS or Edge algorithm they dont know.

The other thing is, I guess adding the new algorithm seems its large enough to requires a vote. cc @rdblue for thoughts?

szehon-ho avatar Jun 10 '25 16:06 szehon-ho

I agree that we can move forward here and then discuss the change to add linear separately. I think clarifying how geometry works (that it is always independent of the CRS and cartesian) is independently valuable. Then we can follow up to add linear if it is needed next.

rdblue avatar Jun 11 '25 21:06 rdblue

Let's finalize this PR then by completing Cartesian semantics for GEOMETRY and do GEOGRAPHY as a follow up.

mkaravel avatar Jun 11 '25 23:06 mkaravel

Any chance linear interpolation for GEOGRAPHY can make it to Iceberg v3?

mkaravel avatar Jun 11 '25 23:06 mkaravel

Ping geo folks and see if they have comments

@jorisvandenbossche @paleolimbot @kylebarron @cholmes @migurski @pramsey

jiayuasu avatar Jun 16 '25 18:06 jiayuasu

Thank you for the ping!

Just a note that I don't think this clarification is needed...a Cartesian bounding box like the one described here is perfectly legal in for the geometry type in GeoArrow's box type, GeoParquet's file metadata, GeoJSON, Parquet GeoStatistics, and the current Iceberg spec (even though all of these allow for the wraparound case as well). I personally found it to be a nice simplification that evaluating a filter expression against a Parquet GeoStatistics is the same whether the type is GEOMETRY or GEOGRAPHY.

I'll save my comments regarding the possibility of a "linear" edge algorithm for the GEOGRAPHY type for the PR implementing that change (the short version is: call it something other than GEOGRAPHY because that's not what that type means in any other database, file format, or spec of which I am aware).

paleolimbot avatar Jun 16 '25 20:06 paleolimbot

Just a note that I don't think this clarification is needed...a Cartesian bounding box like the one described here is perfectly legal in for the geometry type in GeoArrow's box type, GeoParquet's file metadata, GeoJSON, Parquet GeoStatistics, and the current Iceberg spec (even though all of these allow for the wraparound case as well).

How does a writer know what bounding box to produce? In all of those specs, is it the writer's decision whether to use a purely cartesian bounding box or one that wraps around?

What I want to avoid is needing to interpret the CRS to know how to compute the bounding box. If writers are allowed to produce a box that is cartesian and ignore whether longitude is periodic, then maybe that is fine because it is the norm. But my expectation is that this would be determined by the CRS and if we wanted to produce a simpler bounding box it would be incorrect without this change.

rdblue avatar Jul 07 '25 22:07 rdblue

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

github-actions[bot] avatar Aug 07 '25 00:08 github-actions[bot]

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions[bot] avatar Aug 15 '25 00:08 github-actions[bot]