parquet-java icon indicating copy to clipboard operation
parquet-java copied to clipboard

PARQUET-2417: Add support for geometry logical type

Open zhangfengcdt opened this issue 1 year ago • 1 comments

This PR is to provide a POC to support the proposed changes to the parquet-format to add geometry type to parquet.

Here is the proposal: https://github.com/apache/parquet-format/pull/240

Jira

  • [X] My PR addresses the following Parquet Jira issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
    • https://issues.apache.org/jira/browse/PARQUET-2471

Tests

  • [X] My PR adds the following unit tests: TestGeometryTypeRoundTrip

Commits

  • [X] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • [ ] In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

zhangfengcdt avatar Jul 26 '24 15:07 zhangfengcdt

CC: @jiayuasu @Kontinuation @wgtmac

zhangfengcdt avatar Jul 26 '24 15:07 zhangfengcdt

@wgtmac please take a look :-)

jiayuasu avatar Feb 13 '25 05:02 jiayuasu

Sure, I will take a look. Thanks!

wgtmac avatar Feb 13 '25 06:02 wgtmac

I am depending on this PR to build geo support for iceberg. I got lots of test failures when building this branch locally:

java.lang.NullPointerException
	at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:965)
	at org.apache.parquet.format.converter.ParquetMetadataConverter.buildColumnChunkMetaData(ParquetMetadataConverter.java:1750)
	at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:1848)
	at org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:1728)
	at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:629)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:934)
	at org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:925)
	at org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:698)

NPE is thrown when reading parquet files without geo columns. Can we apply the following patch to resolve this problem?

diff --git a/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java b/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
index 3efc9345..22e51783 100644
--- a/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
+++ b/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
@@ -961,6 +961,9 @@ public class ParquetMetadataConverter {
 
   static org.apache.parquet.column.statistics.geometry.GeospatialStatistics fromParquetStatistics(
       GeospatialStatistics formatGeomStats, PrimitiveType type) {
+    if (formatGeomStats == null) {
+      return null;
+    }
     org.apache.parquet.column.statistics.geometry.BoundingBox bbox = null;
     if (formatGeomStats.isSetBbox()) {
       BoundingBox formatBbox = formatGeomStats.getBbox();

Kontinuation avatar Feb 17 '25 10:02 Kontinuation

@mkaravel Thanks for the great comments.

The scope of this PR is to

  1. Add basic type read / write of Geometry and Geography type.
  2. Add the bbox calculation of Geometry, not Geography

I agree that, it is probably too much for this PR to do Geometry bbox wraparound and special handling of CRSs. @zhangfengcdt has some other questions related to NaN values. I'll leave those to him to ask.

jiayuasu avatar Mar 28 '25 22:03 jiayuasu

@mkaravel Thanks for the great comments.

The scope of this PR is to

  1. Add basic type read / write of Geometry and Geography type.
  2. Add the bbox calculation of Geometry, not Geography

I agree that, it is probably too much for this PR to do Geometry bbox wraparound and special handling of CRSs. @zhangfengcdt has some other questions related to NaN values. I'll leave those to him to ask.

@mkaravel Thanks for your detailed review and they really helped. As @jiayuasu said, we have removed some logic to handle the wraparound logic in the bbox stat calculation and make this PR really just focus on basic types and simple bbox boundary calculation for geometry type.

I see there were a lot of concerns about using -inf/inf as the initial states for the bbox, and we have reomved that and only keep the NaN for invalid/initial states in the bbox class. Also talking to @paleolimbot to sync this behavior with the C++ implementations as well.

Hope this addresses some of your concerns, but let me know if there are still concerns or comments.

Thank you again for your review!

zhangfengcdt avatar Mar 29 '25 16:03 zhangfengcdt

@zhangfengcdt Thank you for addressing the comments!

Regarding the bounding boxes, I believe I did not make myself clear enough as to what I was suggesting. I think that using +/- infinity as initialization is a great idea and it helps with the code logic and can simplify it. If I understood the logic correctly, what was strange for me was exposing bounding boxes with +/- infinity values (which in the previous implementation could happen if you had a file with just empty geometries (and possibly NULLs). Roughly speaking this is code flow I had in mind when computing boxes:

  • To compute the bounding box for a row group initialize this box to (inf, -inf, inf, -inf).
  • While looking at rows: If the geometry is empty skip it. If the geometry is not empty then merge its bounding box with the current bounding box for the group. This is where the initialization with infinite values is handy because you do not need to check for infinity explicitly. For example if the current bbox for the row group is (inf, -inf, inf, -inf) and the bbox of the current geometry is (1, 2, 3, 4), you can simply do (I am writing conceptual code here) rowGroup.xMin = Math.Min(rowGroup.xMin, geo.XMin), rowGroup.xMax = Math.Max(rowGroup.xMax, geo.XMax) and similarly for the other dimension(s). These statements (and this was my point) work if the initial coordinates of the row group bbox are +/- infinity (because this is how IEEE 754 floating point arithmetic works).
  • If at the end of processing a row group you still see a bbox of (inf, -inf, inf, -inf) then expose it as (NaN, NaN, NaN, NaN). This means you have not found any non-empty geometry...

I will go over the code again, probably tomorrow and make specific suggestions, if needed, regarding what I am referring to above.

@paleolimbot Could you please share a link to the C++ implementation? Would be very interested in taking a look.

mkaravel avatar Mar 30 '25 15:03 mkaravel

Thank you for the review!

If at the end of processing a row group you still see a bbox of (inf, -inf, inf, -inf) then expose it as (NaN, NaN, NaN, NaN). This means you have not found any non-empty geometry...

Probably the main issue here is that in the Thrift BoundingBox, only zmin/max and mmin/max are optional, and I forgot when reviewing that about the 100% EMPTY case for x and y. The entire BoundingBox is technically optional in Thrift, but we use an "unset" BoundingBox to definitively mean that there were no x/y values it needs to be very clear in the spec. Either way we should add a clarification that unset zmin/max and mmin/max can only occur if there were no non-NaN Z or M values encountered (if that is indeed what we meant by that).

If NaNs are the convention we land on (instead of unset BoundingBox), it's critical that that NaNs are not generated for any other case (for example, it's critical to ensure that a column chunk containing the invalid geometry LINESTRING (1 2, nan nan, 3 4), does not expose NaN, NaN, NaN, NaN as its x--y statistics, because that would result in a predicate pushdown skipping the entire array which may contain other values that do satisfy the predicate). If my reading of the current state of this PR is correct, this still needs to be fixed here.

I personally like exposing inf, -inf, inf, -inf better than NaN (currently what C++ does) because it is impossible (I think) to generate this by accident and "just works" with the formula provided in the Parquet format for interval containment (we only provided the formula for x, but it's reasonable for somebody to read it and apply the non-wraparound version to other dimensions).

Could you please share a link to the C++ implementation? Would be very interested in taking a look.

The PR is here: https://github.com/apache/arrow/pull/45459 ...and the files are geospatial_statistics.h/.cc and geospatial_util_internal.cc. Comments welcome!

paleolimbot avatar Mar 30 '25 20:03 paleolimbot

Thank you for the review!

If at the end of processing a row group you still see a bbox of (inf, -inf, inf, -inf) then expose it as (NaN, NaN, NaN, NaN). This means you have not found any non-empty geometry...

Probably the main issue here is that in the Thrift BoundingBox, only zmin/max and mmin/max are optional, and I forgot when reviewing that about the 100% EMPTY case for x and y. The entire BoundingBox is technically optional in Thrift, but we use an "unset" BoundingBox to definitively mean that there were no x/y values it needs to be very clear in the spec. Either way we should add a clarification that unset zmin/max and mmin/max can only occur if there were no non-NaN Z or M values encountered (if that is indeed what we meant by that).

If NaNs are the convention we land on (instead of unset BoundingBox), it's critical that that NaNs are not generated for any other case (for example, it's critical to ensure that a column chunk containing the invalid geometry LINESTRING (1 2, nan nan, 3 4), does not expose NaN, NaN, NaN, NaN as its x--y statistics, because that would result in a predicate pushdown skipping the entire array which may contain other values that do satisfy the predicate). If my reading of the current state of this PR is correct, this still needs to be fixed here.

I personally like exposing inf, -inf, inf, -inf better than NaN (currently what C++ does) because it is impossible (I think) to generate this by accident and "just works" with the formula provided in the Parquet format for interval containment (we only provided the formula for x, but it's reasonable for somebody to read it and apply the non-wraparound version to other dimensions).

Could you please share a link to the C++ implementation? Would be very interested in taking a look.

The PR is here: apache/arrow#45459 ...and the files are geospatial_statistics.h/.cc and geospatial_util_internal.cc. Comments welcome!

@paleolimbot Thank you for the detailed answer. You are touching upon a few important things, I think we need to figure them out.

First at a very high level: I think we should update the Parquet spec (and Iceberg I would say) to explicitly define the behavior in the edge case where the column has all its geometries empty. I also think we should first define the behavior we want and then move on with the implementation rather than try to do it the other way around.

I can see the following interesting cases/considerations which are of course can be combined (and thank you for bringing up the issue with geometries with invalid coordinates):

  • NULL values.
  • Empty geometries.
  • Geometries with invalid coordinates.

My understanding is that for the purposes of statistics computation we want to skip all such rows. Another option would be to fail reading or writing when we have geometries with invalid coordinates, but I would like to hear opinions about this direction which is quite invasive. So for now I would assume that we do not want to fail in such cases, but rather compute the statistics in a safe manner.

So the next question is how do we compute bounding boxes in the presence of such geometries (empty and invalid). Regarding empty geometries I proposed to use NaNs for the box values because NaNs are also typically used to represent the coordinates of empty points (JTS does that). It made more sense to me to have this "symmetry" (for example think of extracting the west-south corner of a box as a point: if you use NaNs you immediately get an empty point which makes sense to be conceptually). This is why I proposed to use NaNs for the case of entire column of empty geometries.

As you very well pointed out, this leaves us with a question about how to deal with geometries with invalid coordinates (the question is even more interesting for geographies, but let's leave this for another thread/discussion). I can see two kinds of invalid coordinates: NaNs and infinity values. Everything else (for geometries) should be considered as valid.

Let's assume now that we have a bounding box computation algorithm that will just compare coordinates using any double comparison operator except != (I believe the JTS implementation does that). I am also assuming that a bounding box computation algorithm will also skip empty point, empty points in multipoints, and empty points inside geometry collections, appearing either as points or points inside a multipoint (I believe the JTS implementation also does that). In this case any NaN coordinates in the input will surface as NaNs in the resulting box. Inf coordinates may or may not surface up. In the end for a given geometry we end up with a bounding box that possibly has NaN and inf coordinates. By thinking is that we can ignore geometries (for the row group bounding box computation) that have at least one NaN coordinate in their box and keep the inf coordinates as they are. So basically, the flow for computing the bounding box for a group of geometries would be like:

  • Initialize the box to (inf, -inf, inf, -inf).
  • Compute the bounding box of the current geometry.
  • If any of the box's coordinates is NaN ignore that box. Otherwise merge it with the existing box for the group.
  • If the box is still (inf, -inf, inf, -inf) output (NaN, NaN, NaN, NaN) for the group. Otherwise, output the box we have already computed.

WDYT?

The important step IMHO here is the phrase "If any of the box's coordinates is NaN ignore that box". We decide what to do with the box of a given geometry by looking at all its coordinates, rather than doing it separately per coordinate.

mkaravel avatar Apr 04 '25 13:04 mkaravel

I took a pass through here to check the corner case behaviour against the C++ PR after rewriting the example files to deal more intentionally with the issues we've run into ( apache/parquet-testing#70 ). I think the only inconsistency left is that a completely empty column chunk here will export xmin/ymin/xmax/ymax as either all zeroes or all NaNs to Thrift (whatever happens if you don't set a required Thrift field). In C++ I export these as Inf/Inf/-Inf/-Inf...I still think this is the least ambiguous way to do this but any other convention you all agree (e.g., all NaNs or unsetting the entire BoundingBox) is fine with me.

I still think that returning (inf, -inf, inf, -inf) will be confusing. Technically it is not a valid bounding geometric box because the order of the bounds is reversed. Personally I prefer returning NaNs instead (see also my comment above).

mkaravel avatar Apr 04 '25 13:04 mkaravel

This is why I proposed to use NaNs for the case of entire column of empty geometries.

This works for me, although I'd like a +1 from @wgtmac before changing the C++ implementation + test files! This is also consistent with what R and numpy will give you if you try to take the max() of an empty range.

In this case any NaN coordinates in the input will surface as NaNs in the resulting box.

This is only true for JTS (GEOS just ignores NANs when computing a min/max for a dimension, lwgeom/PostGIS restarts interval computation after it sees an nan). I think your strategy is a good one for JTS, but I also think it's OK to do anything that won't result in accidentally excluding the entire row roup (i.e., a writer MAY choose to either include or exclude finite coordinates from geometries that contain nan values when writing statistics, or non-points that contain NaN values have undefined behaviour but shouldn't affect valid geometries in the same row group).

So for now I would assume that we do not want to fail in such cases, but rather compute the statistics in a safe manner.

Yes, I think this is best for geometries with NaNs.

paleolimbot avatar Apr 04 '25 14:04 paleolimbot

@paleolimbot Do I understand that we are in agreement with the strategy then? @wgtmac What is your opinion?

mkaravel avatar Apr 05 '25 15:04 mkaravel

I think your strategy is a good one for JTS, but I also think it's OK to do anything that won't result in accidentally excluding the entire row roup (i.e., a writer MAY choose to either include or exclude finite coordinates from geometries that contain nan values when writing statistics, or non-points that contain NaN values have undefined behaviour but shouldn't affect valid geometries in the same row group).

If you include coordinates values for geometries that contain unexpected/invalid NaN coordinates the bounding boxes can only get bigger. Although it would depend on the engine, I would expect such a situation to not affect query results for valid geometries. In general, if geometries with invalid coordinates are in the data the behavior should really be considered undefined from the query engine's perspective, and to be honest whatever this implementation does is okay as long as:

  • It does not expose these NaN values in the output bounding box at the storage level.
  • It does not skip valid geometries in the same group (which was one of your comments which I totally agree with).

I think what I propose is a simple modification of the existing implementation and satisfies these requirements.

mkaravel avatar Apr 05 '25 15:04 mkaravel

Thanks for the review @mkaravel @paleolimbot @jiayuasu!

Let me clarify some things about Parquet statistics first:

  • Parquet does not consider NaN when collecting min/max values for float/double values. So by no means we can know if any NaN value exists simply for the column statistics.
  • There is an ongoing effort to fix the above issue by introducing a total order: -NaN < -Inf < Negative < -0.0 < +0.0 < Positive < +Inf < NaN
  • Null values do not interfere with min/max values in the column statistics. If all values are null, we can deduce it by checking null_count == value_count to ignore the empty min/max values.

Based on the above facts, I think we don't need to consider null values for geometry values. We still need to take care of NaN and invalid values.

Then I have some questions:

  • What is an invalid value in a geometry feature? NaN? +/-Inf? Anything else?
  • From the above discussion, it seems that +/-Inf are invalid values in terms of a bbox. If that's true, definitely we should not make it as the final bbox to persist in the file. Is NaN a valid value in a bbox? Is it a good way to use NaN values for an empty bbox?
  • Is it a good approach to drop the entire bbox if any NaN or invalid value appears? In this way, we do not fail the writer at the cost of missing bbox. I'm in favor of this so we do not produce any confusing stats to users. It is really hard to downstream users to decide if the provided stats are reliable for predicate push down.

I think @mkaravel's proposal (pasted as below) looks reasonable:

- Initialize the box to (inf, -inf, inf, -inf).
- Compute the bounding box of the current geometry.
- If any of the box's coordinates is NaN ignore that box. Otherwise merge it with the existing box for the group.
- If the box is still (inf, -inf, inf, -inf) output (NaN, NaN, NaN, NaN) for the group. Otherwise, output the box we have already computed.

For the last step, I think it is better to drop the bbox instead of writing all NaNs to confuse users.

wgtmac avatar Apr 11 '25 09:04 wgtmac

Thank you for this!

If all values are null, we can deduce it by checking null_count == value_count to ignore the empty min/max values.

I wasn't able to find a null count for a row group in statistics for all null values (or otherwise) because (at least in C++) the statistics aren't written because the sort order is unknown? The test case in C++ for this is https://github.com/apache/arrow/blob/a14fb07155073c4625e67a8f5ef448fd80b59e65/cpp/src/parquet/column_writer_test.cc#L1999-L2024 .

For the last step, I think it is better to drop the bbox instead of writing all NaNs to confuse users.

I agree...we can also clarify in the comments of the format that an omitted bbox (when GeospatialStatistics exists) occurs if-and-only-if there are no x or y values? (And also that omitted z and/or m statistics occur if-and-only-if there were no z and/or m values, respectively, which is true today in both Java and C++). Provided that there is another way to detect the 100% null case (probably common) I'm not personally concerned about excluding 100% empty row groups (probably not that common).

paleolimbot avatar Apr 11 '25 14:04 paleolimbot

I wasn't able to find a null count for a row group in statistics for all null values (or otherwise) because (at least in C++) the statistics aren't written because the sort order is unknown?

@paleolimbot You're right. I was talking about the statistics in the column chunk metadata which is unfortunately disabled due to unknown sort order. It works for other data types.

we can also clarify in the comments of the format that an omitted bbox (when GeospatialStatistics exists) occurs if-and-only-if there are no x or y values? (And also that omitted z and/or m statistics occur if-and-only-if there were no z and/or m values, respectively, which is true today in both Java and C++)

I think we need do this. Usually the spec should talk about the views from writer and reader. We could suggest writers not to produce bbox if x/y has any NaN and unset z/m axis if any NaN exist and readers to ignore the (malformed) bbox if any NaN value exists.

wgtmac avatar Apr 11 '25 15:04 wgtmac

I think we need do this. Usually the spec should talk about the views from writer and reader. We could suggest writers not to produce bbox if x/y has any NaN and unset z/m axis if any NaN exist and readers to ignore the (malformed) bbox if any NaN value exists.

I am bit confused by this. Let's say that some data have empty geometries on purposes, what does it mean to "skip reading" these? Do we want to pretend that these values do not exist. I find this strange.

mkaravel avatar Apr 18 '25 00:04 mkaravel

Then I have some questions:

What is an invalid value in a geometry feature? NaN? +/-Inf? Anything else? From the above discussion, it seems that +/-Inf are invalid values in terms of a bbox. If that's true, definitely we should not make it as the final bbox to persist in the file. Is NaN a valid value in a bbox? Is it a good way to use NaN values for an empty bbox? Is it a good approach to drop the entire bbox if any NaN or invalid value appears? In this way, we do not fail the writer at the cost of missing bbox. I'm in favor of this so we do not produce any confusing stats to users. It is really hard to downstream users to decide if the provided stats are reliable for predicate push down.

@wgtmac Just wanted to share my view/opinion on this:

  • In a geometry feature NaN or +/-inf values do not make sense. +/-inf values could make sense in a geometric (not geographic) bounding box, but this would be a convention.
  • Given the above statement, +/-inf values in a bounding box could potentially be persisted but this would be corrupt or invalid data. The engine/reader could choose how to handle them.
  • I think using NaN values for representing empty boxes makes a lot of sense and it provides information compared to dropping a box of NaN values (by "drop" I mean write nothing instead). Specifically, if I see no bounding box I am basically forced to believe that I know nothing about my data. If I see an empty box than I know I can safely skip this piece of data for certain operations (like spatial predicates).

Hope this makes sense.

mkaravel avatar Apr 18 '25 01:04 mkaravel

@mkaravel Thanks for the input! Let's discuss this topic in the format PR.

wgtmac avatar Apr 18 '25 06:04 wgtmac

@wgtmac @jiayuasu I have made some clean ups of this PR and added some additional tests to cover the edge cases particularly for the invalid bbox scenarios.

I added a new test cases in TestTypeBuilders which verifies the two new GEOMETRY/GEOGRAPHY types creation.

Also, I added a comprehensive test testMergingRowGroupBoundingBoxes which simulates a real-world scenario provided in the spec clarification pr (https://github.com/apache/parquet-format/pull/494) where a file consists of multiple row groups with various geometric data, including valid points, points with NaN coordinates, and empty geometries.

The TestBoundingBox incudes extensive tests on merging and updating bbox with/without invalid cases.

I think it is now ready to review again, and please let me know if any changes need to be made. Thanks!

zhangfengcdt avatar Apr 25 '25 17:04 zhangfengcdt

PR of adding logical types has been merged. Please rebase this on top of the latest commit, thanks!

wgtmac avatar Apr 29 '25 02:04 wgtmac

PR of adding logical types has been merged. Please rebase this on top of the latest commit, thanks!

Thanks @wgtmac ! I have rebased and updated this PR. It should be ready for review now.

zhangfengcdt avatar Apr 30 '25 13:04 zhangfengcdt

Thanks! I will take a look after back from vacation.

wgtmac avatar Apr 30 '25 15:04 wgtmac

@wgtmac can you review again? Thank you!

jiayuasu avatar May 13 '25 07:05 jiayuasu

@wgtmac Thank you for the very helpful review comments! I believe I’ve addressed them all and have also added and updated some tests. Could you please take another look and let me know if I missed anything or if further changes are needed? Thanks again!

zhangfengcdt avatar May 14 '25 14:05 zhangfengcdt

Thanks for the quick update! Now it generally looks great.

Thanks for the review! The fixes should be all in now.

zhangfengcdt avatar May 16 '25 04:05 zhangfengcdt

I just merged it. Thanks @zhangfengcdt for working on this and everyone for the review!

wgtmac avatar May 18 '25 15:05 wgtmac