parquet-java
parquet-java copied to clipboard
PARQUET-2042: Add support for unwrapping common Protobuf wrappers and…
… logical Timestamps, Date, TimeOfDay
Make sure you have checked all steps below.
Jira
- [X] My PR addresses the following Parquet Jira issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
- https://issues.apache.org/jira/browse/PARQUET-XXX
- In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.
Tests
- [X] My PR adds the following unit tests OR does not need testing for this extremely good reason:
Commits
- [X] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
- Subject is separated from body by a blank line
- Subject is limited to 50 characters (not including Jira issue reference)
- Subject does not end with a period
- Subject uses the imperative mood ("add", not "adding")
- Body wraps at 72 characters
- Body explains "what" and "why", not "how"
Documentation
- [X] In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and the classes in the PR contain Javadoc that explain what it does
@mwong38, can you put more information in the Jira on why/what is changed? This is pretty big change and it would help people to review your code.
@shangxinli @mwong38 any way I can help with merge? Currently, proto-to-parquet conversions do not support proper timestamp handling, and it looks like this work addresses the issue.
Let me know.
After proto3 made everything optional, there is no way to know whether a primitive has been set or not. That is, you could no longer represent a "nullable" primitive. (They later brought optional
keyword back, but the damage is done). The solution was to wrap a primitive in a message. For example, to represent a nullable double
, there is the DoubleValue. It's analogous to Java's boxed Double
.
Parquet supports nullable primitives natively. If we were to represent these well-known Protobuf types directly, it will be nested inside a deeper data structure; very inconvenient and a waste of space. What I did here is "unwrap" these primitives and make use of Parquet's nullaibility.
Additionally, I convert other well-known types such as Timestamp and Date to Parquet's. Again, rather than represent the data structure in its raw nested form (seconds/nanos or year, month, day, etc), they are converted to Parquet's native or logical representation of Timestamp and Date. You can turn on or off this features in the ProtoSchemaConverter
constructor.
@shangxinli is there any reason this work shouldn't be merged, given @mwong38 explanation? Is there something I can help with?
I don't see a reason why this cannot be merge. I will have a look soon.
Can you squash all the commits?
@mwong38 let me know if you want me to help in any way
I am not an expert here by any means, but I have made (very small) changes related to default values in protobuf parquet before and the goal of this change seem good. I assume that the behavior for unwrapped primitives remains unchanged?
@belugabehr might have thoughts too.
I am not an expert here by any means, but I have made (very small) changes related to default values in protobuf parquet before and the goal of this change seem good. I assume that the behavior for unwrapped primitives remains unchanged?
@belugabehr might have thoughts too.
@shangxinli any news about merging this version? Are there still any blockers?
Will have another look soon.
@mwong38 Can you address the feedback from @emkornfield before we can merge?
@mwong38 anyway I can help with finalizing this PR?
I think we are close to merge this PR. Resolve the conflict and use the imports , then we can merge.
I think we are close to merge this PR. Resolve the conflict and use the imports , then we can merge.
Done
@shangxinli any reason why this PR hasn't been merged yet?
any reason why this PR hasn't been merged yet?
I just noticed this PR and sorry to see it does not check in. @mwong38 Could you try rebase it one last time? Thanks!
I just noticed this PR and sorry to see it does not check in. @mwong38 Could you try rebase it one last time? Thanks!
Done. I really hope it's the last time.
Merged