parquet-java icon indicating copy to clipboard operation
parquet-java copied to clipboard

PARQUET-2042: Add support for unwrapping common Protobuf wrappers and…

Open mwong38 opened this issue 3 years ago • 15 comments

… logical Timestamps, Date, TimeOfDay

Make sure you have checked all steps below.

Jira

  • [X] My PR addresses the following Parquet Jira issues and references them in the PR title. For example, "PARQUET-1234: My Parquet PR"
    • https://issues.apache.org/jira/browse/PARQUET-XXX
    • In case you are adding a dependency, check if the license complies with the ASF 3rd Party License Policy.

Tests

  • [X] My PR adds the following unit tests OR does not need testing for this extremely good reason:

Commits

  • [X] My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • [X] In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

mwong38 avatar May 03 '21 07:05 mwong38

@mwong38, can you put more information in the Jira on why/what is changed? This is pretty big change and it would help people to review your code.

shangxinli avatar Nov 12 '21 17:11 shangxinli

@shangxinli @mwong38 any way I can help with merge? Currently, proto-to-parquet conversions do not support proper timestamp handling, and it looks like this work addresses the issue.

Let me know.

sheinbergon avatar Mar 05 '22 12:03 sheinbergon

After proto3 made everything optional, there is no way to know whether a primitive has been set or not. That is, you could no longer represent a "nullable" primitive. (They later brought optional keyword back, but the damage is done). The solution was to wrap a primitive in a message. For example, to represent a nullable double, there is the DoubleValue. It's analogous to Java's boxed Double.

Parquet supports nullable primitives natively. If we were to represent these well-known Protobuf types directly, it will be nested inside a deeper data structure; very inconvenient and a waste of space. What I did here is "unwrap" these primitives and make use of Parquet's nullaibility.

Additionally, I convert other well-known types such as Timestamp and Date to Parquet's. Again, rather than represent the data structure in its raw nested form (seconds/nanos or year, month, day, etc), they are converted to Parquet's native or logical representation of Timestamp and Date. You can turn on or off this features in the ProtoSchemaConverter constructor.

mwong38 avatar Mar 06 '22 08:03 mwong38

@shangxinli is there any reason this work shouldn't be merged, given @mwong38 explanation? Is there something I can help with?

sheinbergon avatar Mar 06 '22 14:03 sheinbergon

I don't see a reason why this cannot be merge. I will have a look soon.

shangxinli avatar Mar 07 '22 16:03 shangxinli

Can you squash all the commits?

shangxinli avatar Mar 19 '22 20:03 shangxinli

@mwong38 let me know if you want me to help in any way

sheinbergon avatar Mar 21 '22 10:03 sheinbergon

I am not an expert here by any means, but I have made (very small) changes related to default values in protobuf parquet before and the goal of this change seem good. I assume that the behavior for unwrapped primitives remains unchanged?

@belugabehr might have thoughts too.

dossett avatar Mar 22 '22 13:03 dossett

I am not an expert here by any means, but I have made (very small) changes related to default values in protobuf parquet before and the goal of this change seem good. I assume that the behavior for unwrapped primitives remains unchanged?

@belugabehr might have thoughts too.

dossett avatar Mar 22 '22 13:03 dossett

@shangxinli any news about merging this version? Are there still any blockers?

sheinbergon avatar Mar 25 '22 13:03 sheinbergon

Will have another look soon.

shangxinli avatar Mar 25 '22 19:03 shangxinli

@mwong38 Can you address the feedback from @emkornfield before we can merge?

shangxinli avatar Apr 20 '22 15:04 shangxinli

@mwong38 anyway I can help with finalizing this PR?

sheinbergon avatar Jun 13 '22 07:06 sheinbergon

I think we are close to merge this PR. Resolve the conflict and use the imports , then we can merge.

shangxinli avatar Jul 24 '22 20:07 shangxinli

I think we are close to merge this PR. Resolve the conflict and use the imports , then we can merge.

Done

mwong38 avatar Sep 04 '22 11:09 mwong38

@shangxinli any reason why this PR hasn't been merged yet?

sheinbergon avatar Oct 17 '22 22:10 sheinbergon

any reason why this PR hasn't been merged yet?

zhaolihe avatar Oct 17 '23 10:10 zhaolihe

I just noticed this PR and sorry to see it does not check in. @mwong38 Could you try rebase it one last time? Thanks!

wgtmac avatar Oct 18 '23 09:10 wgtmac

I just noticed this PR and sorry to see it does not check in. @mwong38 Could you try rebase it one last time? Thanks!

Done. I really hope it's the last time.

mwong38 avatar Nov 26 '23 23:11 mwong38

Merged

shangxinli avatar Nov 27 '23 01:11 shangxinli