arrow icon indicating copy to clipboard operation
arrow copied to clipboard

GH-41764: [Parquet][C++] Support future logical types in the Parquet reader

Open paleolimbot opened this issue 1 year ago • 2 comments

Rationale for this change

If new logical types are added to Parquet, how should they be handled? (Or is it easy enough to just support them as soon as they are added to the spec?)

What changes are included in this PR?

Experimental ones!

Are these changes tested?

Absolutely not!

Are there any user-facing changes?

Not yet!

  • GitHub Issue: #41764

paleolimbot avatar May 21 '24 20:05 paleolimbot

:warning: GitHub issue #41764 has been automatically assigned in GitHub to PR creator.

github-actions[bot] avatar May 21 '24 20:05 github-actions[bot]

Just ping me if this is ready

mapleFU avatar Jul 03 '24 15:07 mapleFU

Just ping me if this is ready

Thanks! I think the missing piece is an actual end-to-end test with a real Parquet file (and I haven't had time to circle back to the suggestion of hacking the generated thrift for an actual written-to-disk Parquet file).

paleolimbot avatar Jul 04 '24 07:07 paleolimbot

I'm just noticing this PR. Is there a reason to make this behavior optional instead of standard?

pitrou avatar Nov 18 '24 15:11 pitrou

I'm just noticing this PR. Is there a reason to make this behavior optional instead of standard?

I don't have an opinion here, although I think that the arrow.opaque (vendor: Parquet, type: Unknown Logical type with ID XXX) extension type makes it easier to make the case for having this be the default. Without that, there is a possibly good argument for erroring to avoid silently discarding logical type information.

I ran out of time figuring out how to generate a Parquet file I could use to test the Python bindings here, but I'm happy to resurrect this PR and give it another shot if there is interest.

paleolimbot avatar Nov 18 '24 17:11 paleolimbot

Apologies for taking a while to circle back here...at the time I started this PR I didn't have a great grasp on how all the pieces of the Parquet testing went together and it all floated back in today. I made https://github.com/apache/parquet-testing/pull/72 to ensure we have a file to test against, have tested it against these changes, and I can add update the submodule + add tests here if this/the PR into parquet-testing is something the community is interested in.

Given that there are three new types forthcoming in Parquet (and I'm not sure the Geometry/Geography implementation will make it in before the April 1 feature freeze), I think this would be great to get in to the next version if there's review bandwidth!

Is there a reason to make this behavior optional instead of standard?

I'm happy to do either...I personally think that making the "ignore unknown logical types" the default behaviour would be best.

paleolimbot avatar Mar 19 '25 22:03 paleolimbot

I'm just noticing this PR. Is there a reason to make this behavior optional instead of standard?

Before proceeding, I think we need to agree on this.

wgtmac avatar Mar 26 '25 14:03 wgtmac

I just rebased to resolve the submodule conflict. If there are no objections, I'll merge later today!

paleolimbot avatar Mar 31 '25 15:03 paleolimbot

Well, we haven't decided if we wanted to make this the default. I'm not sure it's worth maintaining an option for this.

pitrou avatar Mar 31 '25 15:03 pitrou

Great point! I don't mind either way but I do think this is important to merge some version of this before the feature freeze given the forthcoming type additions. Keeping the option is more explicit and the error message hints at how to make it go away but has maintenance cost, as you noted.

Scanning the comments, it seems like there is more enthusiasm for having the "read as physical type" to be the default benavhiour and no objection to the contrary. I'll make that change later today unless there's some objection!

paleolimbot avatar Mar 31 '25 16:03 paleolimbot

Just a nit, otherwise looks good; thank you @paleolimbot !

pitrou avatar Apr 01 '25 16:04 pitrou

Thank you all for the reviews here!

paleolimbot avatar Apr 01 '25 17:04 paleolimbot

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 2ba7f13df77b86f000e1afd9645c73060e7f511b.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 18 possible false positives for unstable benchmarks that are known to sometimes produce them.