flink icon indicating copy to clipboard operation
flink copied to clipboard

[FLINK-33759] [flink-parquet] Add support for nested array with row/map/array type

Open ViktorCosenza opened this issue 9 months ago • 1 comments

What is the purpose of the change

This Pull request adds commits from https://github.com/apache/flink/pull/24029 and https://github.com/apache/flink/pull/23881 to fully resolve issue https://issues.apache.org/jira/browse/FLINK-33759

Brief change log

Adds support for ParquetWriter nested Array, Map and Row Types. Previously the writer just wrote nothing. Silently failing.

Verifying this change

Please make sure both new and modified tests in this PR follow the conventions for tests defined in our code quality guide.

This change added tests and can be verified as follows:

  • Added test to parquet writer using compression with all nested possible types.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): don't think so, previously the writer would instantly return for these cases but it would write nothing.
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no, it's a fix
  • If yes, how is the feature documented? na

ViktorCosenza avatar May 15 '24 22:05 ViktorCosenza

CI report:

  • 2cdec63ad44a568b9d48ab90b3924501c1eed78c Azure: SUCCESS
Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

flinkbot avatar May 15 '24 22:05 flinkbot

Thanks! The PR looks good to me. Hi @JingsongLi, please also take a look. I think we can port this to fix apache/paimon#1730

Hey @xccui, thanks for the quick review! Just checking in to keep this PR alive. Is there anything I can do to help get this merged?

ViktorCosenza avatar May 22 '24 14:05 ViktorCosenza

Do you have any hints about the background info?

Not really, I found out this issue because we were trying to save Parquet files to S3 and the writer would fail due to empty array being invalid in Parquet. When stepping in the debugger I realized that the methods were empty.

After digging a bit i found 2 PRs here that jointly solved the problem, but both inactive. I simply created this PR merging both of them to facilitate the review/merge process

ViktorCosenza avatar May 22 '24 20:05 ViktorCosenza

Do you have any hints about the background info?

Not really, I found out this issue because we were trying to save Parquet files to S3 and the writer would fail due to empty array being invalid in Parquet. When stepping in the debugger I realized that the methods were empty.

After digging a bit i found 2 PRs here that jointly solved the problem, but both inactive. I simply created this PR merging both of them to facilitate the review/merge process

Also, as @xccui mentioned, there is a very similar issue/code with methods missing on apache-paimon, so maybe this code was copied around a bit without these methods

ViktorCosenza avatar May 22 '24 20:05 ViktorCosenza

It looks like those methods were skipped on purpose in #17542 and https://github.com/apache/flink/pull/17542#issuecomment-1954552466

JingGe avatar May 22 '24 20:05 JingGe

I see, Ive got the impression that they were forgotten, not purposely left out because no tests covered writing nested structures ( if there were, the tests would fail and the methods would have been implemented before)

Did you got the impression it wasn’t added on purpose? I could add more tests if you think it would help

ViktorCosenza avatar May 22 '24 20:05 ViktorCosenza

I talked with @JingsongLi offline. It seems that at that time there was no such requirement and therefore skipped those implementation.

JingGe avatar May 23 '24 07:05 JingGe

@ViktorCosenza would you please squash your commits and rebase it? Once it passed the CI, I will merge the PR. Thanks!

JingGe avatar May 23 '24 07:05 JingGe

Do you have any hints about the background info?

Not really, I found out this issue because we were trying to save Parquet files to S3 and the writer would fail due to empty array being invalid in Parquet. When stepping in the debugger I realized that the methods were empty. After digging a bit i found 2 PRs here that jointly solved the problem, but both inactive. I simply created this PR merging both of them to facilitate the review/merge process

Also, as @xccui mentioned, there is a very similar issue/code with methods missing on apache-paimon, so maybe this code was copied around a bit without these methods

Thanks for the hint! Would you mind if I picked up some of your code to paimon?

JingGe avatar May 23 '24 07:05 JingGe

@flinkbot run azure

ViktorCosenza avatar May 23 '24 12:05 ViktorCosenza

@ViktorCosenza would you please squash your commits and rebase it? Once it passed the CI, I will merge the PR. Thanks!

Done! But I wasn't able to trigger the CI

ViktorCosenza avatar May 23 '24 12:05 ViktorCosenza

Thanks for the hint! Would you mind if I picked up some of your code to paimon?

Sure, go ahead!

ViktorCosenza avatar May 23 '24 12:05 ViktorCosenza

@JingGe Are you able to trigger the CI manually? I think I't wasnt triggered after the squash because no changes were detected.

ViktorCosenza avatar May 24 '24 12:05 ViktorCosenza

@flinkbot run azure

JingGe avatar May 24 '24 17:05 JingGe

@ViktorCosenza did you rebase?

JingGe avatar May 24 '24 17:05 JingGe

@ViktorCosenza did you rebase?

Yes, i did an interactive rebase and then force-pushed the squashed commit

ViktorCosenza avatar May 24 '24 19:05 ViktorCosenza

@JingGe CI Passed, I dindn't realize the bot just editted the same comment

ViktorCosenza avatar May 27 '24 12:05 ViktorCosenza