flink
flink copied to clipboard
[FLINK-33759] [flink-parquet] Add support for nested array with row/map/array type
What is the purpose of the change
This Pull request adds commits from https://github.com/apache/flink/pull/24029 and https://github.com/apache/flink/pull/23881 to fully resolve issue https://issues.apache.org/jira/browse/FLINK-33759
Brief change log
Adds support for ParquetWriter nested Array, Map and Row Types. Previously the writer just wrote nothing. Silently failing.
Verifying this change
Please make sure both new and modified tests in this PR follow the conventions for tests defined in our code quality guide.
This change added tests and can be verified as follows:
- Added test to parquet writer using compression with all nested possible types.
Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with
@Public(Evolving)
: no - The serializers: no
- The runtime per-record code paths (performance sensitive): don't think so, previously the writer would instantly return for these cases but it would write nothing.
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
- The S3 file system connector: no
Documentation
- Does this pull request introduce a new feature? no, it's a fix
- If yes, how is the feature documented? na
CI report:
- 2cdec63ad44a568b9d48ab90b3924501c1eed78c Azure: SUCCESS
Bot commands
The @flinkbot bot supports the following commands:-
@flinkbot run azure
re-run the last Azure build
Thanks! The PR looks good to me. Hi @JingsongLi, please also take a look. I think we can port this to fix apache/paimon#1730
Hey @xccui, thanks for the quick review! Just checking in to keep this PR alive. Is there anything I can do to help get this merged?
Do you have any hints about the background info?
Not really, I found out this issue because we were trying to save Parquet files to S3 and the writer would fail due to empty array being invalid in Parquet. When stepping in the debugger I realized that the methods were empty.
After digging a bit i found 2 PRs here that jointly solved the problem, but both inactive. I simply created this PR merging both of them to facilitate the review/merge process
Do you have any hints about the background info?
Not really, I found out this issue because we were trying to save Parquet files to S3 and the writer would fail due to empty array being invalid in Parquet. When stepping in the debugger I realized that the methods were empty.
After digging a bit i found 2 PRs here that jointly solved the problem, but both inactive. I simply created this PR merging both of them to facilitate the review/merge process
Also, as @xccui mentioned, there is a very similar issue/code with methods missing on apache-paimon, so maybe this code was copied around a bit without these methods
It looks like those methods were skipped on purpose in #17542 and https://github.com/apache/flink/pull/17542#issuecomment-1954552466
I see, Ive got the impression that they were forgotten, not purposely left out because no tests covered writing nested structures ( if there were, the tests would fail and the methods would have been implemented before)
Did you got the impression it wasn’t added on purpose? I could add more tests if you think it would help
I talked with @JingsongLi offline. It seems that at that time there was no such requirement and therefore skipped those implementation.
@ViktorCosenza would you please squash your commits and rebase it? Once it passed the CI, I will merge the PR. Thanks!
Do you have any hints about the background info?
Not really, I found out this issue because we were trying to save Parquet files to S3 and the writer would fail due to empty array being invalid in Parquet. When stepping in the debugger I realized that the methods were empty. After digging a bit i found 2 PRs here that jointly solved the problem, but both inactive. I simply created this PR merging both of them to facilitate the review/merge process
Also, as @xccui mentioned, there is a very similar issue/code with methods missing on apache-paimon, so maybe this code was copied around a bit without these methods
Thanks for the hint! Would you mind if I picked up some of your code to paimon?
@flinkbot run azure
@ViktorCosenza would you please squash your commits and rebase it? Once it passed the CI, I will merge the PR. Thanks!
Done! But I wasn't able to trigger the CI
Thanks for the hint! Would you mind if I picked up some of your code to paimon?
Sure, go ahead!
@JingGe Are you able to trigger the CI manually? I think I't wasnt triggered after the squash because no changes were detected.
@flinkbot run azure
@ViktorCosenza did you rebase?
@ViktorCosenza did you rebase?
Yes, i did an interactive rebase and then force-pushed the squashed commit
@JingGe CI Passed, I dindn't realize the bot just editted the same comment