flink [FLINK-33611] [flink-protobuf] Support Large Protobuf Schemas

What is the purpose of the change

Background

Flink serializes and deserializes protobuf format data by calling the decode or encode method in GeneratedProtoToRow_XXX.java generated by codegen to parse byte[] data into Protobuf Java objects. FLINK-32650 has introduced the ability to split the generated code to improve the performance for large Protobuf schemas. However, this is still not sufficient to support some larger protobuf schemas as the generated code exceeds the java constant pool size limit and we can see errors like "Too many constants" when trying to compile the generated code. We have run into these issues while doing a POC with some of our real production schemas at UBER

Solution

Split the last code segment also only when the size exceeds split threshold limit. Currently, the last segment of the generated code is always being split which can lead to too many split methods and thus exceed the constant pool size limit

Implementation

Add check for split size before splitting last segment

Brief change log

[FLINK-33611] [flink-protobuf] Add VeryBigProtobuf testcases

[FLINK-33611] [flink-protobuf] Split last segment only when size exceeds split threshold limit in serializer

[FLINK-33611] [flink-protobuf] Split last segment only when size exceeds split threshold limit in deserializer

Verifying this change

This change is already covered by existing tests, such as existing Unit Tests and Integration tests for Protobuf format Including the BigProtoBufCodeSpiltterTest. Additionally, I have added a new test with very big protobuf schemas which would fail without the optimizations made in this pull request

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): No
The public API, i.e., is any changed class annotated with @Public(Evolving): No
The serializers: No
The runtime per-record code paths (performance sensitive): No
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: No
The S3 file system connector: No

Documentation

Does this pull request introduce a new feature? Yes
If yes, how is the feature documented? Not Applicable as the feature is fully transparent to users

Dec 15 '23 22:12 sharath1709

CI report:

cc57c9c6c55262cd4d6862cce5a1cb76151aba75 Azure: FAILURE

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure re-run the last Azure build

Dec 15 '23 22:12 flinkbot

@libenchao Can you please review this PR? I see you're one of the main reviewer for this module in the past

Dec 20 '23 23:12 sharath1709

@sharath1709 Thanks for your contribution. Per the community's convention, we need to reach an agreement on the issue before we go to the PR review. I'll comment on the corresponding Jira, and let's get back here after we reach the agreement.

Dec 21 '23 12:12 libenchao

@libenchao Thanks for the quick response. Sure, let's try to reach an agreement on the JIRA first

Dec 21 '23 19:12 sharath1709

@sharath1709 sorry for the delay, I haven't checked the code/design yet, but can you add a test case that fails without this improvement?

Jan 02 '24 11:01 libenchao

@flinkbot run azure

Jan 08 '24 22:01 sharath1709

@flinkbot run azure

Jan 09 '24 00:01 sharath1709

@flinkbot run azure

Jan 09 '24 21:01 sharath1709

@libenchao Could you review this PR again, it is updated as per our discussion on the JIRA ticket and I also added a test case that fails without the changes in this PR. Also, the failing python test is unrelated to this diff and seems to be flaky.

Jan 10 '24 21:01 sharath1709

@libenchao Gentle ping on the review

Jan 29 '24 20:01 sharath1709