[FLINK-33611] [flink-protobuf] Support Large Protobuf Schemas
What is the purpose of the change
Background
Flink serializes and deserializes protobuf format data by calling the decode or encode method in GeneratedProtoToRow_XXX.java generated by codegen to parse byte[] data into Protobuf Java objects. FLINK-32650 has introduced the ability to split the generated code to improve the performance for large Protobuf schemas. However, this is still not sufficient to support some larger protobuf schemas as the generated code exceeds the java constant pool size limit and we can see errors like "Too many constants" when trying to compile the generated code. We have run into these issues while doing a POC with some of our real production schemas at UBER
Solution
Split the last code segment also only when the size exceeds split threshold limit. Currently, the last segment of the generated code is always being split which can lead to too many split methods and thus exceed the constant pool size limit
Implementation
Add check for split size before splitting last segment
Brief change log
-
[FLINK-33611] [flink-protobuf] Add VeryBigProtobuf testcases -
[FLINK-33611] [flink-protobuf] Split last segment only when size exceeds split threshold limit in serializer -
[FLINK-33611] [flink-protobuf] Split last segment only when size exceeds split threshold limit in deserializer
Verifying this change
This change is already covered by existing tests, such as existing Unit Tests and Integration tests for Protobuf format Including the BigProtoBufCodeSpiltterTest. Additionally, I have added a new test with very big protobuf schemas which would fail without the optimizations made in this pull request
Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): No
- The public API, i.e., is any changed class annotated with
@Public(Evolving): No - The serializers: No
- The runtime per-record code paths (performance sensitive): No
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: No
- The S3 file system connector: No
Documentation
- Does this pull request introduce a new feature? Yes
- If yes, how is the feature documented? Not Applicable as the feature is fully transparent to users
CI report:
- cc57c9c6c55262cd4d6862cce5a1cb76151aba75 Azure: FAILURE
Bot commands
The @flinkbot bot supports the following commands:@flinkbot run azurere-run the last Azure build
@libenchao Can you please review this PR? I see you're one of the main reviewer for this module in the past
@sharath1709 Thanks for your contribution. Per the community's convention, we need to reach an agreement on the issue before we go to the PR review. I'll comment on the corresponding Jira, and let's get back here after we reach the agreement.
@libenchao Thanks for the quick response. Sure, let's try to reach an agreement on the JIRA first
@sharath1709 sorry for the delay, I haven't checked the code/design yet, but can you add a test case that fails without this improvement?
@flinkbot run azure
@flinkbot run azure
@flinkbot run azure
@libenchao Could you review this PR again, it is updated as per our discussion on the JIRA ticket and I also added a test case that fails without the changes in this PR. Also, the failing python test is unrelated to this diff and seems to be flaky.
@libenchao Gentle ping on the review