flink
flink copied to clipboard
[FLINK-33817][flink-protobuf] Set ReadDefaultValues=False by default in proto3 for performance improvement
What is the purpose of the change
Background
The current Protobuf format implementation always sets ReadDefaultValues=False when using Proto3 version. This can cause severe performance degradation for large Protobuf schemas with OneOf fields as the entire generated code needs to be executed during deserialization even when certain fields are not present in the data to be deserialized and all the subsequent nested Fields can be skipped. Proto3 supports hasXXX() methods for checking field presence for non primitive types since Proto version 3.15. In the internal performance benchmarks in our company, we've seen almost 10x difference in performance for one of our real production usecase when allowing to set ReadDefaultValues=False with proto3 version. The exact difference in performance depends on the schema complexity and data payload but we should allow user to set readDefaultValue=False in general.
Solution
Set Default value for ReadDefaultValues=False for both proto3 and proto2 versions. Add documentation for users to set this value to true when using older protobuf versions before 3.15 and the performance implications. Additionally, We need to be careful to check for field presence only on non-primitive types if ReadDefaultValues is false and version used is Proto3.
Brief change log
- [FLINK-33817][flink-protobuf] update protobuf docs about readDefaultValues
- [FLINK-33817][flink-protobuf] Set ReadDefaultValues=False by default in Proto3 for performance improvement
Verifying this change
This change is already covered by existing unit tests and e2e tests for the Protobuf format for correctness. The performance improvement is benchmarked on testing clusters in my company.
Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with
@Public(Evolving)
: no - The serializers: no
- The runtime per-record code paths (performance sensitive): no
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
- The S3 file system connector: no
Documentation
- Does this pull request introduce a new feature? no
- If yes, how is the feature documented? Both JavaDocs and flink docs are updated
CI report:
- f4d21293152ca2b5335ddac82329f6be8407d162 Azure: FAILURE
Bot commands
The @flinkbot bot supports the following commands:-
@flinkbot run azure
re-run the last Azure build
@flinkbot run azure
I'd like to upvote this PR since these changes also help in my use case: at the moment I can't distinguish one message from another in OneOf
field since there're a lot of default values. So the ability to filter fields with nulls is crucial.
I'm really looking forward to have this merged.
@wckdman Thanks for the upvote. I'm looking forward to getting this PR merged as well after getting the approval from the reviewers
I just tested this on one of my clients' datasets which uses a lot of oneof
fields and it fixed the same problem that wckdman mentioned.
@libenchao @maosuhan Gentle ping on the PR review
@sharath1709 Thanks for the contribution. I'll also help to look into this PR also and I expect to finish it this week. cc @libenchao
LGTM now
What is the purpose of the change
Background
The current Protobuf format implementation always sets ReadDefaultValues=False when using Proto3 version.
@sharath1709 I think the description should be "ProtoToRowConverter always sets ReadDefaultValues=true when using Proto3 version. "
What is the purpose of the change
Background The current Protobuf format implementation always sets ReadDefaultValues=False when using Proto3 version.
@sharath1709 I think the description should be "ProtoToRowConverter always sets ReadDefaultValues=true when using Proto3 version. "
Thanks, updated!
thanks for addressing comments, lgtm
@maosuhan @snuyanzin Thanks for the review. Since you both have approved the PR, I'm merging now.