flink icon indicating copy to clipboard operation
flink copied to clipboard

[FLINK-33817][flink-protobuf] Set ReadDefaultValues=False by default in proto3 for performance improvement

Open sharath1709 opened this issue 1 year ago • 2 comments

What is the purpose of the change

Background

The current Protobuf format implementation always sets ReadDefaultValues=False when using Proto3 version. This can cause severe performance degradation for large Protobuf schemas with OneOf fields as the entire generated code needs to be executed during deserialization even when certain fields are not present in the data to be deserialized and all the subsequent nested Fields can be skipped. Proto3 supports hasXXX() methods for checking field presence for non primitive types since Proto version 3.15. In the internal performance benchmarks in our company, we've seen almost 10x difference in performance for one of our real production usecase when allowing to set ReadDefaultValues=False with proto3 version. The exact difference in performance depends on the schema complexity and data payload but we should allow user to set readDefaultValue=False in general.

Solution

Set Default value for ReadDefaultValues=False for both proto3 and proto2 versions. Add documentation for users to set this value to true when using older protobuf versions before 3.15 and the performance implications. Additionally, We need to be careful to check for field presence only on non-primitive types if ReadDefaultValues is false and version used is Proto3.

Brief change log

  • [FLINK-33817][flink-protobuf] update protobuf docs about readDefaultValues
  • [FLINK-33817][flink-protobuf] Set ReadDefaultValues=False by default in Proto3 for performance improvement

Verifying this change

This change is already covered by existing unit tests and e2e tests for the Protobuf format for correctness. The performance improvement is benchmarked on testing clusters in my company.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? Both JavaDocs and flink docs are updated

sharath1709 avatar Jan 06 '24 00:01 sharath1709

CI report:

  • f4d21293152ca2b5335ddac82329f6be8407d162 Azure: FAILURE
Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

flinkbot avatar Jan 06 '24 00:01 flinkbot

@flinkbot run azure

sharath1709 avatar Jan 09 '24 21:01 sharath1709

I'd like to upvote this PR since these changes also help in my use case: at the moment I can't distinguish one message from another in OneOf field since there're a lot of default values. So the ability to filter fields with nulls is crucial.

I'm really looking forward to have this merged.

wckdman avatar Feb 01 '24 09:02 wckdman

@wckdman Thanks for the upvote. I'm looking forward to getting this PR merged as well after getting the approval from the reviewers

sharath1709 avatar Feb 06 '24 21:02 sharath1709

I just tested this on one of my clients' datasets which uses a lot of oneof fields and it fixed the same problem that wckdman mentioned.

nathantalewis avatar Feb 15 '24 15:02 nathantalewis

@libenchao @maosuhan Gentle ping on the PR review

sharath1709 avatar Feb 20 '24 19:02 sharath1709

@sharath1709 Thanks for the contribution. I'll also help to look into this PR also and I expect to finish it this week. cc @libenchao

maosuhan avatar Feb 22 '24 14:02 maosuhan

LGTM now

maosuhan avatar Feb 24 '24 11:02 maosuhan

What is the purpose of the change

Background

The current Protobuf format implementation always sets ReadDefaultValues=False when using Proto3 version.

@sharath1709 I think the description should be "ProtoToRowConverter always sets ReadDefaultValues=true when using Proto3 version. "

maosuhan avatar Feb 25 '24 04:02 maosuhan

What is the purpose of the change

Background The current Protobuf format implementation always sets ReadDefaultValues=False when using Proto3 version.

@sharath1709 I think the description should be "ProtoToRowConverter always sets ReadDefaultValues=true when using Proto3 version. "

Thanks, updated!

sharath1709 avatar Feb 25 '24 18:02 sharath1709

thanks for addressing comments, lgtm

snuyanzin avatar Feb 25 '24 22:02 snuyanzin

@maosuhan @snuyanzin Thanks for the review. Since you both have approved the PR, I'm merging now.

libenchao avatar Feb 26 '24 03:02 libenchao