flink icon indicating copy to clipboard operation
flink copied to clipboard

[FLINK-35555] Make ListSerializer allow null values

Open X-czh opened this issue 1 year ago • 2 comments

What is the purpose of the change

Make ListSerializer allow null values. Background: see FLINK-35555.

Brief change log

  • Add a binary mask for marking null values in ListSerializer.
  • Update ListSerializerSnapshot to deal with state-compatibility.

Verifying this change

  • Add tests on lists with null values in ListSerializerTest.
  • Updated ListSerializerUpgradeTest to test state-compatibility.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

X-czh avatar Jun 07 '24 15:06 X-czh

CI report:

  • 0aadea592267fe33db71bb22912ea86bb43279a8 Azure: FAILURE
Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

flinkbot avatar Jun 07 '24 15:06 flinkbot

@flinkbot run azure

X-czh avatar Jun 08 '24 03:06 X-czh

Hi @X-czh , thanks for this PR. It's overall LGTM, but I noticed some relating issues to this topic (built-in collection serializer) and this PR:

  1. MapSerializer does not support null key, which means it needs an enhancement like this as well.

  2. List, map and set serializers are all not support that the collection itself is null. This can be troublesome since some internal serializers of container types are not supporting null content, among which the most commonly used is StreamElementSerializer. To use these serializers as defaults safely, we have to: i) make these serializers support null collections, or ii) check all built-in serializers to ensure they properly handle null elements without invoking the value serializer.

What do you think about these issues?

pltbkd avatar Dec 25 '24 08:12 pltbkd

@pltbkd Thanks for the insightful comment!

  1. MapSerializer does not support null key, which means it needs an enhancement like this as well.

You are right, we'll need to extend MapSerializer as well.

  1. List, map and set serializers are all not support that the collection itself is null.

True. It's surprising that many other serializers do not support themselves being null as well. In production, this is not a major problem since user will package all fields into a POJO class in most cases, and the PoJo serializer handles null fields. But you are right, to be safe, we'd better deal with these cases properly for collection types. It might be easier to do so by introducing a set of new nullable typeinfos that accepts null values whose serializer can be easily constructed by wrapping the original Map/List/SetSerializer with NullableSerializer. WDYT?

X-czh avatar Jan 03 '25 15:01 X-czh

After offline discussion with @reswqa, we think a better way is to create a set of new nullable TypeInformations whose serializer is a simple wrap of the existing serializers with NullableSerializer to support values.

Rationale: the current serializers are already adopted by Flink internal state API where null collection and null elements are explicitly not supported. Extending them to support null values will add extra null masks, leading to extra serialization and storage cost of Flink state. Extra state migration code must be added as well.

I'll close this MR, and prepare a new one with the new approach.

X-czh avatar Jan 06 '25 09:01 X-czh