
FLINK-35641 ParquetSchemaConverter supports required fields

Open Gerrrr opened this issue 1 year ago • 2 comments

What is the purpose of the change

The purpose of this change is to fix 2 scenarios where Flink produces incorrect Parquet files.

Scenario 1: Convert nullable Flink types to optional Parquet types and non-nullable Flink types to required ones. Right now, Flink configures all Parquet types as optional, regardless of whether the Flink types are nullable.

Scenario 2: Ensure that the converter does not create invalid Parquet files with optional map keys. According to the Parquet standard, map keys are required.
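
The rule behind Scenario 1 can be sketched in a few lines. The snippet below is a toy model in Python, not the converter's actual code: `Field`, `repetition`, and `to_parquet_message` are hypothetical names introduced for illustration, and only primitive fields are handled.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Field:
    """Toy stand-in for a Flink field (hypothetical, not a Flink class)."""
    name: str
    parquet_type: str  # Parquet primitive type name, e.g. "int64" or "binary"
    nullable: bool = True


def repetition(field: Field) -> str:
    """Nullable fields map to optional, non-nullable fields to required."""
    return "optional" if field.nullable else "required"


def to_parquet_message(name: str, fields: Sequence[Field]) -> str:
    """Render the fields as a Parquet message-type string."""
    body = "\n".join(
        f"  {repetition(f)} {f.parquet_type} {f.name};" for f in fields
    )
    return f"message {name} {{\n{body}\n}}"


schema = to_parquet_message(
    "flink_schema",
    [Field("id", "int64", nullable=False), Field("note", "binary")],
)
print(schema)
```

Under the old behavior described above, both fields would have been written as optional; with the rule in this sketch, `id` becomes `required` and `note` stays `optional`.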

Brief change log

  • Configure non-nullable Flink types as required Parquet types.
  • Ensure that Flink map key type is non-nullable / required.
  • Ensure that Flink multiset element type is non-nullable / required.
  • Adjust existing tests and add new ones to cover the change in behavior.
  • Mention the nullable key limitation in Parquet format docs.
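
The map-key and multiset checks listed above could look roughly like the following. This is an illustrative Python sketch with hypothetical names (`LogicalType`, `MapType`, `MultisetType`, `validate`), not Flink's API; the exception type and messages are assumptions as well. The multiset case is treated like a map key here on the assumption that a multiset is written to Parquet as a map from element to count.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class LogicalType:
    """Toy stand-in for a Flink logical type (hypothetical)."""
    name: str
    nullable: bool = True


@dataclass(frozen=True)
class MapType:
    key: LogicalType
    value: LogicalType


@dataclass(frozen=True)
class MultisetType:
    element: LogicalType


def validate(t: Union[MapType, MultisetType]) -> None:
    """Reject types that would need an optional Parquet map key."""
    if isinstance(t, MapType) and t.key.nullable:
        raise ValueError(
            f"Parquet map keys are required, but key type {t.key.name} is nullable"
        )
    if isinstance(t, MultisetType) and t.element.nullable:
        raise ValueError(
            f"Multiset element type {t.element.name} must be non-nullable"
        )


# A non-nullable key passes; the value may still be nullable.
validate(MapType(LogicalType("STRING", nullable=False), LogicalType("INT")))

# A nullable key is rejected before any file is written.
try:
    validate(MapType(LogicalType("STRING"), LogicalType("INT")))
except ValueError as err:
    print(err)
```

Failing early in the schema conversion, rather than writing a file with an optional key, keeps the invalid Parquet output from being produced at all.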

Verifying this change

  1. Adjusted existing tests.
  2. Added new test cases to cover invalid types.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: yes? This change affects the behavior of the file system connector when it writes Parquet files.

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? Mentioned the unsupported nullable map keys in docs

Gerrrr, Jun 19 '24 01:06

CI report:

  • 80b0bec76ee06bf8e48227d3a9f6c71ac0d3a8e6 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot run azure: re-run the last Azure build

flinkbot, Jun 19 '24 02:06

Test failures are caused by this patch - https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=60364&view=logs&j=2e8cb2f7-b2d3-5c62-9c05-cd756d33a819&t=2dd510a3-5041-5201-6dc3-54d310f68906. I will adjust the test cases on Thursday.

Gerrrr, Jun 19 '24 16:06