flink icon indicating copy to clipboard operation
flink copied to clipboard

[FLINK-3154][runtime] Upgrade from Kryo v2 + Chill 0.7.6 to Kryo v5 w…

Open kurtostfeld opened this issue 2 years ago • 6 comments

…ith backward compatibility for existing savepoints and checkpoints.

What is the purpose of the change

To upgrade the primary Kryo library used by Flink from v2.x to v5.x, while providing backwards compatibility with existing savepoints and checkpoints. This PR adds a new Kryo v5 dependency that is namespaced so that it can coexist with the legacy dependencies that would be kept for compatibility purposes. Flink also depends on the Twitter chill (Scala) and chill-java libraries for additions + enhancements to Kryo v2.x. This would also be deprecated as most functionality is included in Kryo v5.x. Some future version of Flink could eventually drop Kryo v2 and the Twitter chill dependencies when backwards compatibility with Kryo v2 based state is no longer needed.

Why upgrade Kryo? One reason is support for Java 17 and 21. The existing Kryo 2.x is not compatible with Java 17/21, while Kryo 5.x is. When running in a JDK 17 runtime, I notice that ArraysAsListSerializer from chill-java fails under Java 17. Fixes can be back ported for that relatively easily, but there are more issues. Kryo 2.x doesn't support Java records at all.

A more broader reason is Flink should be using the new, actively maintained version of Kryo rather than the 10+ year old 2.x branch that stopped getting any updates almost ten years ago. Kryo v2.x was released before the release of Java 8. So it's quite old. Kryo is a maintained project, there have been lots of improvements over the past ten years. Kryo 5.x has faster runtime performance, more memory efficient serialization, fixed lots of bugs, added functionality, and improved compatibility with newer versions of Java. Kryo 5.x will also get more improvements in the future that will be fully compatible with existing Kryo 5.x serialized data.

Brief change log

This is a large PR with a lot of surface area and risk. I tried to keep the scope of these changes as narrow and simple as possible. I copied all the Kryo v2 code to Kryo v5 equivalents and made necessary adjustments to get everything working. Some highlights as to what was done:

All existing serialization class names and package names are unmodified

Some Flink serialization code references serialization classes by full package name, so the package and class names of all existing serialization classes are unmodified.

Added new version 7 to KeyedBackendSerializationProxy

In production Flink, version 6 is the current version. This PR adds version 7. The only difference between 6 and 7 is the Kryo upgrade. With version 6 serialized data, all Kryo state is Kryo 2.x. With version 7 serialized data, all Kryo state is Kryo 5.x

Added deserializeWithKeyedBackendVersion to TypeSerializer<T>

By default, this just calls the regular deserialize method. The Kryo 5.x version of Flink KryoSerializer will check the version number. For versions older than 7, this will call the Kryo 2.x version of the Flink KryoSerializer class.

Kryo 2.x Code Is not used outside of backwards compatibility scenarios

New serialized state will not be written with Kryo 2.x code. Some unit tests are still using the Kryo 2.x code, but the main code base is only using Kryo 2.x code for reading legacy state.

Verifying this change

This is passing the full test suite of automated tests in the CI system which covers lots of backwards compatibility scenarios.

Additionally, I wrote a Flink application to do a more thorough test of the Kryo upgrade that was difficult to convert into unit test form.

https://github.com/kurtostfeld/flink-kryo-upgrade-demo

If the Flink project is seriously considering accepting this PR, I plan to write more test scenarios for thorough backwards compatibility.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): yes. This adds a new Kryo v5 dependency and keeps legacy dependencies for backwards compatibility purposes.
  • The public API, i.e., is any changed class annotated with @Public(Evolving): yes. Kryo v2 APIs are deprecated. Parallel Kryo v5 APIs are created with PublicEvolving
  • The serializers: yes. This absolutely affects the serializers.
  • The runtime per-record code paths (performance sensitive): yes. This should be faster but I haven't done any benchmark testing.

kurtostfeld avatar May 25 '23 16:05 kurtostfeld

CI report:

  • dd97bd57698c816840915b83a49889722561b273 Azure: SUCCESS
Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

flinkbot avatar May 25 '23 17:05 flinkbot

I appreciate the work, but really this change has to go through a proper design process.

There are open questions as to whether we really want to expose Kryo in the same way we did before (because the presence of Kryo serializers in the execution config is a problem in general).

Please create a FLIP and publish it on the dev mailing list.

zentol avatar May 26 '23 10:05 zentol

@zentol ok, I created a Confluence login with username "kurto", however I don't seem to have permissions to create a new FLIP. I don't see a "Create" button as the docs say:

To create your own FLIP, click on "Create" on the header and choose "FLIP-Template" other than "Blank page".

Could I get permissions to do this? Thank you :)

kurtostfeld avatar May 26 '23 17:05 kurtostfeld

Could I get permissions to do this? Thank you :)

You should have permissions by now

MartijnVisser avatar May 30 '23 07:05 MartijnVisser

FYI, there is a pull request in the works to upgrade Twitter Chill to use Kryo 5.5.0. I don't think this pull request is blocked by the Chill pull request, but might be relevant, so I'm sharing the link. https://github.com/twitter/chill/pull/747

nicknezis avatar Jul 17 '23 23:07 nicknezis

I added my comment on the Jira ticket, but also posting it here for better visibility.

I wanted to revisit the status of this ticket. There was some discussion in the mailing list about merging this into Flink 2.0. If the master branch now is targeting 2.0, maybe we can finally move forward with this FLIP?

nicknezis avatar Aug 26 '24 14:08 nicknezis