
Kafka 4.x

Open razvan opened this issue 6 months ago • 5 comments

Which new version of Apache Kafka should we support?

Kafka 4 has been released. This is the first release that operates entirely without ZooKeeper and runs KRaft by default.

KRaft has been officially available since Kafka 3.9.

This means a new (Kafka) role must be introduced in the operator to replace the external ZooKeeper. Kafka 4 also introduces a new consumer group protocol (KIP-848) that is designed to improve rebalance performance and thereby reduce downtime and latency. The minimum Java versions were raised to 11 (clients and Kafka Streams) and 17 (brokers, Connect and tools).

Release notes: https://archive.apache.org/dist/kafka/4.0.0/RELEASE_NOTES.html
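
To illustrate what the new controller role has to provide in place of ZooKeeper, here is a minimal sketch of the KRaft settings for a dedicated controller with a static quorum. The configuration keys (`process.roles`, `node.id`, `controller.listener.names`, `controller.quorum.voters`) are Kafka's own; the StatefulSet name, service domain, port and node id numbering are hypothetical and not taken from the operator.

```python
from typing import List


def static_quorum_voters(
    statefulset: str,
    replicas: int,
    service_domain: str,
    port: int = 9093,
    first_node_id: int = 0,
) -> str:
    """Build a value for Kafka's `controller.quorum.voters` setting.

    The value is a comma-separated list of `id@host:port` entries.
    Pod host names are assumed to follow the StatefulSet pattern
    `<statefulset>-<ordinal>.<service_domain>` (an assumption of this sketch).
    """
    voters: List[str] = []
    for ordinal in range(replicas):
        # Node ids must be unique across controllers and brokers,
        # hence the configurable offset.
        node_id = first_node_id + ordinal
        host = f"{statefulset}-{ordinal}.{service_domain}"
        voters.append(f"{node_id}@{host}:{port}")
    return ",".join(voters)


def controller_properties(node_id: int, voters: str) -> str:
    """Render some of the KRaft settings for a dedicated controller node."""
    lines = [
        "process.roles=controller",
        f"node.id={node_id}",
        "controller.listener.names=CONTROLLER",
        f"controller.quorum.voters={voters}",
    ]
    return "\n".join(lines)
```

A three-node quorum could then be rendered with `controller_properties(0, static_quorum_voters("kafka-controller", 3, "kafka-controller.default.svc.cluster.local"))`. A dynamic quorum would be configured differently and is not covered by this sketch.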

Docker image

  • https://github.com/stackabletech/docker-images/pull/1117

Current Status

  • https://github.com/stackabletech/kafka-operator/issues/690
  • https://github.com/stackabletech/kafka-operator/issues/876
  • https://github.com/stackabletech/kafka-operator/issues/872

Next

  • [x] https://github.com/stackabletech/demos/issues/232
    • demo docs need to be rewritten to showcase the usage of Kafka but without kcat
  • [x] Replace kcat with Kafka client scripts
    • affects the kafka operator, tests and demo documentation.
    • also implement this https://github.com/stackabletech/issues/issues/768
  • [x] GracefulShutdown improvements: Currently a preStop sleep hook is used in the controller to give brokers more time to offload when the cluster is shut down. This is a beta feature until Kubernetes 1.34 and must be replaced, since we do not want to use beta features. We want to timebox this (4h): check whether e.g. autodetection of the Kubernetes version or an endpoint to query supported features is possible, and whether we switch from the preStop hook to a different implementation.
  • [x] Improve AntiAffinities controller / broker to ensure they are on different nodes?
    • @razvan: Currently the anti-affinity rules ensure that brokers are spread out as much as possible. Same for controllers. To also separate controllers from brokers, taints and tolerations are probably the better mechanism because they allow nodes to be provisioned accordingly. For example, broker nodes could require more resources than controllers.
  • [x] Liveness / Readiness (controller): Currently a TCP probe; improve it (e.g. check whether the controller has joined the quorum?)
    • @razvan: An alternative to the TCP probe would have to use either a lightweight process like kcat or an HTTP endpoint.
    • kcat doesn't support KRaft controllers
    • the Kafka REST proxy cannot be used because of license restrictions
  • [x] Improve PDBs for broker (currently 1) or controller (currently 1)?
    • @razvan: Leave as is for now.
  • ~3.7.2 no dynamic quorum (bad for scaling) https://developers.redhat.com/articles/2024/11/27/dynamic-kafka-controller-quorum; documented here, do we want to suppress / warn within the operator?~
  • ~Discovery (currently just host:port combinations exposed for brokers, no other connection details (TLS))~
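
For the GracefulShutdown item above, the version autodetection could be sketched roughly as follows. This assumes the `major` and `minor` strings returned by the Kubernetes `/version` endpoint are available, and it reads "beta until Kubernetes 1.34" as "stable from 1.34 onwards", which is an interpretation of the issue text. The function names are hypothetical.

```python
import re
from typing import Tuple

# Assumption taken from the issue text: the preStop sleep action is
# beta before Kubernetes 1.34 and stable from 1.34 onwards.
SLEEP_ACTION_STABLE_SINCE = (1, 34)


def _leading_int(value: str) -> int:
    """Return the leading digits of a version component as an int."""
    match = re.match(r"\d+", value)
    if match is None:
        raise ValueError(f"not a version component: {value!r}")
    return int(match.group())


def parse_server_version(major: str, minor: str) -> Tuple[int, int]:
    """Parse the `major`/`minor` strings of the Kubernetes /version endpoint.

    Some distributions append a suffix such as "+" to the minor version
    (for example "33+"), so only the leading digits are used.
    """
    return _leading_int(major), _leading_int(minor)


def can_use_prestop_sleep(major: str, minor: str) -> bool:
    """True if the preStop sleep action is no longer a beta feature."""
    return parse_server_version(major, minor) >= SLEEP_ACTION_STABLE_SINCE
```

The operator could fall back to a different shutdown mechanism whenever `can_use_prestop_sleep` returns `False`; which mechanism that would be is still open in this issue.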

Next 2

The following issues are only partially (or not at all) implemented and tested.

  • KRaft controller authorization (OPA)
  • KRaft controller down-scaling and/or shutdown
  • Kerberized KRaft controllers
  • ZooKeeper to KRaft migration (manual guide, half automated, fully automated)
    • https://strimzi.io/blog/2024/03/21/kraft-migration/
    • https://docs.confluent.io/platform/current/installation/migrate-zk-kraft.html
    • https://kafka.apache.org/documentation.html#upgrade
    • Try out a manual migration from ZooKeeper to KRaft so that we can answer this question

razvan, Jun 18 '25 09:06

Maxi and Malte to look at this - @razvan to dig up previous notes. Requires refinement first to decide the scope of what/how we want to support Kafka 4.

lfrancke, Jun 24 '25 07:06

Previous notes: https://github.com/stackabletech/kafka-operator/issues/690

razvan, Jun 24 '25 07:06

@maltesander will refine the next steps (maybe pull in others)

sbernauer, Oct 06 '25 07:10

We park it for now until @lfrancke talks to @razvan and/or @maltesander to decide on the next steps

sbernauer, Oct 29 '25 08:10

We will create a new issue for things that are still open for Kafka 4 and close this afterwards

sbernauer, Nov 10 '25 08:11