Kafka 4.x
Which new version of Apache Kafka should we support?
Kafka 4 has been released. It is the first release that operates entirely without ZooKeeper and runs KRaft by default.
KRaft has been officially available since Kafka 3.9.
This means a new Kafka role (the KRaft controller) must be introduced to replace the external ZooKeeper. A new consumer group protocol (KIP-848) is introduced, designed to improve rebalance performance and reduce downtime and latency. The minimum Java versions were raised to 11 (clients and Kafka Streams) and 17 (brokers, Connect, and tools).
Release notes: https://archive.apache.org/dist/kafka/4.0.0/RELEASE_NOTES.html
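To make the new controller role more concrete, here is a minimal sketch of the KRaft-related `server.properties` entries that distinguish a controller from a broker. It assumes a static quorum configured via `controller.quorum.voters`; the host names, ports, node ids and the listener name `CONTROLLER` are made up for illustration and are not what the operator actually generates.

```python
# Sketch only: renders the KRaft-related part of a server.properties file.
# Host names, ports and node ids below are hypothetical placeholders.

ALLOWED_ROLES = {"broker", "controller"}


def kraft_properties(node_id, roles, quorum_voters):
    """Return the minimal KRaft settings for one node as a dict.

    quorum_voters is a list of (node_id, host, port) tuples describing
    the controller quorum (static quorum assumed).
    """
    if not roles or not set(roles) <= ALLOWED_ROLES:
        raise ValueError(f"roles must be a non-empty subset of {sorted(ALLOWED_ROLES)}")
    voters = ",".join(f"{vid}@{host}:{port}" for vid, host, port in quorum_voters)
    return {
        "node.id": str(node_id),
        "process.roles": ",".join(roles),
        "controller.quorum.voters": voters,
        # Listener name is an assumption; it must match a configured listener.
        "controller.listener.names": "CONTROLLER",
    }


def render(props):
    """Render a properties dict in key=value form, one entry per line."""
    return "\n".join(f"{key}={value}" for key, value in props.items())


if __name__ == "__main__":
    quorum = [(1, "controller-0", 9093), (2, "controller-1", 9093), (3, "controller-2", 9093)]
    print(render(kraft_properties(1, ["controller"], quorum)))
    print()
    print(render(kraft_properties(100, ["broker"], quorum)))
```

Brokers and controllers share the same quorum setting and differ only in `process.roles` and `node.id`, which is why a separate role with its own id range is enough to replace the external ZooKeeper ensemble.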
Docker image
- https://github.com/stackabletech/docker-images/pull/1117
Current Status
- https://github.com/stackabletech/kafka-operator/issues/690
- https://github.com/stackabletech/kafka-operator/issues/876
- https://github.com/stackabletech/kafka-operator/issues/872
Next
- [x] https://github.com/stackabletech/demos/issues/232
  - demo docs need to be rewritten to showcase the usage of Kafka but without `kcat`
- [x] Replace `kcat` with Kafka client scripts
  - affects the Kafka operator, tests and demo documentation.
  - also implement this https://github.com/stackabletech/issues/issues/768
- [x] GracefulShutdown improvements: Currently a preStop sleep hook is used in the controller to give brokers more time to offload when the cluster shuts down. This is a beta feature until Kubernetes 1.34 and must be replaced, since we do not want to use beta features. We want to timebox this (4h): check whether e.g. autodetecting the Kubernetes version (or an endpoint to query features) is possible, and whether we can switch from the preStop hook to a different implementation.
- [x] Improve `AntiAffinities` controller / broker to ensure they are on different nodes?
  - @razvan: Currently the anti-affinity rules ensure that brokers are spread out as much as possible. Same for controllers. To also separate controllers from brokers, taints and tolerations are probably the better mechanism because they allow nodes to be provisioned accordingly. For example, broker nodes could require more resources than controllers.
- [x] Liveness / Readiness (controller): Currently a TCP probe; improve it (e.g. check whether the controller has joined the quorum?)
  - @razvan: An alternative to the TCP probe would have to use either a lightweight process like `kcat` or an HTTP endpoint.
    - `kcat` doesn't support KRaft controllers
    - the Kafka REST proxy cannot be used because of the license restrictions
- [x] Improve `PDBs` for broker (currently 1) or controller (currently 1)?
  - @razvan: Leave as is for now.
- ~3.7.2 no dynamic quorum (bad for scaling) https://developers.redhat.com/articles/2024/11/27/dynamic-kafka-controller-quorum; documented here, do we want to suppress / warn within the operator?~
- ~Discovery (currently just `host:port` combinations exposed for brokers, no other connection details (TLS))~
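For the GracefulShutdown item above, the version autodetection could look roughly like the sketch below. It assumes the operator can read the `gitVersion` string that the Kubernetes API server reports on its `/version` endpoint, and it reads "beta until Kubernetes 1.34" as "stable from 1.34 onwards". The function names and the constant are hypothetical and not part of the operator.

```python
import re

# Assumption taken from the note above: the preStop sleep action is
# considered stable starting with Kubernetes 1.34.
PRESTOP_SLEEP_STABLE_SINCE = (1, 34)


def parse_k8s_version(git_version):
    """Extract (major, minor) from a gitVersion string such as 'v1.33.2+k3s1'."""
    match = re.match(r"v?(\d+)\.(\d+)", git_version.strip())
    if match is None:
        raise ValueError(f"unrecognised Kubernetes version: {git_version!r}")
    return int(match.group(1)), int(match.group(2))


def prestop_sleep_is_stable(git_version):
    """Return True if the cluster version is at or above the stable threshold."""
    return parse_k8s_version(git_version) >= PRESTOP_SLEEP_STABLE_SINCE


if __name__ == "__main__":
    for version in ("v1.32.4", "v1.33.2+k3s1", "v1.34.0"):
        choice = "preStop sleep" if prestop_sleep_is_stable(version) else "fallback"
        print(f"{version}: {choice}")
```

Comparing `(major, minor)` tuples rather than raw strings avoids ordering mistakes such as treating `1.9` as newer than `1.34`, and ignores vendor suffixes like `+k3s1` or `-gke.100`.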
Next 2
The following issues are only partially (or not at all) implemented and tested.
- KRaft controller authorization (OPA)
- KRaft controller down-scaling and/or shutdown
- Kerberized KRaft controllers
- ZooKeeper - KRaft migration (manual guide, half automated, full automated)
  - https://strimzi.io/blog/2024/03/21/kraft-migration/
  - https://docs.confluent.io/platform/current/installation/migrate-zk-kraft.html
  - https://kafka.apache.org/documentation.html#upgrade
  - Try out a manual migration from ZooKeeper to KRaft so that we have an answer to it
Maxi and Malte to look at this - @razvan to dig up previous notes. Requires refinement first to decide the scope of what/how we want to support Kafka 4.
Previous notes: https://github.com/stackabletech/kafka-operator/issues/690
@maltesander will refine the next steps (maybe pull in others)
We park it for now until @lfrancke talks to @razvan and/or @maltesander to decide on the next steps
We will create a new issue for things that are still open for Kafka 4 and close this afterwards