
ScaleUp and ScaleDown are not working with KRaft mode

Status: Open · Frawless opened this issue on May 25, 2022 · 0 comments

Describe the bug

Scaling Kafka up from 3 pods to 4 fails with the following error:

STRIMZI_BROKER_ID=0
Preparing truststore for replication listener
Adding /opt/kafka/cluster-ca-certs/ca.crt to truststore /tmp/kafka/cluster.truststore.p12 with alias ca
Certificate was added to keystore
Preparing truststore for replication listener is complete
Looking for the right CA
Found the right CA: /opt/kafka/cluster-ca-certs/ca.crt
Preparing keystore for replication and clienttls listener
Preparing keystore for replication and clienttls listener is complete
Preparing truststore for client authentication
Adding /opt/kafka/client-ca-certs/ca.crt to truststore /tmp/kafka/clients.truststore.p12 with alias ca
Certificate was added to keystore
Preparing truststore for client authentication is complete
Starting Kafka with configuration:
##############################
##############################
# This file is automatically generated by the Strimzi Cluster Operator
# Any changes to this file will be ignored and overwritten!
##############################
##############################

##########
# Broker ID
##########
broker.id=0
node.id=0

##########
# KRaft configuration
##########
process.roles=broker,controller
controller.listener.names=CONTROLPLANE-9090
controller.quorum.voters=0@my-cluster-5261ed90-kafka-0.my-cluster-5261ed90-kafka-brokers.namespace-0.svc.cluster.local:9090,1@my-cluster-5261ed90-kafka-1.my-cluster-5261ed90-kafka-brokers.namespace-0.svc.cluster.local:9090,2@my-cluster-5261ed90-kafka-2.my-cluster-5261ed90-kafka-brokers.namespace-0.svc.cluster.local:9090,3@my-cluster-5261ed90-kafka-3.my-cluster-5261ed90-kafka-brokers.namespace-0.svc.cluster.local:9090

##########
# Kafka message logs configuration
##########
log.dirs=/var/lib/kafka/data/kafka-log0

##########
# Control Plane listener
##########
listener.name.controlplane-9090.ssl.keystore.location=/tmp/kafka/cluster.keystore.p12
listener.name.controlplane-9090.ssl.keystore.password=[hidden]
listener.name.controlplane-9090.ssl.keystore.type=PKCS12
listener.name.controlplane-9090.ssl.truststore.location=/tmp/kafka/cluster.truststore.p12
listener.name.controlplane-9090.ssl.truststore.password=[hidden]
listener.name.controlplane-9090.ssl.truststore.type=PKCS12
listener.name.controlplane-9090.ssl.client.auth=required

##########
# Replication listener
##########
listener.name.replication-9091.ssl.keystore.location=/tmp/kafka/cluster.keystore.p12
listener.name.replication-9091.ssl.keystore.password=[hidden]
listener.name.replication-9091.ssl.keystore.type=PKCS12
listener.name.replication-9091.ssl.truststore.location=/tmp/kafka/cluster.truststore.p12
listener.name.replication-9091.ssl.truststore.password=[hidden]
listener.name.replication-9091.ssl.truststore.type=PKCS12
listener.name.replication-9091.ssl.client.auth=required

##########
# Listener configuration: PLAIN-9092
##########

##########
# Listener configuration: TLS-9093
##########
listener.name.tls-9093.ssl.keystore.location=/tmp/kafka/cluster.keystore.p12
listener.name.tls-9093.ssl.keystore.password=[hidden]
listener.name.tls-9093.ssl.keystore.type=PKCS12


##########
# Common listener configuration
##########
listeners=CONTROLPLANE-9090://0.0.0.0:9090,REPLICATION-9091://0.0.0.0:9091,PLAIN-9092://0.0.0.0:9092,TLS-9093://0.0.0.0:9093
advertised.listeners=REPLICATION-9091://my-cluster-5261ed90-kafka-0.my-cluster-5261ed90-kafka-brokers.namespace-0.svc:9091,PLAIN-9092://my-cluster-5261ed90-kafka-0.my-cluster-5261ed90-kafka-brokers.namespace-0.svc:9092,TLS-9093://my-cluster-5261ed90-kafka-0.my-cluster-5261ed90-kafka-brokers.namespace-0.svc:9093
listener.security.protocol.map=CONTROLPLANE-9090:SSL,REPLICATION-9091:SSL,PLAIN-9092:PLAINTEXT,TLS-9093:SSL
inter.broker.listener.name=REPLICATION-9091
sasl.enabled.mechanisms=
ssl.secure.random.implementation=SHA1PRNG
ssl.endpoint.identification.algorithm=HTTPS

##########
# User provided configuration
##########
default.replication.factor=3
inter.broker.protocol.version=3.2
log.message.format.version=3.2
min.insync.replicas=2
offsets.topic.replication.factor=3
transaction.state.log.min.isr=2
transaction.state.log.replication.factor=3
Kraft storage is already formatted
+ exec /usr/bin/tini -w -e 143 -- /opt/kafka/bin/kafka-server-start.sh /tmp/strimzi.properties
2022-05-25 08:24:32,210 INFO Registered kafka:type=kafka.Log4jController MBean (kafka.utils.Log4jControllerRegistration$) [main]
2022-05-25 08:24:32,623 INFO Setting -D jdk.tls.rejectClientInitiatedRenegotiation=true to disable client-initiated TLS renegotiation (org.apache.zookeeper.common.X509Util) [main]
2022-05-25 08:24:32,845 INFO [LogLoader partition=__cluster_metadata-0, dir=/var/lib/kafka/data/kafka-log0] Recovering unflushed segment 0 (kafka.log.LogLoader) [main]
2022-05-25 08:24:32,847 INFO [LogLoader partition=__cluster_metadata-0, dir=/var/lib/kafka/data/kafka-log0] Loading producer state till offset 0 with message format version 2 (kafka.log.UnifiedLog$) [main]
2022-05-25 08:24:32,847 INFO [LogLoader partition=__cluster_metadata-0, dir=/var/lib/kafka/data/kafka-log0] Reloading from producer snapshot and rebuilding producer state from offset 0 (kafka.log.UnifiedLog$) [main]
2022-05-25 08:24:32,849 INFO Deleted producer state snapshot /var/lib/kafka/data/kafka-log0/__cluster_metadata-0/00000000000000000009.snapshot (kafka.log.SnapshotFile) [main]
2022-05-25 08:24:32,851 INFO [LogLoader partition=__cluster_metadata-0, dir=/var/lib/kafka/data/kafka-log0] Producer state recovery took 3ms for snapshot load and 0ms for segment recovery from offset 0 (kafka.log.UnifiedLog$) [main]
2022-05-25 08:24:32,882 INFO [ProducerStateManager partition=__cluster_metadata-0] Wrote producer snapshot at offset 9 with 0 producer ids in 11 ms. (kafka.log.ProducerStateManager) [main]
2022-05-25 08:24:32,916 INFO [LogLoader partition=__cluster_metadata-0, dir=/var/lib/kafka/data/kafka-log0] Loading producer state till offset 9 with message format version 2 (kafka.log.UnifiedLog$) [main]
2022-05-25 08:24:32,916 INFO [LogLoader partition=__cluster_metadata-0, dir=/var/lib/kafka/data/kafka-log0] Reloading from producer snapshot and rebuilding producer state from offset 9 (kafka.log.UnifiedLog$) [main]
2022-05-25 08:24:32,917 INFO [ProducerStateManager partition=__cluster_metadata-0] Loading producer state from snapshot file 'SnapshotFile(/var/lib/kafka/data/kafka-log0/__cluster_metadata-0/00000000000000000009.snapshot,9)' (kafka.log.ProducerStateManager) [main]
2022-05-25 08:24:32,919 INFO [LogLoader partition=__cluster_metadata-0, dir=/var/lib/kafka/data/kafka-log0] Producer state recovery took 3ms for snapshot load and 0ms for segment recovery from offset 9 (kafka.log.UnifiedLog$) [main]
2022-05-25 08:24:33,319 INFO [raft-expiration-reaper]: Starting (kafka.raft.TimingWheelExpirationService$ExpiredOperationReaper) [raft-expiration-reaper]
2022-05-25 08:24:33,519 ERROR Exiting Kafka due to fatal exception (kafka.Kafka$) [main]
java.lang.IllegalStateException: Configured voter set: [0, 1, 2, 3] is different from the voter set read from the state file: [0, 1, 2]. Check if the quorum configuration is up to date, or wipe out the local state file if necessary
	at org.apache.kafka.raft.QuorumState.initialize(QuorumState.java:132)
	at org.apache.kafka.raft.KafkaRaftClient.initialize(KafkaRaftClient.java:364)
	at kafka.raft.KafkaRaftManager.buildRaftClient(RaftManager.scala:203)
	at kafka.raft.KafkaRaftManager.<init>(RaftManager.scala:125)
	at kafka.server.KafkaRaftServer.<init>(KafkaRaftServer.scala:76)
	at kafka.Kafka$.buildServer(Kafka.scala:79)
	at kafka.Kafka$.main(Kafka.scala:87)
	at kafka.Kafka.main(Kafka.scala)
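
The exception is thrown while initializing the Raft quorum: controller.quorum.voters now lists voters [0, 1, 2, 3], but the voter set persisted on disk when node 0 was first formatted is [0, 1, 2], and KRaft in Kafka 3.2 does not support dynamic quorum membership changes, so every pre-existing node fails on restart. The persisted voter set lives in a quorum-state file inside the metadata log directory; a sketch of how to inspect it follows (the file path is an assumption built from the log.dirs value and the __cluster_metadata-0 partition shown above, and the JSON output is illustrative, not captured from this cluster):

  # Inspect the persisted voter set on broker 0 (output below is hypothetical)
  kubectl exec my-cluster-5261ed90-kafka-0 -n namespace-0 -- \
    cat /var/lib/kafka/data/kafka-log0/__cluster_metadata-0/quorum-state
  {"clusterId":"","leaderId":0,"leaderEpoch":1,"votedId":-1,"appliedOffset":0,
   "currentVoters":[{"voterId":0},{"voterId":1},{"voterId":2}],"data_version":0}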

To Reproduce

Steps to reproduce the behavior:

  1. Set up the Cluster Operator with KRaft enabled
  2. Create a Kafka CR with 3 replicas
  3. Scale to 4 replicas (see the sketch after this list)
  4. See the error in the Kafka pod
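
For step 3, one way to trigger the scale-up (a sketch, assuming kubectl access to the cluster; the resource name and namespace match the YAML below) is to patch the Kafka CR:

  kubectl patch kafka my-cluster-5261ed90 -n namespace-0 --type=merge \
    -p '{"spec":{"kafka":{"replicas":4}}}'

The operator then reconciles the change and adds broker 3 to controller.quorum.voters, which is what the existing pods trip over on restart.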

Expected behavior

The Kafka cluster scales from 3 to 4 replicas and all pods become ready.

Environment:

  • Strimzi version: main
  • Installation method: YAML
  • Kubernetes cluster: OpenShift 4.10
  • Infrastructure: OpenStack

YAML files and logs

Kafka with 3 replicas

apiVersion: v1
items:
- apiVersion: kafka.strimzi.io/v1beta2
  kind: Kafka
  metadata:
    annotations:
      strimzi.io/pause-reconciliation: "false"
    labels:
      test.case: testPauseReconciliationInKafkaAndKafkaConnectWithConnector
    name: my-cluster-5261ed90
    namespace: namespace-0
  spec:
    kafka:
      config:
        default.replication.factor: 3
        inter.broker.protocol.version: "3.2"
        log.message.format.version: "3.2"
        min.insync.replicas: 2
        offsets.topic.replication.factor: 3
        transaction.state.log.min.isr: 2
        transaction.state.log.replication.factor: 3
      listeners:
      - name: plain
        port: 9092
        tls: false
        type: internal
      - name: tls
        port: 9093
        tls: true
        type: internal
      logging:
        loggers:
          kafka.root.logger.level: DEBUG
        type: inline
      replicas: 3
      storage:
        deleteClaim: true
        size: 1Gi
        type: persistent-claim
      version: 3.2.0
    zookeeper:
      logging:
        loggers:
          zookeeper.root.logger: DEBUG
        type: inline
      replicas: 3
      storage:
        deleteClaim: true
        size: 1Gi
        type: persistent-claim

Kafka with 4 replicas

apiVersion: v1
items:
- apiVersion: kafka.strimzi.io/v1beta2
  kind: Kafka
  metadata:
    annotations:
      strimzi.io/pause-reconciliation: "false"
    creationTimestamp: "2022-05-25T08:16:08Z"
    generation: 2
    labels:
      test.case: testPauseReconciliationInKafkaAndKafkaConnectWithConnector
    name: my-cluster-5261ed90
    namespace: namespace-0
    resourceVersion: "14487706"
    uid: d7843a81-0409-4769-8858-7ad8d6943a2a
  spec:
    kafka:
      config:
        default.replication.factor: 3
        inter.broker.protocol.version: "3.2"
        log.message.format.version: "3.2"
        min.insync.replicas: 2
        offsets.topic.replication.factor: 3
        transaction.state.log.min.isr: 2
        transaction.state.log.replication.factor: 3
      listeners:
      - name: plain
        port: 9092
        tls: false
        type: internal
      - name: tls
        port: 9093
        tls: true
        type: internal
      logging:
        loggers:
          kafka.root.logger.level: DEBUG
        type: inline
      replicas: 4
      storage:
        deleteClaim: true
        size: 1Gi
        type: persistent-claim
      version: 3.2.0
    zookeeper:
      logging:
        loggers:
          zookeeper.root.logger: DEBUG
        type: inline
      replicas: 3
      storage:
        deleteClaim: true
        size: 1Gi
        type: persistent-claim
  status:
    conditions:
    - lastTransitionTime: "2022-05-25T08:20:30.678Z"
      message: Error while waiting for restarted pod my-cluster-5261ed90-kafka-0 to
        become ready
      reason: FatalProblem
      status: "True"
      type: NotReady
    observedGeneration: 2
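
The only spec difference between the two resources above is spec.kafka.replicas (3 → 4). As the exception message itself suggests ("wipe out the local state file if necessary"), one manual way to get an existing node past the check is to delete its local quorum state file before restart so it rejoins with the new voter set. This is a sketch only, not a supported procedure; the file path is an assumption derived from log.dirs and the __cluster_metadata-0 partition in the log:

  # Repeat for each pre-existing broker pod, then let the pod restart
  kubectl exec my-cluster-5261ed90-kafka-0 -n namespace-0 -- \
    rm /var/lib/kafka/data/kafka-log0/__cluster_metadata-0/quorum-state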
