
InvalidStateException on topic after upgrade to Strimzi 0.29.0 and Kafka 3.2

Open wishkres opened this issue 3 years ago • 5 comments

Describe the bug

I recently completed an upgrade from Strimzi 0.27.1 and Kafka 2.8 to Strimzi 0.29.0 and Kafka 3.2. My existing topics seem to be working correctly, but when I try to create a new topic after the upgrade, the topic never becomes ready, with a reason of "InvalidStateException." My attempts to delete the topic have also failed.

To Reproduce

Steps to reproduce the behavior:

  1. Start on Strimzi 0.27.1 and Kafka 2.8.
  2. Upgrade to Strimzi 0.29.0 and Kafka 3.2 using the documented instructions.
  3. Attempt to deploy the following KafkaTopic:
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: deletethis
  labels:
    strimzi.io/cluster: demo
spec:
  partitions: 10
  replicas: 3
  config:
    retention.ms: 259200000
    segment.bytes: 1073741824

Result: The topic is created, but its status shows NotReady with a reason of InvalidStateException.
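For completeness, the NotReady condition and its reason can be read straight off the KafkaTopic resource; a quick sketch, assuming the `kafka` namespace that appears in the operator log below:

```shell
# Inspect the KafkaTopic's status conditions (namespace taken from the operator log)
kubectl get kafkatopic deletethis -n kafka -o yaml

# Or pull out just the reason of the first status condition
kubectl get kafkatopic deletethis -n kafka \
  -o jsonpath='{.status.conditions[0].reason}'
```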

The topic operator says the following in the logs:

INFO [vert.x-eventloop-thread-1] TopicOperator:577 - Reconciliation #1811754(periodic kube deletethis) KafkaTopic(kafka/deletethis): Reconciling topic deletethis, k8sTopic: nonnull, kafkaTopic: nonnull, privateTopic: null
WARN [vert.x-eventloop-thread-1] TopicOperator:134 - Reconciliation #1811983(periodic kube deletethis) KafkaTopic(kafka/deletethis): io.strimzi.operator.topic.TopicStore$InvalidStateException

Expected behavior

The topic should be created and reach the Ready state, and it should be possible to delete it.

Environment (please complete the following information):

  • Strimzi version: 0.29.0
  • Installation method: Helm chart
  • Kubernetes cluster: Kubernetes 1.20, Rancher v2.5.8
  • Infrastructure: VMware

YAML files and logs

Cluster:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: demo
spec:
  kafka:
    version: 3.2.0
    replicas: 3
    listeners:
      - name: tls
        port: 9093
        type: internal
        tls: true
        authentication:
          type: tls
        configuration:
          brokerCertChainAndKey:
            secretName: brokercert
            certificate: broker.crt
            key: broker.key
      - name: external
        port: 9094
        type: ingress
        tls: true
        authentication:
          type: tls
        configuration:
          bootstrap:
            host: "demo.kafka.SERVERNAMEGOESHERE"
          brokers:
            - broker: 0
              host: "broker-0.demo.kafka.SERVERNAMEGOESHERE"
            - broker: 1
              host: "broker-1.demo.kafka.SERVERNAMEGOESHERE"
            - broker: 2
              host: "broker-2.demo.kafka.SERVERNAMEGOESHERE"
          brokerCertChainAndKey:
            secretName: brokercert
            certificate: broker.crt
            key: broker.key
    authorization:
      type: simple
    config:
      offsets.topic.replication.factor: 2
      transaction.state.log.replication.factor: 1
      transaction.state.log.min.isr: 1
      inter.broker.protocol.version: "3.2"
    storage:
       type: persistent-claim
       class: nas-store
       size: 10Gi
  zookeeper:
    replicas: 3
    storage:
      type: persistent-claim
      class: nas-store
      size: 10Gi
      deleteClaim: false
  entityOperator:
    topicOperator: {}

wishkres avatar Jul 05 '22 14:07 wishkres

Please share the full log from the Topic Operator. Without it, there is no chance to actually understand the issue. Also, please check the existing issues; there might be similar ones already.

scholzj avatar Jul 05 '22 14:07 scholzj

Certainly. I have to go through an approval process to get the log file, but I will post it as soon as I can.

Also, to add: I just restarted the topic operator, and now all of my topics are reported as InvalidStateException.

wishkres avatar Jul 05 '22 15:07 wishkres

Sounds like something is wrong with the Topic Operator's internal store. But without the logs it is hard to say what. Deleting the Topic Operator's internal topics might help to recover, but it's just a guess at this point.
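If it did come to deleting the internal topics, a sketch along these lines might work. This assumes the default Strimzi internal topic names (verify with `--list` first), the `demo` cluster and `kafka` namespace from this issue, and the internal TLS listener on port 9093; `/tmp/client.properties` holding the TLS client settings is a hypothetical placeholder:

```shell
# Delete the Topic Operator's internal store topics so the operator can rebuild
# its store on restart. Topic names are the Strimzi defaults; double-check them
# with --list before deleting anything.
kubectl exec -n kafka demo-kafka-0 -- bin/kafka-topics.sh \
  --bootstrap-server demo-kafka-bootstrap:9093 \
  --command-config /tmp/client.properties \
  --delete --topic __strimzi_store_topic

kubectl exec -n kafka demo-kafka-0 -- bin/kafka-topics.sh \
  --bootstrap-server demo-kafka-bootstrap:9093 \
  --command-config /tmp/client.properties \
  --delete --topic __strimzi-topic-operator-kstreams-topic-store-changelog
```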

scholzj avatar Jul 05 '22 16:07 scholzj

PS: This might be one of the duplicates: #6671

scholzj avatar Jul 05 '22 22:07 scholzj

Thank you for pointing out that duplicate! Although the error message in my log was not exactly the same as the one listed in that issue, the underlying problem was the same.

We originally started out on Strimzi 0.25.0, so as described in that issue, our internal topics were created with a replication factor of 1. While upgrading from 0.27.1 to 0.29.0 I had experimented with setting min.insync.replicas to 2 on the cluster and thought I had removed it, but after your post I double-checked and it was still set in Kubernetes. Removing min.insync.replicas fixed the issue and everything works again. I will work on increasing the replication factor of those internal topics appropriately so that I can set min.insync.replicas again.
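For anyone else hitting this, the leftover setting can be confirmed from a broker before touching the Kafka CR. A sketch, run inside a broker pod; the bootstrap address matches the internal TLS listener from the cluster spec above, and `client.properties` with the TLS credentials is a hypothetical placeholder:

```shell
# Check whether min.insync.replicas is still applied as a cluster-wide broker default
bin/kafka-configs.sh --bootstrap-server demo-kafka-bootstrap:9093 \
  --command-config client.properties \
  --entity-type brokers --entity-default --describe | grep min.insync.replicas

# And check the replication factor of the operator's internal store topic
bin/kafka-topics.sh --bootstrap-server demo-kafka-bootstrap:9093 \
  --command-config client.properties \
  --describe --topic __strimzi_store_topic
```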

I also got approval to post the log file, so here it is attached for completeness. kafka-entity-operator.txt

wishkres avatar Jul 06 '22 15:07 wishkres

Closing as duplicate of #6671.

scholzj avatar Aug 18 '22 14:08 scholzj