strimzi-kafka-operator
strimzi-kafka-operator copied to clipboard
Topic operator throws InvalidStateStoreException repeatedly, and doesn't recover
Describe the bug
At times the topic operator, when using the Kafka Streams state store, will hit a state where it repeatedly throws a Streams InvalidStateStoreException with the message: "The state store, topic-store, may have migrated to another instance".
To Reproduce
One method is to delete the changelog topic from underneath the Topic Store. Another involves using min.isr and replica counts of those Streams created internal topics to prevent writing to them.
This latter method was observed by a Strimzi user who was using an old version of Topic Operator which correspondingly used an older version of Kafka Streams where replication factor for internal topics was defaulted by Streams to 1.
When Kafka Streams autocreated the changlog topic, it used the RF of 1, but the topic picked up the broker's default min.ISR of 2, thus the topic could never be written to, which caused Kafka Streams to enter an error state. TO 0.26+ uses a Kafka Streams library that changes this default to "use the cluster defaults".
Expected behavior
If the Kafka Streams instance being used for the state store enters an error state, the topic operator:
- Produces log statements useful in debugging the failure
- Restarts either the Kafka Stream state store, or itself
Environment (please complete the following information):
- Strimzi version: 0.22.0 onwards
- Installation method: multiple
- Kubernetes cluster: multiple
- Infrastructure: multiple
YAML files and logs
2022-04-11 09:43:13,84803 INFO [OkHttp https://10.115.0.1/...] K8sTopicWatcher:50 - 272814|kube +a-topic|2279692560: event ADDED on resource a-topic generation=1, labels={name=a-topic, strimzi.io/cluster=kafka-cluster, team=infrastructure}
2022-04-11 09:43:13,86780 INFO [kubernetes-ops-pool-16] CrdOperator:112 - Status of KafkaTopic a-topic in namespace kafka has been updated
2022-04-11 09:43:13,87124 ERROR [vert.x-eventloop-thread-0] K8sTopicWatcher:63 - 272814|kube +a-topic|2279692560: Failure processing KafkaTopic watch event ADDED on resource a-topic with labels {name=a-topic, strimzi.io/cluster=kafka-cluster, team=infrastructure}: The state store, topic-store, may have migrated to another instance.
org.apache.kafka.streams.errors.InvalidStateStoreException: The state store, topic-store, may have migrated to another instance.
at org.apache.kafka.streams.state.internals.WrappingStoreProvider.stores(WrappingStoreProvider.java:54) ~[org.apache.kafka.kafka-streams-2.6.0.jar:?]
at org.apache.kafka.streams.state.internals.CompositeReadOnlyKeyValueStore.get(CompositeReadOnlyKeyValueStore.java:52) ~[org.apache.kafka.kafka-streams-2.6.0.jar:?]
at jdk.internal.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) ~[?:?]
at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
at io.strimzi.operator.topic.LocalStoreAndServiceFactory$LazyInvocationHandler.invoke(LocalStoreAndServiceFactory.java:60) ~[io.strimzi.topic-operator-0.22.1.jar:0.22.1]
at com.sun.proxy.$Proxy23.get(Unknown Source) ~[?:?]
at io.strimzi.operator.topic.KafkaStreamsTopicStore.read(KafkaStreamsTopicStore.java:74) ~[io.strimzi.topic-operator-0.22.1.jar:0.22.1]
at io.strimzi.operator.topic.TopicOperator.getFromTopicStore(TopicOperator.java:1438) ~[io.strimzi.topic-operator-0.22.1.jar:0.22.1]
at io.strimzi.operator.topic.TopicOperator.reconcileOnResourceChange(TopicOperator.java:1073) ~[io.strimzi.topic-operator-0.22.1.jar:0.22.1]
at io.strimzi.operator.topic.TopicOperator$5.lambda$execute$0(TopicOperator.java:1064) ~[io.strimzi.topic-operator-0.22.1.jar:0.22.1]
at io.vertx.core.Future.lambda$compose$3(Future.java:368) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
at io.vertx.core.impl.FutureImpl.dispatch(FutureImpl.java:105) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
at io.vertx.core.impl.FutureImpl.tryComplete(FutureImpl.java:150) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
at io.vertx.core.impl.FutureImpl.complete(FutureImpl.java:111) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
at io.vertx.core.impl.FutureImpl.handle(FutureImpl.java:176) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
at io.vertx.core.impl.FutureImpl.handle(FutureImpl.java:21) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
at io.vertx.core.impl.ContextImpl.lambda$null$0(ContextImpl.java:327) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
at io.vertx.core.impl.ContextImpl.executeTask(ContextImpl.java:366) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
at io.vertx.core.impl.EventLoopContext.lambda$executeAsync$0(EventLoopContext.java:38) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) [io.netty.netty-common-4.1.60.Final.jar:4.1.60.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) [io.netty.netty-common-4.1.60.Final.jar:4.1.60.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) [io.netty.netty-transport-4.1.60.Final.jar:4.1.60.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) [io.netty.netty-common-4.1.60.Final.jar:4.1.60.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [io.netty.netty-common-4.1.60.Final.jar:4.1.60.Final]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty.netty-common-4.1.60.Final.jar:4.1.60.Final]
at java.lang.Thread.run(Thread.java:834) [?:?]
2022-04-11 09:43:13,87235 WARN [vert.x-eventloop-thread-0] TopicOperator:125 - Failure processing KafkaTopic watch event ADDED on resource a-topic with labels {name=a-topic, strimzi.io/cluster=kafka-cluster, team=infrastructure}: The state store, topic-store, may have migrated to another instance.
2022-04-11 09:43:46,28636 INFO [vert.x-eventloop-thread-1] TopicOperator:1249 - Starting periodic reconciliation