strimzi-kafka-operator icon indicating copy to clipboard operation
strimzi-kafka-operator copied to clipboard

Topic operator throws InvalidStateStoreException repeatedly, and doesn't recover

Open LiamClarkeNZ opened this issue 2 years ago • 0 comments

Describe the bug

At times the topic operator, when using the Kafka Streams state store, will hit a state where it repeatedly throws a Streams InvalidStateStoreException with the message: "The state store, topic-store, may have migrated to another instance".

To Reproduce

One method is to delete the changelog topic from underneath the Topic Store. Another involves using min.isr and replica counts of those Streams created internal topics to prevent writing to them.

This latter method was observed by a Strimzi user who was using an old version of Topic Operator which correspondingly used an older version of Kafka Streams where replication factor for internal topics was defaulted by Streams to 1.

When Kafka Streams autocreated the changlog topic, it used the RF of 1, but the topic picked up the broker's default min.ISR of 2, thus the topic could never be written to, which caused Kafka Streams to enter an error state. TO 0.26+ uses a Kafka Streams library that changes this default to "use the cluster defaults".

Expected behavior

If the Kafka Streams instance being used for the state store enters an error state, the topic operator:

  • Produces log statements useful in debugging the failure
  • Restarts either the Kafka Stream state store, or itself

Environment (please complete the following information):

  • Strimzi version: 0.22.0 onwards
  • Installation method: multiple
  • Kubernetes cluster: multiple
  • Infrastructure: multiple

YAML files and logs

2022-04-11 09:43:13,84803 INFO  [OkHttp https://10.115.0.1/...] K8sTopicWatcher:50 - 272814|kube +a-topic|2279692560: event ADDED on resource a-topic generation=1, labels={name=a-topic, strimzi.io/cluster=kafka-cluster, team=infrastructure}
2022-04-11 09:43:13,86780 INFO  [kubernetes-ops-pool-16] CrdOperator:112 - Status of KafkaTopic a-topic in namespace kafka has been updated
2022-04-11 09:43:13,87124 ERROR [vert.x-eventloop-thread-0] K8sTopicWatcher:63 - 272814|kube +a-topic|2279692560: Failure processing KafkaTopic watch event ADDED on resource a-topic with labels {name=a-topic, strimzi.io/cluster=kafka-cluster, team=infrastructure}: The state store, topic-store, may have migrated to another instance.
org.apache.kafka.streams.errors.InvalidStateStoreException: The state store, topic-store, may have migrated to another instance.
	at org.apache.kafka.streams.state.internals.WrappingStoreProvider.stores(WrappingStoreProvider.java:54) ~[org.apache.kafka.kafka-streams-2.6.0.jar:?]
	at org.apache.kafka.streams.state.internals.CompositeReadOnlyKeyValueStore.get(CompositeReadOnlyKeyValueStore.java:52) ~[org.apache.kafka.kafka-streams-2.6.0.jar:?]
	at jdk.internal.reflect.GeneratedMethodAccessor2.invoke(Unknown Source) ~[?:?]
	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
	at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
	at io.strimzi.operator.topic.LocalStoreAndServiceFactory$LazyInvocationHandler.invoke(LocalStoreAndServiceFactory.java:60) ~[io.strimzi.topic-operator-0.22.1.jar:0.22.1]
	at com.sun.proxy.$Proxy23.get(Unknown Source) ~[?:?]
	at io.strimzi.operator.topic.KafkaStreamsTopicStore.read(KafkaStreamsTopicStore.java:74) ~[io.strimzi.topic-operator-0.22.1.jar:0.22.1]
	at io.strimzi.operator.topic.TopicOperator.getFromTopicStore(TopicOperator.java:1438) ~[io.strimzi.topic-operator-0.22.1.jar:0.22.1]
	at io.strimzi.operator.topic.TopicOperator.reconcileOnResourceChange(TopicOperator.java:1073) ~[io.strimzi.topic-operator-0.22.1.jar:0.22.1]
	at io.strimzi.operator.topic.TopicOperator$5.lambda$execute$0(TopicOperator.java:1064) ~[io.strimzi.topic-operator-0.22.1.jar:0.22.1]
	at io.vertx.core.Future.lambda$compose$3(Future.java:368) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
	at io.vertx.core.impl.FutureImpl.dispatch(FutureImpl.java:105) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
	at io.vertx.core.impl.FutureImpl.tryComplete(FutureImpl.java:150) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
	at io.vertx.core.impl.FutureImpl.complete(FutureImpl.java:111) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
	at io.vertx.core.impl.FutureImpl.handle(FutureImpl.java:176) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
	at io.vertx.core.impl.FutureImpl.handle(FutureImpl.java:21) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
	at io.vertx.core.impl.ContextImpl.lambda$null$0(ContextImpl.java:327) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
	at io.vertx.core.impl.ContextImpl.executeTask(ContextImpl.java:366) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
	at io.vertx.core.impl.EventLoopContext.lambda$executeAsync$0(EventLoopContext.java:38) ~[io.vertx.vertx-core-3.9.1.jar:3.9.1]
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) [io.netty.netty-common-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) [io.netty.netty-common-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) [io.netty.netty-transport-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) [io.netty.netty-common-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [io.netty.netty-common-4.1.60.Final.jar:4.1.60.Final]
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty.netty-common-4.1.60.Final.jar:4.1.60.Final]
	at java.lang.Thread.run(Thread.java:834) [?:?]
2022-04-11 09:43:13,87235 WARN  [vert.x-eventloop-thread-0] TopicOperator:125 - Failure processing KafkaTopic watch event ADDED on resource a-topic with labels {name=a-topic, strimzi.io/cluster=kafka-cluster, team=infrastructure}: The state store, topic-store, may have migrated to another instance.
2022-04-11 09:43:46,28636 INFO  [vert.x-eventloop-thread-1] TopicOperator:1249 - Starting periodic  reconciliation

LiamClarkeNZ avatar Apr 12 '22 06:04 LiamClarkeNZ