
Restart loop after unclean shutdown

Open igorcalabria opened this issue 3 years ago • 20 comments

Describe the bug
A hardware failure brought one of our brokers down, and Strimzi kept restarting it, which made it impossible for the broker to recover the unflushed log segments.

To Reproduce
You could probably reproduce this issue by sending a SIGKILL to a broker while running it with a very slow disk. The core issue is that if the broker takes long enough to recover, Strimzi will force-restart it.

Expected behavior
Strimzi should not force-roll a broker while unflushed log segments are being recovered.

Environment (please complete the following information):

  • Strimzi version: 0.23
  • Installation method: YAML
  • Kubernetes cluster: v1.19.8
  • Infrastructure: Amazon EKS

YAML files and logs

2021-07-06 21:53:51 INFO  AbstractOperator:255 - Reconciliation #5815(timer) Kafka(new-kafka/main): Kafka main will be checked for creation or modification
2021-07-06 21:54:23 INFO  KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 250ms
2021-07-06 21:54:51 INFO  AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 21:54:54 INFO  KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 500ms
2021-07-06 21:55:24 INFO  KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 1000ms
2021-07-06 21:55:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 21:55:51 INFO  AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 21:55:55 INFO  KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 2000ms
2021-07-06 21:56:27 INFO  KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 4000ms
2021-07-06 21:56:51 INFO  AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 21:57:01 INFO  KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 8000ms
2021-07-06 21:57:39 INFO  KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 16000ms
2021-07-06 21:57:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 21:57:51 INFO  AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 21:58:25 INFO  KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 32000ms
2021-07-06 21:58:51 INFO  AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 21:59:27 INFO  KafkaRoller:300 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 64000ms
2021-07-06 21:59:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 21:59:51 INFO  AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:00:51 INFO  AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:01:01 INFO  KafkaRoller:508 - Reconciliation #5815(timer) Kafka(new-kafka/main): Pod 3 needs to be restarted. Reason: []
2021-07-06 22:01:42 WARN  KafkaRoller:386 - Reconciliation #5815(timer) Kafka(new-kafka/main): Pod main-kafka-3 will be force-rolled, due to error: Call(callName=listNodes, deadlineMs=1625608901805, tries=1, nextAllowedTryMs=1625608901906) timed out at 1625608901806 after 1 attempt(s)
2021-07-06 22:01:42 INFO  PodOperator:65 - Rolling update of new-kafka/main-kafka: Rolling pod main-kafka-3
2021-07-06 22:01:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:01:51 INFO  AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:02:51 INFO  AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:03:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:03:51 INFO  AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:04:51 INFO  AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:05:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:05:51 INFO  AbstractOperator:399 - Reconciliation #5815(timer) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:06:42 INFO  KafkaRoller:293 - Reconciliation #5815(timer) Kafka(new-kafka/main): Could not roll pod 3, giving up after 10 attempts. Total delay between attempts 127750ms
io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: Error while trying to restart pod main-kafka-3 to become ready
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$restartAndAwaitReadiness$15(KafkaRoller.java:640) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:680) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:640) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:387) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:278) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:676) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	... 9 more
2021-07-06 22:06:42 ERROR AbstractOperator:276 - Reconciliation #5815(timer) Kafka(new-kafka/main): createOrUpdate failed
io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: Error while trying to restart pod main-kafka-3 to become ready
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$restartAndAwaitReadiness$15(KafkaRoller.java:640) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:680) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:640) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:387) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:278) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:676) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	... 9 more
2021-07-06 22:06:42 INFO  OperatorWatcher:40 - Reconciliation #5822(watch) Kafka(new-kafka/main): Kafka main in namespace new-kafka was MODIFIED
2021-07-06 22:06:42 INFO  CrdOperator:108 - Status of Kafka main in namespace new-kafka has been updated
2021-07-06 22:06:42 WARN  AbstractOperator:516 - Reconciliation #5815(timer) Kafka(new-kafka/main): Failed to reconcile
io.strimzi.operator.cluster.operator.resource.KafkaRoller$UnforceableProblem: Error while trying to restart pod main-kafka-3 to become ready
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$restartAndAwaitReadiness$15(KafkaRoller.java:640) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:680) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:640) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:387) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:278) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:676) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	... 9 more
2021-07-06 22:06:42 INFO  AbstractOperator:255 - Reconciliation #5822(watch) Kafka(new-kafka/main): Kafka main will be checked for creation or modification
2021-07-06 22:07:14 INFO  KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 250ms
2021-07-06 22:07:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:07:45 INFO  KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 500ms
2021-07-06 22:07:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:08:15 INFO  KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 1000ms
2021-07-06 22:08:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:08:46 INFO  KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 2000ms
2021-07-06 22:09:18 INFO  KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 4000ms
2021-07-06 22:09:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:09:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:09:52 INFO  KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 8000ms
2021-07-06 22:10:30 INFO  KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 16000ms
2021-07-06 22:10:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:11:16 INFO  KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 32000ms
2021-07-06 22:11:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:11:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:12:18 INFO  KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 64000ms
2021-07-06 22:12:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:13:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:13:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:13:52 INFO  KafkaRoller:508 - Reconciliation #5822(watch) Kafka(new-kafka/main): Pod 3 needs to be restarted. Reason: []
2021-07-06 22:14:33 WARN  KafkaRoller:386 - Reconciliation #5822(watch) Kafka(new-kafka/main): Pod main-kafka-3 will be force-rolled, due to error: Call(callName=listNodes, deadlineMs=1625609672873, tries=1, nextAllowedTryMs=1625609672974) timed out at 1625609672874 after 1 attempt(s)
2021-07-06 22:14:33 INFO  PodOperator:65 - Rolling update of new-kafka/main-kafka: Rolling pod main-kafka-3
2021-07-06 22:14:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:15:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:15:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:16:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:17:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:17:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:18:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:19:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:19:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:20:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:21:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:21:51 INFO  ClusterOperator:125 - Triggering periodic reconciliation for namespace new-kafka...
2021-07-06 22:22:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:23:42 INFO  AbstractOperator:399 - Reconciliation #5822(watch) Kafka(new-kafka/main): Reconciliation is in progress
2021-07-06 22:23:46 INFO  KafkaRoller:284 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not restart pod 3, giving up after 10 attempts. Total delay between attempts 127750ms
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Error while waiting for restarted pod main-kafka-3 to become ready
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$awaitReadiness$16(KafkaRoller.java:647) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:680) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.awaitReadiness(KafkaRoller.java:647) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:641) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:387) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:278) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:676) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	... 10 more
2021-07-06 22:23:46 ERROR AbstractOperator:276 - Reconciliation #5822(watch) Kafka(new-kafka/main): createOrUpdate failed
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Error while waiting for restarted pod main-kafka-3 to become ready
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$awaitReadiness$16(KafkaRoller.java:647) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:680) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.awaitReadiness(KafkaRoller.java:647) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:641) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:387) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:278) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:676) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	... 10 more
2021-07-06 22:23:46 ERROR Util:136 - Exceeded timeout of 300000ms while waiting for Pods resource main-kafka-3 in namespace new-kafka to be ready
2021-07-06 22:23:46 WARN  KafkaRoller:807 - Reconciliation #5822(watch) Kafka(new-kafka/main): Error waiting for pod new-kafka/main-kafka-3 to become ready: io.strimzi.operator.common.operator.resource.TimeoutException: Exceeded timeout of 300000ms while waiting for Pods resource main-kafka-3 in namespace new-kafka to be ready
2021-07-06 22:23:46 INFO  CrdOperator:108 - Status of Kafka main in namespace new-kafka has been updated
2021-07-06 22:23:46 INFO  OperatorWatcher:40 - Reconciliation #5831(watch) Kafka(new-kafka/main): Kafka main in namespace new-kafka was MODIFIED
2021-07-06 22:23:46 WARN  AbstractOperator:516 - Reconciliation #5822(watch) Kafka(new-kafka/main): Failed to reconcile
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Error while waiting for restarted pod main-kafka-3 to become ready
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$awaitReadiness$16(KafkaRoller.java:647) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:680) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.awaitReadiness(KafkaRoller.java:647) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:641) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:387) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:278) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:676) ~[io.strimzi.cluster-operator-0.23.0.jar:0.23.0]
	... 10 more
2021-07-06 22:23:46 INFO  AbstractOperator:255 - Reconciliation #5831(watch) Kafka(new-kafka/main): Kafka main will be checked for creation or modification

Additional context

It seems that the main cause is that https://github.com/strimzi/strimzi-kafka-operator/blob/4ab518535d1ef4feefac38845c594fa3520adc9c/cluster-operator/src/main/java/io/strimzi/operator/cluster/operator/resource/KafkaRoller.java#L521 fails while the broker is recovering its log segments.

2021-07-06 22:07:14 INFO  KafkaRoller:300 - Reconciliation #5822(watch) Kafka(new-kafka/main): Could not roll pod 3 due to io.strimzi.operator.cluster.operator.resource.KafkaRoller$ForceableProblem: Error getting broker config, retrying after at least 250ms

This may also be caused by some corner case in the AdminClient and not necessarily be linked to the recovery phase of the cluster.

igorcalabria avatar Jul 07 '21 14:07 igorcalabria

@tombentley @sknot-rh I guess this is something in the KafkaRoller, which is your specialty. Can you have a look? I do not think this is the first time I have seen this scenario.

scholzj avatar Jul 07 '21 22:07 scholzj

This is a bit tricky. Kafka doesn't really give any visibility into the log recovery process except via application logs. That is, there are no metrics associated with log recovery and it happens before the socket server is started, so there's no option for any kind of AdminClient API. This means that when we start a broker we can't really tell whether it is doing log recovery and making progress or whether it is in a bad state, unless we're prepared for the operator to read the broker logs (which is not really a stable API since log messages can be changed).

When a broker is shut down normally it closes its logs cleanly and on restart log recovery is essentially instantaneous. So normally a lengthy wait would indicate some problem.

There is no good upper bound for how long to wait. It would depend on the number of logs needing significant recovery.

So how do we distinguish a perfectly normal wait (of minutes or even hours) for log recovery following a crash from the broker just not coming up for reasons unknown (where retrying by deleting the pod might help)?

A few ideas:

  1. If the pod is available when the KafkaRoller starts then I guess we can infer that the broker probably hasn't crashed, and so shouldn't expect it to spend a long time doing log recovery. This is imperfect though, because it would a) rely on state within the operator, which would be lost if the operator restarted during this process, and b) wouldn't detect a broker crash that happened during shutdown.
  2. We could somehow monitor the broker state metric and allow more time if we observe it in RecoveringFromUncleanShutdown. This is also imperfect because it doesn't prevent the broker getting stuck (e.g. in an infinite loop) during recovery. And it still relies on configuring some timeout, just a longer one to be used in the recovery case.
  3. We could also consider adding a metric (gauge) for the number of logs which are awaiting recovery. This is slightly more work, but maybe makes it easier to configure a good timeout in the recovery case, since we'd be prepared to wait so long as we observed the number decreasing. In other words the timeout becomes one about how long it should take to recover an individual log rather than all the logs. It's a bit easier to set a good default for that.

My gut feeling is that the first option is not really a good idea, so we should implement the 2nd option now and see if we can get a small KIP accepted which would allow us to eventually do the 3rd.

Thoughts?

tombentley avatar Jul 08 '21 11:07 tombentley

Thinking about this a little more... something like option 2 could also be used for shutdown. I.e. when we delete the pod we'd expect the broker state to transition RUNNING -> PENDING_CONTROLLED_SHUTDOWN -> SHUTTING_DOWN. So we could delete the pod with a long initial grace period and, if we didn't see those state changes, delete it again with a shorter grace period (I assume Kubernetes would do the right thing here, and once the grace period from the 2nd delete expired it would kill -9 the container, rather than having the first deletion take precedence).

tombentley avatar Jul 09 '21 05:07 tombentley

Another thought I had about this: currently when we do a restart we call KafkaRoller.awaitReadiness(pod, operationTimeoutMs, TimeUnit.MILLISECONDS); and using operationTimeoutMs there propagates to the timeout we use for readiness following the pod deletion. That seems like a bug to me. We should be prepared to wait for the timeout of the readiness probe defined in the CR, so that it's under explicit user control. Fixing this should be a priority, but it doesn't help the user know what the correct time to configure is.

If the user hasn't defined a probe in the CR then perhaps using operationTimeoutMs as a default is reasonable.
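For reference, a minimal sketch of where such a probe is declared in the Kafka CR (the readinessProbe fields follow the standard Strimzi Probe schema; the values are purely illustrative, not recommendations):

spec:
  kafka:
    # Illustrative values only: a broker recovering many log segments may
    # need a much larger window before it is considered failed.
    readinessProbe:
      initialDelaySeconds: 15
      timeoutSeconds: 5
      periodSeconds: 10
      failureThreshold: 30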

tombentley avatar Jul 12 '21 07:07 tombentley

The way operationTimeoutMs is used in other places is IMHO to throw an error and retry in the next reconciliation. This might be the only place where operationTimeoutMs results in some destructive action. Would the most basic fix be to follow the same logic here?

scholzj avatar Jul 12 '21 07:07 scholzj

At the point where we're waiting for the pod to become ready after deleting its previous incarnation, we've already done the destructive part. The existing code throws FatalProblem when the timeout is breached, which exits the reconciliation. The next reconciliation considers unready pods for rolling first (line 203), and we hit the check on line 379, where we will again time out with a FatalError after operationTimeoutMs.

So to answer your question, I guess that's what we're doing here already.

By exiting the first reconciliation we've lost the context that it was the operator restarting the pod which is the reason the pod is not ready. I.e. although after the first reconciliation the status should contain FatalProblem: Error while waiting for restarted pod main-kafka-3 to become ready, after the 2nd we'll have lost the "for restarted pod" part: it just looks like a pod that's not ready. I think that matters because log recovery is only necessary because the broker didn't exit cleanly, and if the cause of the restart was pod deletion, that points to the terminationGracePeriod being too small. The current behaviour makes it harder for the user to discover that.

Using operationTimeoutMs for the wait following the pod deletion makes it more likely that the user ends up in this situation.

tombentley avatar Jul 12 '21 08:07 tombentley

Hi @tombentley,

With regard to the solution 2, a few thoughts:

  • How would you feel about the controller connecting to the problematic broker via JMX to query those metrics?
  • On losing context for why a pod roll was triggered when the reconciliation ends, assuming the story to emit pod roll reasons as K8s events has been completed, the controller could query them to recover that context.
  • If a broker is still showing as being in recovery, which as you mention, can take hours, I propose the roller doesn't touch it, unless a human manually annotates it to be restarted - I figure that the people running the cluster are monitoring it, so they can mark it for restart and it'll be honoured.

LiamClarkeNZ avatar Jul 13 '21 02:07 LiamClarkeNZ

Connecting to JMX would be problematic from a security perspective, I guess. It would collide with the user-facing JMX settings.

scholzj avatar Jul 13 '21 08:07 scholzj

  • How would you feel about the controller connecting to the problematic broker via JMX to query those metrics?

That's one way. I guess the other would be to fit the KafkaAgent (which talks directly with Yammer metrics in the broker, thus avoiding JMX) with an HTTP server and querying that. Currently the Kafka readiness probe is based on a file that the agent writes to the file system, the presence of which an exec probe tests for. If we had an HTTP server we could use an HTTP probe instead in the future.

I don't think there's much to choose between them, really, except the agent-based approach feels like it could be more flexible in the long run.

  • On losing context for why a pod roll was triggered when the reconciliation ends, assuming the story to emit pod roll reasons as K8s events has been completed, the controller could query them to recover that context.

I don't think the kube apiserver makes any guarantees about how long it keeps those events around for.

I think this would make for a nice CR API:

status:
  # ...
  brokers:
  - id: 1
    brokerState: Recovery
    reason: ConfigChange
  - id: 2
    brokerState: Running

Where brokerState comes directly from Kafka, with the existing Unknown state used when the container wasn't running at all (in which state the user would need to consult the pod and container statuses to get a full picture of what's going on). The reasons would be essentially the same as the reasons we already log when restarting pods. The reason would be cleared once the broker was observed to enter the Running state and would be set prior to deleting the pod during a restart.

  • If a broker is still showing as being in recovery, which as you mention, can take hours, I propose the roller doesn't touch it, unless a human manually annotates it to be restarted - I figure that the people running the cluster are monitoring it, so they can mark it for restart and it'll be honoured.

I think you're right about not touching a broker that we know is in recovery (also STARTING, since the broker might have transitioned to RECOVERY by the time you deleted a pod for a broker that you'd observed in STARTING).

I think it's a bit trickier when it comes to people wanting to restart a pod that's in recovery. KafkaServer.shutdown starts like this:

  override def shutdown(): Unit = {
    try {
      info("shutting down")

      if (isStartingUp.get)
        throw new IllegalStateException("Kafka server is still starting up, cannot shut down!")

And the shutdown hook is registered like this:

      Exit.addShutdownHook("kafka-shutdown-hook", {
        try server.shutdown()
        catch {
          case _: Throwable =>
            fatal("Halting Kafka.")
            // Calling exit() can lead to deadlock as exit() can be called multiple times. Force exit.
            Exit.halt(1)
        }
      })

So a SIGTERM on a starting broker is effectively the same as a SIGKILL in that it results in an unclean shutdown. The current meaning of the strimzi.io/force-restart=true annotation on a broker pod is that the pod will be deleted by the KafkaRoller with the guarantees that provides over (cluster-wide) concerns like availability. So a question in my mind is whether we should extend that guarantee so that it avoids an unclean shutdown, since if the user wants an unclean shutdown they can always delete the pod directly themselves.

I suppose there could be a benefit to being able to do an unclean shutdown with the cluster-wide guarantees that KafkaRoller is providing. So maybe we should say strimzi.io/force-restart=true is a clean shutdown with the cluster-wide guarantees (and thus we would not honour that annotation on a broker that was doing log recovery) and consider supporting strimzi.io/force-restart=unclean to enable an unclean shutdown in the future if we found it was needed.

tombentley avatar Jul 13 '21 08:07 tombentley

Okay, to try to summarise the discussion so far, to make sure I've understood it correctly

  1. Expose broker state to the operator via KafkaAgent
  2. Leave brokers in STARTING or RECOVERY alone.
  3. Sending SIGTERM to a broker starting up will lead to an unclean shutdown.
  4. Using the operator timeout to wait for a broker to come up could cause unclean shutdowns, and perhaps a readiness probe derived timeout is better, as it gives the user explicit control over that timeout
  5. The operator should store reconciliation state as a CR.

Do I have this correct so far? And am I missing anything pertinent?

Questions

  • To fix this issue ASAP, which of the above do we need?
  • After we've fixed this particular issue, what other changes are necessary to prevent issues?

LiamClarkeNZ avatar Jul 14 '21 06:07 LiamClarkeNZ

Assuming this comment is correct, the root cause of the original report is that the readiness probes caused kube to restart the pod (i.e. it wasn't the operator). It would be good to confirm that.

I would add the following to your list:

  6. When we fail a reconciliation due to a timeout while awaiting readiness during ongoing log recovery, we should try to improve the error message to include that the broker is doing log recovery.
  7. We should add a metric for the time the controller observes the broker spending in each phase. The idea here is to surface in the metrics the normal time spent doing recovery and controlled shutdown. This should provide users the info they need for configuring timeouts correctly.
  8. A procedure for tuning the terminationGracePeriod and the readiness timeout should be added to the docs (see the sketch after this list for where the terminationGracePeriod is set).
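For item 8, a minimal sketch of where the terminationGracePeriod already lives in the Kafka CR (the field path follows the standard Strimzi pod template schema; the readiness probe counterpart is sketched earlier in the thread, and the value here is only a placeholder to be tuned):

spec:
  kafka:
    template:
      pod:
        # How long Kubernetes waits for a clean broker shutdown before
        # sending SIGKILL; too small a value causes unclean shutdowns and
        # therefore log recovery on the next start.
        terminationGracePeriodSeconds: 120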

I don't think further changes would be needed if we did all these things (well, maybe there's some value to using startup probes, but I think that would be a bit tricky with compatibility with older versions of kube).

@scholzj are you happy with this plan?

tombentley avatar Jul 14 '21 10:07 tombentley

One other thought: It would be nice if we logged a warning when we observed recovery that was within, say, 20% of the timeout. Ditto for termination. And an error when they were exceeded. Each pointing the user to where they should configure revised values.

tombentley avatar Jul 14 '21 11:07 tombentley

@tombentley

On 6) should we fail a reconciliation if the broker is in RECOVERY? Assuming that the operator is able to view the broker state.

On 7) So we should expose that as a controller metric? Broker state should already be available via the Kafka JMX exporter, and those phase times are derivable from that metric. But happy to make it more easily accessible via controller metrics.

On 8) I'm very keen to do so; I'll need to research that readiness probe to properly understand it first.

And yes, logging warnings for those cases would be good. Quick question though - when you say "20% of the timeout", which timeout do you mean?

LiamClarkeNZ avatar Jul 14 '21 15:07 LiamClarkeNZ

@tombentley @scholzj if you both agree on the way forward, I'll capture the specific tasks (as I understand them, my usual caveat) in new issue(s), depending on how granular they can be, and I'll ping you to review them. :)

LiamClarkeNZ avatar Jul 14 '21 15:07 LiamClarkeNZ

Long discussion and it is hard to comment.

So maybe we should say strimzi.io/force-restart=true is a clean shutdown with the cluster-wide guarantees (and thus we would not honour that annotation on a broker that was doing log recovery) and consider supporting strimzi.io/force-restart=unclean to enable an unclean shutdown in the future if we found it was needed.

We should keep the existing logic, for backwards compatibility, but also because things such as the Drain Cleaner depend on it. The annotation is actually named strimzi.io/manual-rolling-update ... so we can still add strimzi.io/force-restart to force-restart it without the KafkaRoller. But in a way the user can just use kubectl delete pods instead.
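For reference, a minimal sketch of how the existing annotation is applied to a broker pod today (the pod name is just an example taken from this issue):

metadata:
  name: main-kafka-3
  annotations:
    # KafkaRoller will restart this pod in the next reconciliation while
    # still applying its usual cluster-wide availability checks.
    strimzi.io/manual-rolling-update: "true"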


Using JMX or the Prometheus metrics is complicated. It makes sense to expose additional info through it, but Prometheus metrics are configured by users, so they might disable your metrics or disable metrics overall. JMX is similar, plus I'm not sure the security is good enough to have it enabled by default. So I guess having some simple HTTP server in the kafka-agent to expose this information might be easiest. It might also simplify the probes that use the files today, maybe? If I got it right, my understanding is that the readiness probe alone works. But the KafkaRoller kills it in the next loop, so we just need to detect the recovery and skip it in the next loop?

I'm not sure what we need the CR status for. The status will not store whether the pod restarted for some other reason and is recovering; we should follow the same rules as if the pod was restarted by the RU in a previous reconciliation. So what does it actually help with? You anyway need to get the information about the recovery from the pod, or?

scholzj avatar Jul 14 '21 19:07 scholzj

But the KafkaRoller kills it in the next loop, so we just need to detect the recovery and skip it in the next loop?

I might well be wrong, but I don't see how this is compatible with my reading of the code. I think we need to confirm exactly what's going on (@LiamClarkeNZ maybe you could try to reproduce?).

I'm not sure what we need the CR status for...

I'm happy to skip this CR change for now, since I agree it's still a bit too vague to be useful.

tombentley avatar Jul 15 '21 17:07 tombentley

@LiamClarkeNZ

On 7) So we should expose that as a controller metric? Broker state should already be available via the Kafka JMX

I think the main thing is that we can make it accessible to users so they can use it to inform how they set the timeouts. I'm not too bothered about exactly how. So if Prometheus could derive this and we added it to the dashboard, then I don't think there's a need to have this as an operator metric.

Quick question though - when you say "20% of the timeout", which timeout do you mean?

I think I mean the readiness timeout and the terminationGracePeriod.

tombentley avatar Jul 15 '21 17:07 tombentley

We also experienced this while installing Cruise Control. This caused the pods to restart with the new configuration for metrics, one of the pods went into recovery mode, and the operator kept killing the pod, not allowing it to finish recovery. We ended up pausing the operator to let the broker come back up and finish recovery.
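For anyone needing the same escape hatch, a minimal sketch of pausing reconciliation with the strimzi.io/pause-reconciliation annotation on the Kafka resource (the resource name is illustrative; remove the annotation once recovery has finished):

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: main
  annotations:
    # While this is "true" the Cluster Operator stops reconciling (and thus
    # stops rolling) this cluster.
    strimzi.io/pause-reconciliation: "true"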

Are there any manual workarounds for this, or extended timeouts we can apply? I know we cannot determine how long it will take a pod to recover; however, I am curious whether we can easily extend the timeout to something a little longer than what's currently configured.

timothystewart6 avatar Jun 22 '22 15:06 timothystewart6

Triaged on 21.7.2022: The plan is to improve the KafkaRoller and the agent in Kafka to provide more details about the current Kafka state. That will allow us to handle this issue in a better way. This should be kept open; proposal needed.

scholzj avatar Jul 21 '22 15:07 scholzj

Are there any manual workarounds for this, or extended timeouts we can apply?

We managed to work around the problem by setting strimzi-kafka-operator.operationTimeoutMs to a value that is larger than the restart time after unclean shutdown. 20 minutes worked quite well for us.
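For others hitting this, a minimal sketch of that override as a Helm value (it maps to the STRIMZI_OPERATION_TIMEOUT_MS environment variable on the Cluster Operator deployment; 1200000 ms = 20 minutes is just the example value mentioned above):

# values.yaml for the strimzi-kafka-operator chart
operationTimeoutMs: 1200000  # pick a value larger than the worst-case recovery time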

maowerner avatar Aug 01 '22 10:08 maowerner

I semi-randomly hit this problem in my EKS cluster, most often when one of the cluster nodes has been in an unhealthy state and one of the Kafka pods has been moved to a different node.

I think one of the things that happens here is that when Kafka is moved to a different node, the Linux filesystem (memory) cache is lost, meaning that all disk accesses actually have to hit the disk.

Another factor is that I'm running my disks on AWS EBS st1 volumes, which have a throughput credits model, so what I typically will see when this happens is:

  1. Kafka is moved to different node. Starts loading data from disk. This consumes some of the burst credits of the st1 disk.
  2. Cluster operator will time out and kill the kafka pod.

This will iterate a few times until the burst credits are exhausted and the disk I/O drops to the base rate (which depends on the size of the volume), which makes the process even slower and more or less guarantees that the pod will not recover.

On the first X attempts, before burst credits are exhausted, my setup will manage to load some 120/405 Logs on the first attempt, then some 210/405 Logs on consecutive attempts (due to the disk cache being populated).

After burst credits are exhausted, it manages to load 20/405 Logs before the Cluster Operator kills the pod.

At this stage, which is of course where I find the cluster an early morning when I'm about to travel, I usually have to turn off the cluster operator, then hand-edit the StatefulSet to make the readiness probe allow 1800 attempts every 10 seconds, then let the cluster slowly recover.

I will try setting operationTimeoutMs to a higher value and see if this helps, but as discussed above this needs a better root cause fix.

In addition to letting the cluster operator be smarter during this phase, I'm pondering whether a startup probe would make sense here, at the pod level. Having to set a really high failureThreshold on the readinessProbe doesn't feel optimal, given that the readinessProbe result governs whether a pod is part of the bootstrap Service or not. On the other hand, I don't know if the current implementation of the readinessProbe will ever go back to an unhealthy state after Kafka has opened its port.

So that's my 5 cents.

How does the behaviour change with StrimziPodSets? I'm a bit behind (0.27.1/Kafka 2.8.0), so I'm not using them yet. A bit worried I won't be able to get my pods started if I can't modify a StatefulSet when needed.

forsberg avatar Jun 14 '23 15:06 forsberg

@tinaselenge is working on a fix to the operator to not stop the pod during recovery in #8601.

You should also check how your pods are terminating => maybe you need to increase the termination grace period to give them more time to shut down cleanly (unless this happens to you only after your node is killed by AWS, for example, without a proper shutdown procedure, of course).

scholzj avatar Jun 14 '23 16:06 scholzj

That is really good news, much appreciated! It will for sure make my life easier, as this has been one of the things in the K8S cluster that would not automatically recover.

I guess that still leaves me having to configure the readiness probe failureThreshold and periodSeconds manually to fit the environment, to avoid the pod getting killed by Kubernetes in these situations. If this can also be automated, that would help the inexperienced operator.

One idea would be to have the Cluster Operator create a startupProbe where the failureThreshold and periodSeconds by default are roughly the same as the max number of attempts and the delay between attempts in the cluster operator's settings, and have this probe return ready when the HTTP endpoint indicates the broker is done recovering.
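To make that idea concrete, a hypothetical sketch of the kind of startupProbe the Cluster Operator could generate on the broker container (the endpoint path, port, and thresholds are placeholders for the proposal, not existing Strimzi behaviour):

startupProbe:
  httpGet:
    path: /v1/broker-state  # placeholder endpoint served by the agent
    port: 8443              # placeholder port
  periodSeconds: 10
  failureThreshold: 180     # placeholder; sized to cover expected recovery time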

forsberg avatar Jun 14 '23 19:06 forsberg

I don't think a startup probe is a solution for this, since the process can take long. Kubernetes should normally not kill a pod which is handling the liveness probe.

scholzj avatar Jun 14 '23 22:06 scholzj

I think you just made me realize a thing or two about readiness/liveness probes, and how the strimzi liveness probe actually works.

Thanks! :-)

forsberg avatar Jun 15 '23 05:06 forsberg

@scholzj should this be closed with the release of 0.36?

douglasawh avatar Aug 02 '23 15:08 douglasawh

Good point. Thanks. Should be fixed in #8601.

scholzj avatar Aug 02 '23 15:08 scholzj