strimzi-kafka-operator
strimzi-kafka-operator copied to clipboard
Log more information when a rolled pod doesn't become ready
Currently the KafkaRoller
logging when a pod is deleted is rather sparse, like this:
2021-07-09 11:57:14 WARN AbstractOperator:516 - Reconciliation #33(timer) Kafka(namespace-0/my-cluster-1516560308): Failed to reconcile
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Error while waiting for restarted pod my-cluster-1516560308-kafka-0 to become ready
It would be helpful to log more info than this.
Exactly what was logged might therefore include the pod status
and (if this was done at the same time as, or after, #5263) the broker state.
Triaged on 26.5.2022: @LiamClarkeNZ Is this related to #6663? If not, do you still plan to work on it?
Yep can do @scholzj. Will have a look today.
Triaged on 21st June: We need to investigate what is logged currently and how the full log looks like. Not just focus on one error messages which is from something what happens after the roll. So more investigation is needed to understand if this issue still matters.
This is an example of how it looks like today (the roll was triggered by change to the readiness probe configuration):
2022-07-07 19:58:37 INFO OperatorWatcher:38 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Kafka my-cluster in namespace myproject was MODIFIED
2022-07-07 19:58:37 INFO AbstractOperator:226 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Kafka my-cluster will be checked for creation or modification
2022-07-07 19:58:39 INFO KafkaRoller:513 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Pod {my-cluster-kafka-0} needs to be restarted. Reason: [Pod has old revision]
2022-07-07 19:58:39 INFO PodOperator:54 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Rolling pod my-cluster-kafka-0
2022-07-07 19:59:37 INFO AbstractOperator:373 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Reconciliation is in progress
2022-07-07 20:00:37 INFO AbstractOperator:373 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Reconciliation is in progress
2022-07-07 20:01:37 INFO AbstractOperator:373 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Reconciliation is in progress
2022-07-07 20:02:37 INFO AbstractOperator:373 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Reconciliation is in progress
2022-07-07 20:03:37 INFO AbstractOperator:373 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Reconciliation is in progress
2022-07-07 20:03:41 INFO KafkaRoller:289 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Could not restart pod {my-cluster-kafka-0}, giving up after 10 attempts. Total delay between attempts 127750ms
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Error while waiting for restarted pod my-cluster-kafka-0 to become ready
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$awaitReadiness$16(KafkaRoller.java:656) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:689) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.awaitReadiness(KafkaRoller.java:656) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:650) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:347) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:283) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:685) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
... 10 more
2022-07-07 20:03:41 ERROR Util:153 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Exceeded timeout of 300000ms while waiting for Pods resource my-cluster-kafka-0 in namespace myproject to be ready
2022-07-07 20:03:41 WARN KafkaRoller:819 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Error waiting for pod myproject/my-cluster-kafka-0 to become ready: io.strimzi.operator.common.operator.resource.TimeoutException: Exceeded timeout of 300000ms while waiting for Pods resource my-cluster-kafka-0 in namespace myproject to be ready
2022-07-07 20:03:41 ERROR AbstractOperator:247 - Reconciliation #22(watch) Kafka(myproject/my-cluster): createOrUpdate failed
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Error while waiting for restarted pod my-cluster-kafka-0 to become ready
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$awaitReadiness$16(KafkaRoller.java:656) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:689) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.awaitReadiness(KafkaRoller.java:656) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:650) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:347) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:283) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:685) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
... 10 more
2022-07-07 20:03:41 INFO CrdOperator:113 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Status of Kafka my-cluster in namespace myproject has been updated
2022-07-07 20:03:41 WARN AbstractOperator:532 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Failed to reconcile
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Error while waiting for restarted pod my-cluster-kafka-0 to become ready
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$awaitReadiness$16(KafkaRoller.java:656) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:689) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.awaitReadiness(KafkaRoller.java:656) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:650) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:347) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:283) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:685) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
... 10 more
This is the following reconciliation which does not restart the pod anymore, just timeouts waiting for readiness. But this reconciliation obviously does not know the restart reasons anymore:
2022-07-07 20:03:41 INFO OperatorWatcher:38 - Reconciliation #39(watch) Kafka(myproject/my-cluster): Kafka my-cluster in namespace myproject was MODIFIED
2022-07-07 20:03:42 INFO AbstractOperator:226 - Reconciliation #39(watch) Kafka(myproject/my-cluster): Kafka my-cluster will be checked for creation or modification
2022-07-07 20:03:44 INFO KafkaRoller:572 - Reconciliation #39(watch) Kafka(myproject/my-cluster): Dynamic reconfiguration for broker 0 was successful.
2022-07-07 20:04:41 INFO AbstractOperator:373 - Reconciliation #39(watch) Kafka(myproject/my-cluster): Reconciliation is in progress
2022-07-07 20:05:41 INFO AbstractOperator:373 - Reconciliation #39(watch) Kafka(myproject/my-cluster): Reconciliation is in progress
2022-07-07 20:06:41 INFO AbstractOperator:373 - Reconciliation #39(watch) Kafka(myproject/my-cluster): Reconciliation is in progress
2022-07-07 20:07:41 INFO AbstractOperator:373 - Reconciliation #39(watch) Kafka(myproject/my-cluster): Reconciliation is in progress
2022-07-07 20:08:41 INFO AbstractOperator:373 - Reconciliation #39(watch) Kafka(myproject/my-cluster): Reconciliation is in progress
2022-07-07 20:08:44 INFO KafkaRoller:289 - Reconciliation #39(watch) Kafka(myproject/my-cluster): Could not restart pod {my-cluster-kafka-0}, giving up after 10 attempts. Total delay between attempts 127750ms
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Error while waiting for restarted pod my-cluster-kafka-0 to become ready
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$awaitReadiness$16(KafkaRoller.java:656) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:689) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.awaitReadiness(KafkaRoller.java:656) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:349) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:283) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:685) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
... 9 more
2022-07-07 20:08:44 ERROR AbstractOperator:247 - Reconciliation #39(watch) Kafka(myproject/my-cluster): createOrUpdate failed
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Error while waiting for restarted pod my-cluster-kafka-0 to become ready
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$awaitReadiness$16(KafkaRoller.java:656) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:689) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.awaitReadiness(KafkaRoller.java:656) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:349) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:283) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:685) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
... 9 more
2022-07-07 20:08:44 ERROR Util:153 - Reconciliation #39(watch) Kafka(myproject/my-cluster): Exceeded timeout of 300000ms while waiting for Pods resource my-cluster-kafka-0 in namespace myproject to be ready
2022-07-07 20:08:44 WARN KafkaRoller:819 - Reconciliation #39(watch) Kafka(myproject/my-cluster): Error waiting for pod myproject/my-cluster-kafka-0 to become ready: io.strimzi.operator.common.operator.resource.TimeoutException: Exceeded timeout of 300000ms while waiting for Pods resource my-cluster-kafka-0 in namespace myproject to be ready
2022-07-07 20:08:44 WARN AbstractOperator:532 - Reconciliation #39(watch) Kafka(myproject/my-cluster): Failed to reconcile
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Error while waiting for restarted pod my-cluster-kafka-0 to become ready
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$awaitReadiness$16(KafkaRoller.java:656) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:689) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.awaitReadiness(KafkaRoller.java:656) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:349) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:283) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:685) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
... 9 more
The reason for the restart is logged when the pod is deleted. Maybe it can be carried to the waiting for readiness, that would need to be investigates in the code. In any case, the followup reconciliation will nto have it anymore - only the first one might. Personally I'm not sure, I see this as something what needs to be improved, but I definitely don't oppose any improvements if someone wants them.
@tombentley WDYT based on the full log? Do you think the issue should be kept?
Triaged on 21.7.2022: Should be kept. @tombentley wants to improve it, but at this point it is not completely clear how. @ShubhamRwt might work on this.
+1