
Log more information when a rolled pod doesn't become ready

Open · tombentley opened this issue 3 years ago · 6 comments

Currently the KafkaRoller's logging when a rolled pod doesn't become ready is rather sparse, like this:

2021-07-09 11:57:14 WARN  AbstractOperator:516 - Reconciliation #33(timer) Kafka(namespace-0/my-cluster-1516560308): Failed to reconcile
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Error while waiting for restarted pod my-cluster-1516560308-kafka-0 to become ready

It would be helpful to log more info than this.

Exactly what gets logged could include the pod status and (if this is done at the same time as, or after, #5263) the broker state; a rough sketch of the pod status idea is below.
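For illustration only, here is a minimal sketch (hypothetical, not the current KafkaRoller code, and assuming a fabric8 `KubernetesClient` is available) of the kind of pod status summary that could be attached to the readiness-failure log line:

```java
// Hypothetical sketch only -- not the actual KafkaRoller implementation.
// Summarises a pod's phase and conditions so the readiness-failure log line
// can say *why* the pod is not ready, not just that the wait timed out.
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.PodCondition;
import io.fabric8.kubernetes.client.KubernetesClient;

class PodStatusSummary {
    private final KubernetesClient client;

    PodStatusSummary(KubernetesClient client) {
        this.client = client;
    }

    String describe(String namespace, String podName) {
        Pod pod = client.pods().inNamespace(namespace).withName(podName).get();
        if (pod == null || pod.getStatus() == null) {
            return "Pod " + namespace + "/" + podName + " not found or has no status";
        }
        StringBuilder sb = new StringBuilder("phase=").append(pod.getStatus().getPhase());
        for (PodCondition c : pod.getStatus().getConditions()) {
            sb.append(", ").append(c.getType()).append('=').append(c.getStatus());
            if (c.getMessage() != null) {
                sb.append(" (").append(c.getMessage()).append(')');
            }
        }
        return sb.toString();
    }
}
```

Container statuses (waiting/terminated reasons) could be summarised in the same way, and the broker state from #5263 could be appended in the same place.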

tombentley · Jul 09 '21 12:07

Triaged on 26.5.2022: @LiamClarkeNZ Is this related to #6663? If not, do you still plan to work on it?

scholzj · May 26 '22 14:05

Yep can do @scholzj. Will have a look today.

LiamClarkeNZ · Jun 08 '22 21:06

Triaged on 21st June: We need to investigate what is logged currently and what the full log looks like, not just focus on one error message which comes from something that happens after the roll. So more investigation is needed to understand if this issue still matters.

scholzj · Jun 21 '22 14:06

This is an example of how it looks today (the roll was triggered by a change to the readiness probe configuration):

2022-07-07 19:58:37 INFO  OperatorWatcher:38 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Kafka my-cluster in namespace myproject was MODIFIED
2022-07-07 19:58:37 INFO  AbstractOperator:226 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Kafka my-cluster will be checked for creation or modification
2022-07-07 19:58:39 INFO  KafkaRoller:513 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Pod {my-cluster-kafka-0} needs to be restarted. Reason: [Pod has old revision]
2022-07-07 19:58:39 INFO  PodOperator:54 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Rolling pod my-cluster-kafka-0
2022-07-07 19:59:37 INFO  AbstractOperator:373 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Reconciliation is in progress
2022-07-07 20:00:37 INFO  AbstractOperator:373 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Reconciliation is in progress
2022-07-07 20:01:37 INFO  AbstractOperator:373 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Reconciliation is in progress
2022-07-07 20:02:37 INFO  AbstractOperator:373 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Reconciliation is in progress
2022-07-07 20:03:37 INFO  AbstractOperator:373 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Reconciliation is in progress
2022-07-07 20:03:41 INFO  KafkaRoller:289 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Could not restart pod {my-cluster-kafka-0}, giving up after 10 attempts. Total delay between attempts 127750ms
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Error while waiting for restarted pod my-cluster-kafka-0 to become ready
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$awaitReadiness$16(KafkaRoller.java:656) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:689) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.awaitReadiness(KafkaRoller.java:656) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:650) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:347) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:283) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:685) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	... 10 more
2022-07-07 20:03:41 ERROR Util:153 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Exceeded timeout of 300000ms while waiting for Pods resource my-cluster-kafka-0 in namespace myproject to be ready
2022-07-07 20:03:41 WARN  KafkaRoller:819 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Error waiting for pod myproject/my-cluster-kafka-0 to become ready: io.strimzi.operator.common.operator.resource.TimeoutException: Exceeded timeout of 300000ms while waiting for Pods resource my-cluster-kafka-0 in namespace myproject to be ready
2022-07-07 20:03:41 ERROR AbstractOperator:247 - Reconciliation #22(watch) Kafka(myproject/my-cluster): createOrUpdate failed
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Error while waiting for restarted pod my-cluster-kafka-0 to become ready
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$awaitReadiness$16(KafkaRoller.java:656) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:689) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.awaitReadiness(KafkaRoller.java:656) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:650) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:347) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:283) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:685) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	... 10 more
2022-07-07 20:03:41 INFO  CrdOperator:113 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Status of Kafka my-cluster in namespace myproject has been updated
2022-07-07 20:03:41 WARN  AbstractOperator:532 - Reconciliation #22(watch) Kafka(myproject/my-cluster): Failed to reconcile
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Error while waiting for restarted pod my-cluster-kafka-0 to become ready
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$awaitReadiness$16(KafkaRoller.java:656) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:689) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.awaitReadiness(KafkaRoller.java:656) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartAndAwaitReadiness(KafkaRoller.java:650) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:347) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:283) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:685) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	... 10 more

This is the following reconciliation, which does not restart the pod anymore and just times out waiting for readiness. But this reconciliation obviously does not know the restart reasons anymore:

2022-07-07 20:03:41 INFO  OperatorWatcher:38 - Reconciliation #39(watch) Kafka(myproject/my-cluster): Kafka my-cluster in namespace myproject was MODIFIED
2022-07-07 20:03:42 INFO  AbstractOperator:226 - Reconciliation #39(watch) Kafka(myproject/my-cluster): Kafka my-cluster will be checked for creation or modification
2022-07-07 20:03:44 INFO  KafkaRoller:572 - Reconciliation #39(watch) Kafka(myproject/my-cluster): Dynamic reconfiguration for broker 0 was successful.
2022-07-07 20:04:41 INFO  AbstractOperator:373 - Reconciliation #39(watch) Kafka(myproject/my-cluster): Reconciliation is in progress
2022-07-07 20:05:41 INFO  AbstractOperator:373 - Reconciliation #39(watch) Kafka(myproject/my-cluster): Reconciliation is in progress
2022-07-07 20:06:41 INFO  AbstractOperator:373 - Reconciliation #39(watch) Kafka(myproject/my-cluster): Reconciliation is in progress
2022-07-07 20:07:41 INFO  AbstractOperator:373 - Reconciliation #39(watch) Kafka(myproject/my-cluster): Reconciliation is in progress
2022-07-07 20:08:41 INFO  AbstractOperator:373 - Reconciliation #39(watch) Kafka(myproject/my-cluster): Reconciliation is in progress
2022-07-07 20:08:44 INFO  KafkaRoller:289 - Reconciliation #39(watch) Kafka(myproject/my-cluster): Could not restart pod {my-cluster-kafka-0}, giving up after 10 attempts. Total delay between attempts 127750ms
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Error while waiting for restarted pod my-cluster-kafka-0 to become ready
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$awaitReadiness$16(KafkaRoller.java:656) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:689) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.awaitReadiness(KafkaRoller.java:656) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:349) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:283) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:685) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	... 9 more
2022-07-07 20:08:44 ERROR AbstractOperator:247 - Reconciliation #39(watch) Kafka(myproject/my-cluster): createOrUpdate failed
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Error while waiting for restarted pod my-cluster-kafka-0 to become ready
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$awaitReadiness$16(KafkaRoller.java:656) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:689) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.awaitReadiness(KafkaRoller.java:656) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:349) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:283) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:685) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	... 9 more
2022-07-07 20:08:44 ERROR Util:153 - Reconciliation #39(watch) Kafka(myproject/my-cluster): Exceeded timeout of 300000ms while waiting for Pods resource my-cluster-kafka-0 in namespace myproject to be ready
2022-07-07 20:08:44 WARN  KafkaRoller:819 - Reconciliation #39(watch) Kafka(myproject/my-cluster): Error waiting for pod myproject/my-cluster-kafka-0 to become ready: io.strimzi.operator.common.operator.resource.TimeoutException: Exceeded timeout of 300000ms while waiting for Pods resource my-cluster-kafka-0 in namespace myproject to be ready
2022-07-07 20:08:44 WARN  AbstractOperator:532 - Reconciliation #39(watch) Kafka(myproject/my-cluster): Failed to reconcile
io.strimzi.operator.cluster.operator.resource.KafkaRoller$FatalProblem: Error while waiting for restarted pod my-cluster-kafka-0 to become ready
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$awaitReadiness$16(KafkaRoller.java:656) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:689) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.awaitReadiness(KafkaRoller.java:656) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:349) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$6(KafkaRoller.java:283) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.util.concurrent.TimeoutException
	at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1886) ~[?:?]
	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2021) ~[?:?]
	at io.strimzi.operator.cluster.operator.resource.KafkaRoller.await(KafkaRoller.java:685) ~[io.strimzi.cluster-operator-0.30.0-SNAPSHOT.jar:0.30.0-SNAPSHOT]
	... 9 more

The reason for the restart is logged when the pod is deleted. Maybe it could be carried over to the waiting for readiness; that would need to be investigated in the code (a rough sketch of the idea follows below). In any case, the follow-up reconciliation will not have it anymore - only the first one might. Personally, I'm not sure I see this as something that needs to be improved, but I definitely don't oppose any improvements if someone wants them.
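For example (a hypothetical sketch only, not how KafkaRoller is actually structured), the restart reasons known at deletion time could be passed into the readiness wait, so the eventual timeout exception carries them:

```java
// Hypothetical sketch only -- illustrates carrying the restart reasons from the
// deletion step into the readiness wait, so the eventual failure log mentions
// both why the pod was rolled and that it then never became ready.
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class RollAndAwait {
    void restartAndAwaitReadiness(String podName,
                                  List<String> restartReasons,
                                  CompletableFuture<Void> readiness,
                                  long timeoutMs) throws Exception {
        // ... pod deletion happens here; restartReasons were logged at that point ...
        try {
            readiness.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // A single exception message now explains the whole story.
            throw new RuntimeException("Error while waiting for restarted pod " + podName
                    + " to become ready. Restart reasons: " + restartReasons, e);
        }
    }
}
```

This still only helps the reconciliation that performed the restart; as noted above, a follow-up reconciliation would not have the reasons available anymore.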

@tombentley WDYT based on the full log? Do you think the issue should be kept?

scholzj · Jul 07 '22 20:07

Triaged on 21.7.2022: Should be kept. @tombentley wants to improve it, but at this point it is not completely clear how. @ShubhamRwt might work on this.

scholzj · Jul 21 '22 15:07

+1

taylorsmithgg · Oct 28 '22 23:10