strimzi-kafka-operator
Add shutdown hook to user and topic operators.
This change is required to ensure that Verticle.stop() is called during a clean shutdown, so that watches can be closed and any inflight reconciliation has a chance to complete.
I'm also moving the hook to the operator-common module as it is a shared need.
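For illustration, here is a rough sketch of the kind of hook I mean; the class name, logging, and the timeout value are placeholders, not the final operator-common API:

import io.vertx.core.Vertx;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch only: a JVM shutdown hook that closes Vertx, which undeploys the
// verticles and therefore invokes Verticle.stop() during a clean shutdown.
public class ShutdownHookSketch {
    public static void register(Vertx vertx, long timeoutMs) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            CountDownLatch latch = new CountDownLatch(1);
            vertx.close(ar -> {
                if (ar.failed()) {
                    System.err.println("Failure in stopping Vertx: " + ar.cause());
                }
                latch.countDown();
            });
            try {
                // Bounded wait, so a stop() that never completes cannot block the JVM forever
                if (!latch.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                    System.err.println("Timed out waiting for Vertx to close");
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }));
    }
}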
Can one of the admins verify this patch?
The UO just closes its watch, but the TO waits some time for inflight reconciliations. The problem is that this logic is never executed in case of clean shutdown. We should have something similar in the UO.
> The UO just closes its watch, but the TO waits some time for inflight reconciliations. The problem is that this logic is never executed in case of clean shutdown. We should have something similar in the UO.
Well, you seem to be adding it to all operators but you say that only the Topic Operator needs it. That seems weird.
> Well, you seem to be adding it to all operators but you say that only the Topic Operator needs it. That seems weird.
What's the point of putting any logic in Verticle.stop() if it is never called?
The CO already has it, the TO needs it. It would be good if the CO and UO could handle inflight reconciliations in a similar way.
I don't know how the TO handles this. But in general, you have limited time to shut down, so there is no time to wait for a reconciliation to finish. The operator is stateless -> it should be able to stop and recover in another instance at any time.
I think the terminationGracePeriodSeconds is 30 by default, which gives some chance to complete an inflight reconciliation. Of course the user can increase or decrease it as needed.
/azp run regression
Azure Pipelines successfully started running 1 pipeline(s).
PS: The shutdown hook does not seem to work well when the verticles don't start. It always waits for the 2-minute timeout. Maybe it should be set up only after the startup?
Thanks. I'll look into that. Do you get the "Failure in stopping Vertx" error message with the stack trace? Can you post it here?
No, I get only
2022-08-08 08:31:00 INFO ShutdownHook:27 - Shutting down
And a lot of waiting. I did not measure it, but it could be the ~2 minutes. I guess the last message might easily be lost without a proper logging stack, so I'm not sure about that.
PS: The way I ran into it (while testing something else):
- Deploy the operator as usual
- Edit the deployment and configure it to watch two namespaces
- In my case one was the original one with all RBAC present and the other was just a random namespace without any RBAC rights
- => this should send the operator into a crash loop because of the missing RBAC
@scholzj I think that the delay is due to ClusterOperator.stop(), which never completes or is very slow under these circumstances. As a consequence, this consumes the whole hard-coded ShutdownHook timeout (2 minutes) during a clean shutdown. Let me explain.
This is the thread which is blocking System.exit():
"vert.x-eventloop-thread-0" #24 prio=5 os_prio=0 cpu=62.20ms elapsed=113.63s tid=0x000055c33c3e6000 nid=0x28 in Object.wait() [0x00007f53b0394000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(java.base@…/Native Method)
- waiting on <0x00000000fd6edbc0> (a java.lang.Thread)
at java.lang.Thread.join(java.base@…/Thread.java:1300)
- waiting to re-lock in wait() <0x00000000fd6edbc0> (a java.lang.Thread)
at java.lang.Thread.join(java.base@…/Thread.java:1375)
at java.lang.ApplicationShutdownHooks.runHooks(java.base@…/ApplicationShutdownHooks.java:107)
at java.lang.ApplicationShutdownHooks$1.run(java.base@…/ApplicationShutdownHooks.java:46)
at java.lang.Shutdown.runHooks(java.base@…/Shutdown.java:130)
at java.lang.Shutdown.exit(java.base@…/Shutdown.java:174)
- locked <0x00000000fd275ea0> (a java.lang.Class for java.lang.Shutdown)
at java.lang.Runtime.exit(java.base@…/Runtime.java:116)
at java.lang.System.exit(java.base@…/System.java:1752)
at io.strimzi.operator.cluster.Main.lambda$run$3(Main.java:168)
at io.strimzi.operator.cluster.Main$$Lambda$290/0x0000000840406c40.handle(Unknown Source)
at io.vertx.core.impl.future.FutureImpl$3.onFailure(FutureImpl.java:153)
at io.vertx.core.impl.future.FutureBase.emitFailure(FutureBase.java:75)
at io.vertx.core.impl.future.FutureImpl.tryFail(FutureImpl.java:230)
at io.vertx.core.impl.future.Mapping.onFailure(Mapping.java:45)
at io.vertx.core.impl.future.FutureBase.emitFailure(FutureBase.java:75)
at io.vertx.core.impl.future.FutureImpl.tryFail(FutureImpl.java:230)
at io.vertx.core.impl.future.PromiseImpl.tryFail(PromiseImpl.java:23)
at io.vertx.core.impl.future.PromiseImpl.onFailure(PromiseImpl.java:54)
at io.vertx.core.impl.future.PromiseImpl.handle(PromiseImpl.java:43)
at io.vertx.core.impl.future.PromiseImpl.handle(PromiseImpl.java:23)
at io.vertx.core.impl.DeploymentManager.lambda$reportResult$2(DeploymentManager.java:136)
at io.vertx.core.impl.DeploymentManager$$Lambda$311/0x000000084041c840.handle(Unknown Source)
at io.vertx.core.impl.AbstractContext.dispatch(AbstractContext.java:100)
at io.vertx.core.impl.AbstractContext.dispatch(AbstractContext.java:63)
at io.vertx.core.impl.EventLoopContext.lambda$runOnContext$0(EventLoopContext.java:38)
at io.vertx.core.impl.EventLoopContext$$Lambda$294/0x0000000840418840.run(Unknown Source)
at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:503)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:995)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(java.base@…/Thread.java:829)
We know that the cluster operator verticle is failing to start. This triggers Vertx DeploymentManager.rollback(), which ends up calling DeploymentManager.reportResult() (referenced in the stack above). If you look at this method, you'll see that the DeploymentManager.DeploymentImpl.doUndeploy() handler is called, which calls Verticle.stop().
The issue: if a Verticle.stop() override fails to complete the future, the stop will never complete. This is also noted in the DeploymentManager.undeployAll() method, which is called when you do Vertx.close(). This is exactly why I added the 2-minute timeout in the ShutdownHook. I made the timeout configurable in the last commit, so that we can specify a different value for each operator.
Does it make sense?
I don't think the configurability solves it. What is the right timeout for something that never times out? 0? Why not set up the hook only after a successful start, as I suggested? That would at least eliminate these issues when it fails at startup.
> I don't think the configurability solves it.
I didn't say that configurability solves the issue; it is just an enhancement.
> Why not set up the hook only after a successful start, as I suggested? That would at least eliminate these issues when it fails at startup.
Yes, but not on a clean shutdown where, for whatever reason (a bug), the stop does not complete.
> What is the right timeout for something that never times out? 0?
My point is that we should fix the bug.
> My point is that we should fix the bug.
Sorry, what do you mean by "the bug"? You mean something in Vert.x?
So Vertx should time out if the verticle stop takes too long, but we should guarantee to always call stop.complete() in ClusterOperator.stop(). We can simply comment out the stop logic to see if my theory holds. Let me do that.
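For clarity, the shape I have in mind for the stop side is something like this (a sketch with placeholder names, not the actual ClusterOperator code):

import io.vertx.core.AbstractVerticle;
import io.vertx.core.Promise;

// Sketch only: always complete the stop promise, so the undeploy triggered by
// vertx.close() cannot hang on this verticle even if the cleanup logic fails.
public class ExampleOperatorVerticle extends AbstractVerticle {
    @Override
    public void stop(Promise<Void> stop) {
        try {
            closeWatches(); // placeholder for closing watches / waiting for inflight reconciliations
        } catch (Exception e) {
            // log and fall through; never leave the promise pending
        } finally {
            stop.complete();
        }
    }

    private void closeWatches() {
        // placeholder for the real cleanup logic
    }
}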
I did a quick test commenting out all CO stop logic except stop.complete(), and now I don't have the following blocked thread:
"SIGTERM handler" #28 daemon prio=9 os_prio=0 cpu=0.71ms elapsed=29.91s tid=0x00007f53cd163000 nid=0x3c waiting for monitor entry [0x00007f53b0d9e000]
java.lang.Thread.State: BLOCKED (on object monitor)
at java.lang.Shutdown.exit(java.base@…/Shutdown.java:173)
- waiting to lock <0x00000000fd275ea0> (a java.lang.Class for java.lang.Shutdown)
at java.lang.Terminator$1.handle(java.base@…/Terminator.java:51)
at jdk.internal.misc.Signal$1.run(java.base@…/Signal.java:220)
at java.lang.Thread.run(java.base@…/Thread.java:829)
@scholzj the bug could be in Vertx.close(), because CO.stop() is never called when the CO verticle fails at startup. One ugly workaround to make it fail fast is to replace System.exit(1) with Runtime.getRuntime().halt(1), so that we do not trigger Vertx.close() in that specific case. My investigation on why Vertx.close() hangs continues...
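Concretely, the fail-fast workaround would be something along these lines (sketch only; the surrounding Main code is simplified and the method name is made up):

import io.vertx.core.Verticle;
import io.vertx.core.Vertx;

// Sketch only: on a failed deployment, halt() terminates the JVM immediately
// without running shutdown hooks, so a hanging Vertx.close() cannot add delay.
public class FailFastSketch {
    public static void deployOrHalt(Vertx vertx, Verticle operator) {
        vertx.deployVerticle(operator)
            .onSuccess(id -> System.out.println("Operator verticle deployed: " + id))
            .onFailure(cause -> {
                cause.printStackTrace();
                // System.exit(1) would run the shutdown hook and trigger vertx.close();
                // halt(1) skips the hooks and exits at once.
                Runtime.getRuntime().halt(1);
            });
    }
}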
Well, we can just remove the shutdown hook and keep using the current way of stopping the operator if we want to skip closing the Vertx. I also still don't understand why we cannot just configure the shutdown hook after all verticles are started, which would bypass at least some of these cases.
@scholzj the issue should be fixed now.
It is actually due to a Vertx bug, where close() hangs when it is called with no Verticle deployed. This is exactly the state we have when the CO fails to start because of the RBAC issue.
I also added a try-catch inside the StrimziPodSetController constructor, so that it is clear where the CO is failing without having to look at the stack trace. Let me know if you are good with that.
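The constructor change is along these lines (a sketch with placeholder names, not the actual StrimziPodSetController code):

// Sketch only: wrap the constructor work that touches the Kubernetes API so a
// failure (e.g. missing RBAC in a watched namespace) is reported with a clear
// message instead of only a deep stack trace from the failed deployment.
public class ExampleController {
    public ExampleController() {
        try {
            initInformers();
        } catch (Exception e) {
            throw new RuntimeException("Failed to create the controller (check RBAC for the watched namespaces)", e);
        }
    }

    private void initInformers() {
        // placeholder for the real informer / watch setup
    }
}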
/azp run regression
Azure Pipelines successfully started running 1 pipeline(s).