quarkus-operator-sdk icon indicating copy to clipboard operation
quarkus-operator-sdk copied to clipboard

5.0.0 - OperatorHealthCheck race condition?

Open bhellema opened this issue 2 years ago • 8 comments

Since upgrading to 5.0.0 we're noticing different behavior when it comes to starting up our operator. Sometimes we see the following health check status in our logs indicating that Quarkus Operator SDK health check is DOWN.

{"time":"2023-01-19T06:03:38.017Z","class":"io.smallrye.health","level":"INFO","message":"SRHCK01001: Reporting health down status: {\"status\":\"DOWN\",\"checks\":[{\"name\":\"Quarkus Operator SDK health check\",\"status\":\"DOWN\"},{\"name\":\"K8s connected to kubernetes api version v1\",\"status\":\"UP\"},{\"name\":\"Quarkus Operator SDK health check\",\"status\":\"UP\",\"data\":{\"myfirstcontroller\":\"OK\",\"mysecondcontroller\":\"OK\"}},{\"name\":\"K8s connected to kubernetes api version v1\",\"status\":\"UP\"}]}"}

Could there be a race condition here on this line?

@metacosm any chance you have some insight here wrt to this potentially being a race condition? I can have mixed results when starting our operator up.

bhellema avatar Jan 19 '23 21:01 bhellema

@bhellema how often you see this ? is it resolvedd shortly after?

thinking if this might be an issue in JOSDK:

https://github.com/java-operator-sdk/java-operator-sdk/blob/9d274534f8a222c931b05dc05490b744a9ea6ea4/operator-framework-core/src/main/java/io/javaoperatorsdk/operator/health/InformerHealthIndicator.java#L11-L14

So, the ... && hasSynced() && isWatching() is also part of the expression. So hasSynced might be problem on startup. But isWatching could be problem any time, since watch might be down.

However this coin has two sides, if the informer is running but not watching it is a problem. But only if it happens for a longer time period. Since it might resemble a real problem with the watch (like lost permissions to watch a resource). I will add log messages for the next patch release, so we can verify if this is the problem.

csviri avatar Jan 26 '23 12:01 csviri

see: https://github.com/java-operator-sdk/java-operator-sdk/pull/1737

csviri avatar Jan 26 '23 13:01 csviri

@csviri the issue showed up pretty frequently and wouldn't resolve.

Is there a release with this debug logging that I can try?

bhellema avatar Jan 28 '23 16:01 bhellema

@bhellema just released it: https://github.com/java-operator-sdk/java-operator-sdk/releases/tag/v4.2.4

see logging: https://github.com/java-operator-sdk/java-operator-sdk/pull/1737/files

csviri avatar Jan 30 '23 13:01 csviri

The corresponding Quarkus extension release is 5.0.4. Please let us know if that addresses your issue.

metacosm avatar Feb 08 '23 18:02 metacosm

I believe this has been addressed in the 5.0.4 version so closing this issue for now. Please re-open if you're still encounter the problem.

metacosm avatar Feb 20 '23 10:02 metacosm

@metacosm, sorry for the very long delay in getting back to this, I finally had a bit of time to try to upgrade things. I'm still having issues using the latest quarkus-operator-sdk 6.0.1.

It looks like we're having issues getting the Kubernetes Client injected into the Health check for some reason after having upgraded. I'm still investigating.

{"time":"2023-05-10T17:28:52.489Z","class":"io.smallrye.health","level":"ERROR","message":"SRHCK01000: Error processing Health Checks [Error Occurred After Shutdown]","stackTrace":"java.lang.RuntimeException: Error injecting io.fabric8.kubernetes.client.KubernetesClient com.adobe.granite.contentbackflow.operator.probes.ContentBackflowHealthCheck.kubernetesClient\n\tat com.adobe.granite.contentbackflow.operator.probes.ContentBackflowHealthCheck_Bean.doCreate(Unknown Source)\n\tat com.adobe.granite.contentbackflow.operator.probes.ContentBackflowHealthCheck_Bean.create(Unknown Source)\n\tat com.adobe.granite.contentbackflow.operator.probes.ContentBackflowHealthCheck_Bean.create(Unknown Source)\n\tat io.quarkus.arc.impl.AbstractSharedContext.createInstanceHandle(AbstractSharedContext.java:113)\n\tat io.quarkus.arc.impl.AbstractSharedContext$1.get(AbstractSharedContext.java:37)\n\tat io.quarkus.arc.impl.AbstractSharedContext$1.get(AbstractSharedContext.java:34)\n\tat io.quarkus.arc.impl.LazyValue.get(LazyValue.java:26)\n\tat io.quarkus.arc.impl.ComputingCache.computeIfAbsent(ComputingCache.java:69)\n\tat io.quarkus.arc.impl.AbstractSharedContext.get(AbstractSharedContext.java:34)\n\tat io.quarkus.arc.impl.ClientProxies.getApplicationScopedDelegate(ClientProxies.java:21)\n\tat com.adobe.granite.contentbackflow.operator.probes.ContentBackflowHealthCheck_ClientProxy.arc$delegate(Unknown Source)\n\tat com.adobe.granite.contentbackflow.operator.probes.ContentBackflowHealthCheck_ClientProxy.call(Unknown Source)\n\tat io.smallrye.context.impl.wrappers.SlowContextualSupplier.get(SlowContextualSupplier.java:21)\n\tat io.smallrye.mutiny.operators.uni.builders.UniCreateFromItemSupplier.subscribe(UniCreateFromItemSupplier.java:28)\n\tat io.smallrye.mutiny.operators.AbstractUni.subscribe(AbstractUni.java:36)\n\tat io.smallrye.mutiny.operators.uni.UniOnFailureFlatMap.subscribe(UniOnFailureFlatMap.java:31)\n\tat io.smallrye.mutiny.operators.AbstractUni.subscribe(AbstractUni.java:36)\n\tat io.smallrye.mutiny.operators.uni.UniOnItemTransform.subscribe(UniOnItemTransform.java:22)\n\tat io.smallrye.mutiny.operators.AbstractUni.subscribe(AbstractUni.java:36)\n\tat io.smallrye.mutiny.operators.uni.UniAndCombination$UniHandler.subscribe(UniAndCombination.java:233)\n\tat io.smallrye.mutiny.operators.uni.UniAndCombination$AndSupervisor.run(UniAndCombination.java:87)\n\tat io.smallrye.mutiny.operators.uni.UniAndCombination.subscribe(UniAndCombination.java:53)\n\tat io.smallrye.mutiny.operators.AbstractUni.subscribe(AbstractUni.java:36)\n\tat io.smallrye.mutiny.operators.uni.UniBlockingAwait.await(UniBlockingAwait.java:60)\n\tat io.smallrye.mutiny.groups.UniAwait.atMost(UniAwait.java:65)\n\tat io.smallrye.health.SmallRyeHealthReporter.getReadiness(SmallRyeHealthReporter.java:232)\n\tat io.smallrye.health.SmallRyeHealthReporter_ClientProxy.getReadiness(Unknown Source)\n\tat io.quarkus.smallrye.health.runtime.SmallRyeReadinessHandler.getHealth(SmallRyeReadinessHandler.java:11)\n\tat io.quarkus.smallrye.health.runtime.SmallRyeHealthHandlerBase.doHandle(SmallRyeHealthHandlerBase.java:44)\n\tat io.quarkus.smallrye.health.runtime.SmallRyeHealthHandlerBase.handle(SmallRyeHealthHandlerBase.java:31)\n\tat io.quarkus.smallrye.health.runtime.SmallRyeReadinessHandler.handle(SmallRyeReadinessHandler.java:7)\n\tat io.quarkus.smallrye.health.runtime.SmallRyeHealthHandlerBase.handle(SmallRyeHealthHandlerBase.java:19)\n\tat io.vertx.ext.web.impl.BlockingHandlerDecorator.lambda$handle$0(BlockingHandlerDecorator.java:48)\n\tat io.vertx.core.impl.ContextBase.lambda$null$0(ContextBase.java:137)\n\tat io.vertx.core.impl.ContextInternal.dispatch(ContextInternal.java:264)\n\tat io.vertx.core.impl.ContextBase.lambda$executeBlocking$1(ContextBase.java:135)\n\tat io.quarkus.vertx.core.runtime.VertxCoreRecorder$14.runWith(VertxCoreRecorder.java:576)\n\tat org.jboss.threads.EnhancedQueueExecutor$Task.run(EnhancedQueueExecutor.java:2513)\n\tat org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1538)\n\tat org.jboss.threads.DelegatingRunnable.run(DelegatingRunnable.java:29)\n\tat org.jboss.threads.ThreadLocalResettingRunnable.run(ThreadLocalResettingRunnable.java:29)\n\tat io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat java.base/java.lang.Thread.run(Thread.java:834)\nCaused by: java.lang.NullPointerException\n\tat io.quarkus.it.openshift.client.runtime.OpenShiftClientProducer_ProducerMethod_openShiftClient_ccf49c646fbe836d0e8b8406400d0ad276cd5438_Bean.get(Unknown Source)\n\tat io.quarkus.it.openshift.client.runtime.OpenShiftClientProducer_ProducerMethod_openShiftClient_ccf49c646fbe836d0e8b8406400d0ad276cd5438_Bean.get(Unknown Source)\n\t... 43 more\n","errorType":"java.lang.RuntimeException","errorMessage":"Error injecting io.fabric8.kubernetes.client.KubernetesClient com.adobe.granite.contentbackflow.operator.probes.ContentBackflowHealthCheck.kubernetesClient"}

bhellema avatar May 10 '23 17:05 bhellema