
Healthcheck error after provisioning on OpenShift 4.10

mgohashi opened this issue on May 05 '22 • 5 comments

Hello guys,

I've provisioned an instance, but it never starts. The log details [2] are not very clear, but my impression is that, according to this line [1], whenever the cluster runs a readiness probe against the pod, the component checks the health of its dependencies through the external Route. However, the Route will never be available until the health check returns 200, so the two appear to block each other.

[1] https://github.com/cryostatio/cryostat/blob/ed9ff7e2d13da4d6c1d51a3325098e4169845295/src/main/java/io/cryostat/net/web/http/generic/HealthGetHandler.java#L120
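
As a rough illustration of the pattern (this is only a sketch based on the log below, not the handler's actual code, and the hostname is taken from my cluster), the same kind of request can be issued by hand:

# Sketch of the dependency check: ask the Grafana Route for its health with
# the same 5s timeout seen in the stack trace. While the Route is not yet
# serving, this hangs and then fails, so /api/health never returns 200 and
# the readiness probe keeps failing.
curl -sk --max-time 5 \
  "https://cryostat-sample-grafana-bookinfo.apps.cluster-dfkdw.dfkdw.sandbox1648.opentlc.com/api/health"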

[2]

WARNING: Exception thrown
java.io.IOException: io.vertx.core.http.impl.NoStackTraceTimeoutException: The timeout period of 5000ms has been exceeded while executing GET /api/health for server cryostat-sample-grafana-bookinfo.apps.cluster-dfkdw.dfkdw.sandbox1648.opentlc.com:443
at io.cryostat.net.web.http.generic.HealthGetHandler.lambda$checkUri$0(HealthGetHandler.java:156)
at io.vertx.ext.web.client.impl.HttpContext.handleFailure(HttpContext.java:309)
at io.vertx.ext.web.client.impl.HttpContext.execute(HttpContext.java:303)
at io.vertx.ext.web.client.impl.HttpContext.next(HttpContext.java:275)
at io.vertx.ext.web.client.impl.predicate.PredicateInterceptor.handle(PredicateInterceptor.java:70)
at io.vertx.ext.web.client.impl.predicate.PredicateInterceptor.handle(PredicateInterceptor.java:32)
at io.vertx.ext.web.client.impl.HttpContext.next(HttpContext.java:272)
at io.vertx.ext.web.client.impl.HttpContext.fire(HttpContext.java:282)
at io.vertx.ext.web.client.impl.HttpContext.fail(HttpContext.java:262)
at io.vertx.ext.web.client.impl.HttpContext.lambda$handleSendRequest$7(HttpContext.java:422)
at io.vertx.core.impl.FutureImpl.tryFail(FutureImpl.java:195)
at io.vertx.ext.web.client.impl.HttpContext.lambda$handleSendRequest$15(HttpContext.java:518)
at io.vertx.core.http.impl.HttpClientRequestBase.handleException(HttpClientRequestBase.java:133)
at io.vertx.core.http.impl.HttpClientRequestImpl.handleException(HttpClientRequestImpl.java:371)
at io.vertx.core.http.impl.Http1xClientConnection$StreamImpl.handleException(Http1xClientConnection.java:525)
at io.vertx.core.http.impl.Http1xClientConnection$StreamImpl.reset(Http1xClientConnection.java:377)
at io.vertx.core.http.impl.HttpClientRequestImpl.reset(HttpClientRequestImpl.java:294)
at io.vertx.core.http.impl.HttpClientRequestBase.handleTimeout(HttpClientRequestBase.java:195)
at io.vertx.core.http.impl.HttpClientRequestBase.lambda$setTimeout$0(HttpClientRequestBase.java:118)
at io.vertx.core.impl.VertxImpl$InternalTimerHandler.handle(VertxImpl.java:942)
at io.vertx.core.impl.VertxImpl$InternalTimerHandler.handle(VertxImpl.java:906)
at io.vertx.core.impl.ContextImpl.executeTask(ContextImpl.java:366)
at io.vertx.core.impl.EventLoopContext.execute(EventLoopContext.java:43)
at io.vertx.core.impl.ContextImpl.executeFromIO(ContextImpl.java:229)
at io.vertx.core.impl.ContextImpl.executeFromIO(ContextImpl.java:221)
at io.vertx.core.impl.VertxImpl$InternalTimerHandler.run(VertxImpl.java:932)
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.vertx.core.http.impl.NoStackTraceTimeoutException: The timeout period of 5000ms has been exceeded while executing GET /api/health for server cryostat-sample-grafana-bookinfo.apps.cluster:443

mgohashi • May 05 '22 18:05

Hi @mgohashi, thanks for the report. I think this might be a better fit for the cryostat-operator issue tracker, but we can keep it here for now until we determine the root cause.

The Operator should be deploying the Cryostat containers/pods and pointing those environment variables you've (correctly) identified at them. I think the Operator should be using the Service cluster-internal URL for that and not the externally routable Route URL, but maybe I'm wrong about that.
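
If it helps to narrow this down, something like the following should show which URLs the health check is being pointed at (the deployment name here is only an example for a default cryostat-sample instance):

# Hypothetical inspection: list the Grafana-related environment variables on
# the Cryostat deployment to see whether they carry the cluster-internal
# Service URL or the external Route URL.
oc set env deploy/cryostat-sample --list | grep -i grafana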

@ebaron do you have any insight on this? Has any logic about the Service/Route changed lately? Or readiness/liveness probes on the various containers?

andrewazores • May 05 '22 18:05

Hi @mgohashi, in Cryostat 2.0 the health check is indeed using the Route URL. With the upcoming 2.1 release, this will be done using a host alias to the loopback address. I'm not sure why the health check is failing using the Route in your case, but at least in 2.1 this should be simplified with the health check traffic not leaving the pod.
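
Once 2.1 is installed, one way to confirm the new behaviour would be to check the pod spec for that alias (resource names below are only examples for a cryostat-sample instance):

# Hypothetical verification: in 2.1 the pod spec should carry a hostAliases
# entry that resolves the Grafana hostname to the loopback address, so the
# health check traffic stays inside the pod.
oc get pod -l app=cryostat-sample -o jsonpath='{.items[0].spec.hostAliases}'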

We expect 2.1 to be available within the next couple weeks.

ebaron • May 05 '22 20:05

^ Fixed by https://github.com/cryostatio/cryostat-operator/pull/352

andrewazores • May 05 '22 21:05

Will leave this open until 2.1 is out and @mgohashi can verify the fix works. Thanks!

andrewazores • May 05 '22 21:05

@mgohashi Cryostat 2.1 is out and should be available from OperatorHub on your cluster. Please test it out and let us know the result. If you still have 2.0 installed you can upgrade, but you will need to select the "stable" update channel (not "stable-2.0"), and there is a manual upgrade step required:

# Switch to the project where Cryostat is installed
oc project <cryostat_project>
# Collect the names of the Cryostat instances in the project
cryostats=$(oc get cryostat --template \
  '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')
# Delete each instance's Service and Deployment so the upgraded Operator
# recreates them
for cryostat in ${cryostats}; do
  oc delete svc,deploy -lapp="${cryostat}"
done
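
After that, the Operator should recreate the deleted resources. As a quick sanity check (the label value is only an example), you can watch the new pod come up and report Ready:

# Hypothetical follow-up: watch the recreated pod until it is Ready.
oc get pods -l app=cryostat-sample -w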

andrewazores • May 17 '22 15:05

Closing; there was no follow-up from the reporter, but we believe this is solved.

andrewazores • Apr 25 '23 19:04