cryostat-legacy
Healthcheck error after provision on Openshift 4.10
Hello guys,
I've provisioned an instance, but it never starts. The log details [2] are not very clear, but I have the feeling that, according to this line [1], whenever the cluster performs a health check on the pod to check its readiness, the component tries to check the health of its dependencies using the Route. But the Route will never become available until the health check returns 200 (see the diagnostic sketch after the stack trace).
[1] https://github.com/cryostatio/cryostat/blob/ed9ff7e2d13da4d6c1d51a3325098e4169845295/src/main/java/io/cryostat/net/web/http/generic/HealthGetHandler.java#L120
[2]
WARNING: Exception thrown
java.io.IOException: io.vertx.core.http.impl.NoStackTraceTimeoutException: The timeout period of 5000ms has been exceeded while executing GET /api/health for server cryostat-sample-grafana-bookinfo.apps.cluster-dfkdw.dfkdw.sandbox1648.opentlc.com:443
at io.cryostat.net.web.http.generic.HealthGetHandler.lambda$checkUri$0(HealthGetHandler.java:156)
at io.vertx.ext.web.client.impl.HttpContext.handleFailure(HttpContext.java:309)
at io.vertx.ext.web.client.impl.HttpContext.execute(HttpContext.java:303)
at io.vertx.ext.web.client.impl.HttpContext.next(HttpContext.java:275)
at io.vertx.ext.web.client.impl.predicate.PredicateInterceptor.handle(PredicateInterceptor.java:70)
at io.vertx.ext.web.client.impl.predicate.PredicateInterceptor.handle(PredicateInterceptor.java:32)
at io.vertx.ext.web.client.impl.HttpContext.next(HttpContext.java:272)
at io.vertx.ext.web.client.impl.HttpContext.fire(HttpContext.java:282)
at io.vertx.ext.web.client.impl.HttpContext.fail(HttpContext.java:262)
at io.vertx.ext.web.client.impl.HttpContext.lambda$handleSendRequest$7(HttpContext.java:422)
at io.vertx.core.impl.FutureImpl.tryFail(FutureImpl.java:195)
at io.vertx.ext.web.client.impl.HttpContext.lambda$handleSendRequest$15(HttpContext.java:518)
at io.vertx.core.http.impl.HttpClientRequestBase.handleException(HttpClientRequestBase.java:133)
at io.vertx.core.http.impl.HttpClientRequestImpl.handleException(HttpClientRequestImpl.java:371)
at io.vertx.core.http.impl.Http1xClientConnection$StreamImpl.handleException(Http1xClientConnection.java:525)
at io.vertx.core.http.impl.Http1xClientConnection$StreamImpl.reset(Http1xClientConnection.java:377)
at io.vertx.core.http.impl.HttpClientRequestImpl.reset(HttpClientRequestImpl.java:294)
at io.vertx.core.http.impl.HttpClientRequestBase.handleTimeout(HttpClientRequestBase.java:195)
at io.vertx.core.http.impl.HttpClientRequestBase.lambda$setTimeout$0(HttpClientRequestBase.java:118)
at io.vertx.core.impl.VertxImpl$InternalTimerHandler.handle(VertxImpl.java:942)
at io.vertx.core.impl.VertxImpl$InternalTimerHandler.handle(VertxImpl.java:906)
at io.vertx.core.impl.ContextImpl.executeTask(ContextImpl.java:366)
at io.vertx.core.impl.EventLoopContext.execute(EventLoopContext.java:43)
at io.vertx.core.impl.ContextImpl.executeFromIO(ContextImpl.java:229)
at io.vertx.core.impl.ContextImpl.executeFromIO(ContextImpl.java:221)
at io.vertx.core.impl.VertxImpl$InternalTimerHandler.run(VertxImpl.java:932)
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: io.vertx.core.http.impl.NoStackTraceTimeoutException: The timeout period of 5000ms has been exceeded while executing GET /api/health for server cryostat-sample-grafana-bookinfo.apps.cluster:443
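To make the suspected loop concrete, here is a rough diagnostic sketch (the CR name "cryostat-sample" and the namespace "bookinfo" are inferred from the Route hostname in the log, and the app label matches the one the operator applies; adjust to your environment):

# Is the pod stuck NotReady because the readiness probe keeps failing?
oc get pods -n bookinfo -l app=cryostat-sample
oc describe pod -n bookinfo -l app=cryostat-sample | grep -A3 Readiness

# Does the Service behind the Route have any ready endpoints yet?
# If not, a dependency check that goes out through the Route cannot succeed.
oc get endpoints -n bookinfo -l app=cryostat-sample

If the endpoints list stays empty while the probe keeps failing, that supports the theory that the health check depends on the Route while the Route depends on the pod becoming Ready.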
Hi @mgohashi , thanks for the report. I think this might be a better fit in the -operator Issues tracker, but we can keep it here for now until we determine the root cause.
The Operator should be deploying the Cryostat containers/pods and pointing those environment variables you've (correctly) identified at them. I think the Operator should be using the Service cluster-internal URL for that and not the externally routable Route URL, but maybe I'm wrong about that.
@ebaron do you have any insight on this? Has any logic about the Service/Route changed lately? Or readiness/liveness probes on the various containers?
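One way to see which URL the running Deployment is actually pointed at (a sketch, not an official diagnostic; it assumes the Deployment is named after the Cryostat CR, e.g. "cryostat-sample", and that the Grafana URLs are passed as environment variables):

oc set env deployment/cryostat-sample --list | grep -i grafana

If the values printed there are the external Route hostnames rather than cluster-internal Service URLs, that would explain why the health check has to leave the cluster and come back in through the router.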
Hi @mgohashi, in Cryostat 2.0 the health check is indeed using the Route URL. With the upcoming 2.1 release, this will be done using a host alias to the loopback address. I'm not sure why the health check is failing using the Route in your case, but at least in 2.1 this should be simplified with the health check traffic not leaving the pod.
We expect 2.1 to be available within the next couple weeks.
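For anyone who wants to confirm the new behaviour after upgrading, the host alias should be visible on the pod spec. A quick check (the label selector mirrors the one used in the upgrade snippet below; the CR name is assumed):

oc get pod -l app=cryostat-sample \
  -o jsonpath='{.items[0].spec.hostAliases}{"\n"}'

An entry mapping the Grafana hostname to the loopback address (127.0.0.1) would indicate that the health check traffic now stays inside the pod, as described above.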
^ Fixed by https://github.com/cryostatio/cryostat-operator/pull/352
Will leave this open until 2.1 is out and @mgohashi can verify the fix works. Thanks!
@mgohashi Cryostat 2.1 is out and should be available from OperatorHub on your cluster. Please test it out and let us know the result. If you still have 2.0 installed you can upgrade, but you will need to select the "stable" update channel (not "stable-2.0"), and there is a manual upgrade step required:
# Switch to the project containing your Cryostat instance(s)
oc project <cryostat_project>
# List the names of all Cryostat custom resources in the project
cryostats=$(oc get cryostat --template \
  '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}')
# Delete the old Service and Deployment for each instance so the operator can recreate them
for cryostat in ${cryostats}; do
  oc delete svc,deploy -lapp="${cryostat}"
done
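As an optional sanity check (not part of the official instructions above; the CR name is illustrative), you can watch the operator recreate the deleted resources:

oc get deploy,svc,pods -l app=cryostat-sample -w

Once the new pod reports Ready, the original health check symptom should be gone.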
Closing; there was no follow-up from the reporter, but we believe this is solved.