REST: Async timeout when saving/reading large execution plan data
The issue tends to occur on execution plans that deal with a large number of columns and very complex data types. These produce large JSON documents that are sent to the Producer and subsequently stored in the DB in a single transaction. The bottleneck is presumably on the Arango side: even though the REST call times out, the execution plan is eventually stored in the DB successfully.
A similar issue occurs on the Consumer REST when a large operation info is requested by the client.
Things to consider:
For Agent & Producer:
- de-duplicate and refine metadata; we probably store more than necessary.
- use a queuing and polling strategy for long-running requests. (We could send the plan & event in parallel and then wait for both or neither, depending on the agent durability preferences.)
- split the execution plan save operation into multiple (possibly parallel) transactions
For Consumer:
- make the Consumer REST more selective; split the operation info call into several smaller calls.
- introduce a loading progress indicator
- re-introduce adaptive timeouts as they were in Spline 0.3
For both:
- :heavy_check_mark: gzip traffic #615
- :heavy_check_mark: increase default timeout #615, AbsaOSS/spline-spark-agent#417, #1043
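The gzip item above can be illustrated with a minimal, framework-agnostic sketch (this is not Spline's actual dispatcher code, just the underlying idea using only the JDK): the agent compresses the execution-plan JSON before sending it, and the server inflates it when `Content-Encoding: gzip` is present.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

object GzipPayload {
  // Agent side: compress a JSON payload before POSTing it to the Producer REST API.
  def compress(json: String): Array[Byte] = {
    val bos = new ByteArrayOutputStream()
    val gz = new GZIPOutputStream(bos)
    gz.write(json.getBytes(UTF_8))
    gz.close()
    bos.toByteArray
  }

  // Server side: inflate the body when the request carries Content-Encoding: gzip.
  def decompress(bytes: Array[Byte]): String = {
    val in = new GZIPInputStream(new ByteArrayInputStream(bytes))
    try new String(in.readAllBytes(), UTF_8) finally in.close()
  }
}
```

Large execution-plan JSON is highly repetitive (attribute names, type descriptors), so it compresses very well, which is why this was one of the first mitigations applied.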
[TBD]
Related to #474
@DzMakatun
@wajda @cerveada Any update on this bug? We are unable to fetch the graph in the detailed lineage view.
The issue was partially remediated by enabling gzip compression and adding a configurable timeout property. We plan to return to this issue later.
That said, fetching the detailed lineage shouldn't normally be a performance issue. Can you give us more stats?
@wajda @cerveada Any update or fix for this bug? We're getting a read timed out exception while the JSON is being posted to Spline.
Sure, we're going to focus on the performance aspects of the system in the next couple of months, so you might expect upcoming Spline 1.0.0 to be significantly more performant.
But because the overall performance depends on a number of factors including the size and structure of a particular execution plan(s) being ingested into Spline, it would greatly help us if you could provide a sanitized sample of the JSON that leads to an issue, so we could investigate it on our end.
@wajda We are facing the exception below when we try to save large lineages.
We are currently using Spline 0.7.5 with Spline Spark Agent 0.7.2.
We tried setting some Spring parameters to prevent the AsyncRequestTimeout but still could not get rid of this issue. Under high load we see many requests fail with the error below. Is there a way to increase the AsyncRequestTimeout?
```
21:27:51.099 [http-apr-8080-exec-25] ERROR za.co.absa.commons.error.ErrorRef - ERROR_ID [debceaf2-f7bd-4216-9333-15371df879d8]
org.springframework.web.context.request.async.AsyncRequestTimeoutException: null
    at org.springframework.web.context.request.async.TimeoutDeferredResultProcessingInterceptor.handleTimeout(TimeoutDeferredResultProcessingInterceptor.java:42)
    at org.springframework.web.context.request.async.DeferredResultInterceptorChain.triggerAfterTimeout(DeferredResultInterceptorChain.java:79)
    at org.springframework.web.context.request.async.WebAsyncManager.lambda$startDeferredResultProcessing$5(WebAsyncManager.java:428)
    at java.util.ArrayList.forEach(ArrayList.java:1259)
    at org.springframework.web.context.request.async.StandardServletAsyncWebRequest.onTimeout(StandardServletAsyncWebRequest.java:151)
    at org.apache.catalina.core.AsyncListenerWrapper.fireOnTimeout(AsyncListenerWrapper.java:44)
    at org.apache.catalina.core.AsyncContextImpl.timeout(AsyncContextImpl.java:135)
    at org.apache.catalina.connector.CoyoteAdapter.asyncDispatch(CoyoteAdapter.java:147)
    at org.apache.coyote.AbstractProcessor.dispatch(AbstractProcessor.java:241)
    at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:59)
    at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:889)
    at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.doRun(AprEndpoint.java:2138)
    at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
    at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191)
    at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
    at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
    at java.lang.Thread.run(Thread.java:750)
```
I've never tried it myself, but in theory this should help.
In the agent project, on the line RestClient.scala:51, add the following:

```scala
.header("X-SPLINE-TIMEOUT", readTimeout.toMillis.toString)
```

and then configure the desired timeout:

```properties
spline.lineageDispatcher.http.timeout.read=120000
```

Please let me know if it works, and I'll make the corresponding changes in our codebase.
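If you configure the agent through Spark properties rather than a `spline.properties` file, the same key is conventionally passed with a `spark.` prefix. Assuming that convention (and with `my-job.jar` as a placeholder for your actual application), it would look roughly like:

```
spark-submit \
  --conf "spark.spline.lineageDispatcher.http.timeout.read=120000" \
  my-job.jar
```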
@wajda From what I see, there is already a connection and read timeout being set on the RestClient:
```scala
new RestClient {
  override def endpoint(resource: String): RestEndpoint = new RestEndpoint(
    baseHttp(s"$baseURL/$resource")
      .timeout(connectionTimeout.toMillis.toInt, readTimeout.toMillis.toInt)
      .compress(true))
}
```
The AsyncRequestTimeout was preventing any lineage from being published, and the UI was unresponsive (the Consumer APIs were also failing).
I made the change you mentioned, but it was still failing with AsyncRequestTimeout. After restarting the gateway, I was able to continue publishing lineages and the UI was responsive again (Consumer APIs included).
The problem with the timeout issue was that it was the server that timed out, not the agent. The code above controls the agent-side timeout only. However, the REST protocol allows sending the X-SPLINE-TIMEOUT header to request the desired timeout on the server side for a given HTTP request. So basically, with that header you control the server-side timeout. The updated agent version has just been released: 0.7.4
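To make the mechanism concrete, here is a small sketch of the two halves described above. The header name X-SPLINE-TIMEOUT is real; the helper names and the capping function are illustrative, not Spline's actual code:

```scala
import scala.concurrent.duration._

object SplineTimeoutSketch {
  val HeaderName = "X-SPLINE-TIMEOUT"

  // Agent side: advertise the desired server-side async timeout (in milliseconds).
  def requestHeaders(readTimeout: FiniteDuration): Map[String, String] =
    Map(HeaderName -> readTimeout.toMillis.toString)

  // Server side: honor the requested timeout, but never exceed the configured maximum
  // (this is the guard described in #1043).
  def effectiveTimeoutMs(requested: Option[Long], default: Long, maximum: Long): Long =
    math.min(requested.getOrElse(default), maximum)
}
```

The point of the server-side cap is that the header is client-controlled input, so the server must be able to reject impractically long timeout requests.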
(I'll also add a respective max timeout config option on the server to guard it from impractically long timeout requests #1043)
Regarding the second part of your question: there is a reason why the default 30-second timeout wasn't enough for you. If the lineage metadata is too large or complex, and/or the database has grown too big, then timeout issues and UI unresponsiveness can be expected. We are working on optimizing Spline server performance, and you can expect the upcoming 1.0.0 version to perform better than 0.7.x. But as you know, performance tuning is a multi-faceted problem. To be able to answer why it performs poorly for you, we need to know the details of your use case: what your execution plans look like on average, what your load pattern looks like (how often jobs are executed, how many records are in the database, etc.), what the cluster configuration is, how many resources your nodes have, the replication factor, and so on.
@wajda Is there a way to configure the timeout properties on the Producer API server? Perhaps there is an argument that can be supplied or an environment variable. I am running the REST server in a Docker container. Thanks.
@mi-my-sf ,
- spline.[consumer|producer].timeoutDefault - the default async request timeout, applied when one isn't specified by the client
- spline.[consumer|producer].timeoutMaximum - the async timeout threshold, applied to both the default and the client-requested timeout value

Both properties are optional. By default, the timeout is 2 minutes, and the maximum is 1 hour.
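For example, assuming millisecond values (consistent with the agent-side read timeout discussed earlier in this thread), a server configuration fragment could look like:

```
spline.producer.timeoutDefault=120000
spline.producer.timeoutMaximum=3600000
```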
Also see the following comment https://github.com/AbsaOSS/spline-spark-agent/issues/431#issuecomment-1123360177
The environment variable would be respectively:
SPLINE_PRODUCER_TIMEOUT_DEFAULT etc.
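For a Docker deployment, those environment variables are passed the usual way with `-e`. A sketch, with the image name and tag assumed rather than prescribed:

```
docker run -p 8080:8080 \
  -e SPLINE_PRODUCER_TIMEOUT_DEFAULT=120000 \
  -e SPLINE_PRODUCER_TIMEOUT_MAXIMUM=3600000 \
  absaoss/spline-rest-server:0.7.5
```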