spline icon indicating copy to clipboard operation
spline copied to clipboard

REST: Async timeout when saving/reading large execution plan data

Open wajda opened this issue 6 years ago • 14 comments

The issue tends to happen on the execution plans that deals with large amount of columns and very complex data types. This produces large JSON that is sent to the Producer and subsequently stored in the DB in single transaction. The bottleneck is presumably on the Arango side, as event though the REST call times out, the execution plan is eventually successfully stored in the DB.

A similar issue occurs on the Consumer REST when a large operation info is requested by the client.

Thinks to consider:

For Agent & Producer:
  • de-duplicate and refine metadata. Probably we store too much stuff unnecessarily.
  • use queuing and polling strategy for long running requests. (could sent plan & event in parallel and then either wait for both or none depending on the agent durability preferences)
  • split execution plan save operation into multiple transactions (parallel?)
For Consumer:
  • make Consumer REST more selective, split operation info call into several smaller calls.
  • introduce loading progress indicator
  • re-introduce adaptive timeouts as were in Spline 0.3
For both:
  • :heavy_check_mark: gzip traffic #615
  • :heavy_check_mark: increase default timeout #615, AbsaOSS/spline-spark-agent#417, #1043

[TBD]

wajda avatar Nov 28 '19 08:11 wajda

Related to #474

wajda avatar Nov 28 '19 08:11 wajda

@DzMakatun

wajda avatar Nov 28 '19 08:11 wajda

@wajda @cerveada Any update on this bug? We are unable to fetch graph in detailed lineage

uday1409 avatar Dec 17 '20 12:12 uday1409

The issue was partially remediated by enabling gzip compression and adding configurable timeout propery. We plan to return to this issue later.

wajda avatar Dec 17 '20 15:12 wajda

But for getting detailed lineage performance should be an issue. Can you give us more stats?

wajda avatar Dec 17 '20 15:12 wajda

@wajda @cerveada Any update or fix for this bug ? getting read timed out exception while the json is getting posted to spline

vchichil97 avatar Jan 18 '22 05:01 vchichil97

Sure, we're going to focus on the performance aspects of the system in the next couple of months, so you might expect upcoming Spline 1.0.0 to be significantly more performant.

But because the overall performance depends on a number of factors including the size and structure of a particular execution plan(s) being ingested into Spline, it would greatly help us if you could provide a sanitized sample of the JSON that leads to an issue, so we could investigate it on our end.

wajda avatar Jan 18 '22 10:01 wajda

@wajda we are facing the below exception when we try to save large lineages .

We are currently using 0.7.5 spline , 0.7.2 spline spark agent.

We tried setting some spring paramters to prevent AsyncRequestTimeout but still could not get rid of this issue . During high load we see lot many requests fail with the below error. Is there a way how we can increase the AsyncRequestTimeout ?

21:27:51.099 [http-apr-8080-exec-25] ERROR za.co.absa.commons.error.ErrorRef - ERROR_ID [debceaf2-f7bd-4216-9333-15371df879d8]
org.springframework.web.context.request.async.AsyncRequestTimeoutException: null
	at org.springframework.web.context.request.async.TimeoutDeferredResultProcessingInterceptor.handleTimeout(TimeoutDeferredResultProcessingInterceptor.java:42)
	at org.springframework.web.context.request.async.DeferredResultInterceptorChain.triggerAfterTimeout(DeferredResultInterceptorChain.java:79)
	at org.springframework.web.context.request.async.WebAsyncManager.lambda$startDeferredResultProcessing$5(WebAsyncManager.java:428)
	at java.util.ArrayList.forEach(ArrayList.java:1259)
	at org.springframework.web.context.request.async.StandardServletAsyncWebRequest.onTimeout(StandardServletAsyncWebRequest.java:151)
	at org.apache.catalina.core.AsyncListenerWrapper.fireOnTimeout(AsyncListenerWrapper.java:44)
	at org.apache.catalina.core.AsyncContextImpl.timeout(AsyncContextImpl.java:135)
	at org.apache.catalina.connector.CoyoteAdapter.asyncDispatch(CoyoteAdapter.java:147)
	at org.apache.coyote.AbstractProcessor.dispatch(AbstractProcessor.java:241)
	at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:59)
	at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:889)
	at org.apache.tomcat.util.net.AprEndpoint$SocketProcessor.doRun(AprEndpoint.java:2138)
	at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
	at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191)
	at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659)
	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
	at java.lang.Thread.run(Thread.java:750)

abhishekshenoy avatar Mar 17 '22 21:03 abhishekshenoy

I've never tried it myself, but in theory this should help. In the agent project, on the line RestClient.scala:51 add the following:

.header("X-SPLINE-TIMEOUT", readTimeout.toMillis.toString)

and then configure desired timeouts:

spline.lineageDispatcher.http.timeout.read=120000

Please let me know if it works, I'll make according changes in our codebase.

wajda avatar Mar 17 '22 22:03 wajda

@wajda from what i see there is already connection and read timeout being set on the RestClient

    new RestClient {
      override def endpoint(resource: String): RestEndpoint = new RestEndpoint(
        baseHttp(s"$baseURL/$resource")
          .timeout(connectionTimeout.toMillis.toInt, readTimeout.toMillis.toInt)
          .compress(true))
    }

The AsyncRequestTimeout was not allowing any Lineage to get published nor was the UI responsive (Consumer APIs was also failing).

I did the change you mentioned but that was still failing with AsyncRequestTimeout , on restarting the gateway , was able to continue publishing lineages and the UI was also responsive (Consumer APIs)

abhishekshenoy avatar Mar 21 '22 03:03 abhishekshenoy

The problem with the timeout issue was that it's the server that timed out, not the agent. The code above controls the agent side timeout only. On the other hand the REST protocol allows sending the X-SPLINE-TIMEOUT header to request the desired timeout on the server side for the given HTTP request. So basically, with that header you control the server side timeout. The updated agent version has just been released: 0.7.4 (I'll also add a respective max timeout config option on the server to guard it from impractically long timeout requests #1043)

Regarding the second part of your question, well there is a reason why the default 30 sec timeout wasn't enough for you. If the lineage metadata is too large or complex, and/or the database has grown too big then timeout issues and UI unresponsiveness might be expected. We are working on optimizing Spline server performance, and you might expect upcoming 1.0.0 version to perform better than 0.7.x. But as you know performance tuning is multi-sided problem. To be able to answer the question why it performs poorly for you, we need to know details about your use-case: what do you execution plans look like in average, what does you load pattern look like (how often jobs are executed, how many records in the database etc), what is the cluster configuration, how much resource do your nodes have, the replication factor, etc etc etc

wajda avatar Mar 21 '22 11:03 wajda

@wajda Is there a way to configure the timeout properties on the Producer API server? Perhaps there is an argument that can be supplied or an environment variable. I am running the REST server in a Docker container. Thanks.

mi-my-sf avatar Jun 16 '22 20:06 mi-my-sf

@mi-my-sf ,

spline.[consumer|producer].timeoutDefault - default async request timeout if one isn't specified by the client spline.[consumer|producer].timeoutMaximum - async timeout threshold that is applied to both default and user requested timeout value.

Both properties are optional. By default, timeout is 2 minutes, and the maximum one is 1 hour

Also see the following comment https://github.com/AbsaOSS/spline-spark-agent/issues/431#issuecomment-1123360177

wajda avatar Jun 17 '22 12:06 wajda

The environment variable would be respectively: SPLINE_PRODUCER_TIMEOUT_DEFAULT etc.

wajda avatar Jun 17 '22 12:06 wajda