presto icon indicating copy to clipboard operation
presto copied to clipboard

Add timeout for query planning

Open feilong-liu opened this issue 2 years ago • 2 comments

What's the problem it fix here?

It's fixing the same issue as https://github.com/prestodb/presto/pull/18099.

When we have huge number of nodes in the logical plan during planning (an example is a long chain of common table expression (CTE) with each table being the join of multiple tables defined in previous CTE), we will have huge memory consumption, which may kill the coordinator.

What's the fix here?

While the pull request https://github.com/prestodb/presto/pull/18099 fix by limiting the maximum number of nodes in query planning, this pull request fix by setting a time limit to query planning which return exception when timeout.

Notice that we already have timeout logic in SqlQueryExecution.java https://github.com/prestodb/presto/blob/77c53070324aa672d44de3c89398c8f85e12a33c/presto-main/src/main/java/com/facebook/presto/execution/SqlQueryExecution.java#L389-L395 for query analysis phase.

However there are two problems: 1) This timeout does not cover explain queries, which will also do logical query planning and faces the same problem here 2) the timeout works by interrupt the planning threads. However, in the query planning phase, we do not have interrupt checking and handling. Hence even if it timeouts, query planning will still keep working without being interrupted.

In this PR, I did the following: 1) add timeout in code path for explain queries 2) add interrupt signal check and handling code in query planning code.

Currently I set the default timeout to be 3 minutes and enabled by default.

What's the test?

I ran some end to end test locally to verify that such query timeouts.

feilong-liu avatar Aug 07 '22 06:08 feilong-liu

Usually parsing and analysis are overlapping stages. getSemanticAnalyzerTimeout only checks for analysis. .

With these changes, now we would have timeout for

  1. Analysis
  2. Optimizer( added as. getQueryAnalyzerTimeout, I wonder if we can rename it to reflect the realities)

Is it possible to add few test cases which catches these timeouts?

jainxrohit avatar Aug 08 '22 17:08 jainxrohit

@jainxrohit

Is it possible to add few test cases which catches these timeouts?

Add unit tests.

  1. Analysis

Covered by the getSemanticAnalyzerTimeout introduced in this PR

  1. Optimizer( added as. getQueryAnalyzerTimeout, I wonder if we can rename it to reflect the realities)

For iterative optimizers, the production already have timeout set for each individual optimizer. In production, getQueryAnalyzerTimeout actually only guards the StatsRecordingPlanOptimizer for now. With this PR, we are to have it also cover planning. ~~We can add interrupt signal check if we want to cover the optimizer phase as well, i.e. make sure the whole optimization process complete within getQueryAnalyzerTimeout~~. Discussed with @kaikalur, include it in a separate PR if we want to have it.

feilong-liu avatar Aug 10 '22 21:08 feilong-liu

cc @kaikalur for review.

feilong-liu avatar Aug 22 '22 21:08 feilong-liu

I suggest override process in RelationPlanner so that we don't need to checkInterrupt everwhere.

@kaikalur Updated PR to override process function to have interrupt check.

feilong-liu avatar Aug 23 '22 18:08 feilong-liu

Can you please fix the PR description and commit message as we discussed?

jainxrohit avatar Aug 23 '22 18:08 jainxrohit

Can you please fix the PR description and commit message as we discussed?

Updated.

feilong-liu avatar Aug 23 '22 21:08 feilong-liu

@rschlussel Addressed the comments.

feilong-liu avatar Aug 24 '22 18:08 feilong-liu

Merged. Thanks @feilong-liu! In the future, also be sure to add a release note in the PR description as per the template (Added one already here)

rschlussel avatar Aug 24 '22 19:08 rschlussel