presto
presto copied to clipboard
Add timeout for query planning
What's the problem it fix here?
It's fixing the same issue as https://github.com/prestodb/presto/pull/18099.
When we have huge number of nodes in the logical plan during planning (an example is a long chain of common table expression (CTE) with each table being the join of multiple tables defined in previous CTE), we will have huge memory consumption, which may kill the coordinator.
What's the fix here?
While the pull request https://github.com/prestodb/presto/pull/18099 fix by limiting the maximum number of nodes in query planning, this pull request fix by setting a time limit to query planning which return exception when timeout.
Notice that we already have timeout logic in SqlQueryExecution.java https://github.com/prestodb/presto/blob/77c53070324aa672d44de3c89398c8f85e12a33c/presto-main/src/main/java/com/facebook/presto/execution/SqlQueryExecution.java#L389-L395 for query analysis phase.
However there are two problems: 1) This timeout does not cover explain queries, which will also do logical query planning and faces the same problem here 2) the timeout works by interrupt the planning threads. However, in the query planning phase, we do not have interrupt checking and handling. Hence even if it timeouts, query planning will still keep working without being interrupted.
In this PR, I did the following: 1) add timeout in code path for explain queries 2) add interrupt signal check and handling code in query planning code.
Currently I set the default timeout to be 3 minutes and enabled by default.
What's the test?
I ran some end to end test locally to verify that such query timeouts.
Usually parsing and analysis are overlapping stages. getSemanticAnalyzerTimeout only checks for analysis. .
With these changes, now we would have timeout for
- Analysis
- Optimizer( added as. getQueryAnalyzerTimeout, I wonder if we can rename it to reflect the realities)
Is it possible to add few test cases which catches these timeouts?
@jainxrohit
Is it possible to add few test cases which catches these timeouts?
Add unit tests.
- Analysis
Covered by the getSemanticAnalyzerTimeout
introduced in this PR
- Optimizer( added as. getQueryAnalyzerTimeout, I wonder if we can rename it to reflect the realities)
For iterative optimizers, the production already have timeout set for each individual optimizer.
In production, getQueryAnalyzerTimeout
actually only guards the StatsRecordingPlanOptimizer
for now. With this PR, we are to have it also cover planning. ~~We can add interrupt signal check if we want to cover the optimizer phase as well, i.e. make sure the whole optimization process complete within getQueryAnalyzerTimeout
~~. Discussed with @kaikalur, include it in a separate PR if we want to have it.
cc @kaikalur for review.
I suggest override process in RelationPlanner so that we don't need to checkInterrupt everwhere.
@kaikalur Updated PR to override process function to have interrupt check.
Can you please fix the PR description and commit message as we discussed?
Can you please fix the PR description and commit message as we discussed?
Updated.
@rschlussel Addressed the comments.
Merged. Thanks @feilong-liu! In the future, also be sure to add a release note in the PR description as per the template (Added one already here)