presto
presto copied to clipboard
Throw exception when number of nodes in logical plan exceeds threshold during query planning
What is the problem?
When we have huge number of nodes in the logical plan during planning (an example is a long chain of common table expression (CTE) with each table being the join of multiple tables defined in previous CTE), we will have huge memory consumption, which may kill the coordinator.
Edit: Added another pull request https://github.com/prestodb/presto/pull/18150, which adds timeout to query planning. It also solves the same problem here, although from a different way. We can choose to keep both or pick one.
What is the fix?
In this pull request, I add a new context (SqlPlannerContext) to the logical query plan generation. It records the number of nodes during query planning. I also added two session properties to control 1) whether to enable this exception 2) the threshold to trigger the threshold.
If this changes looks good, the next plan is to add logging to record the current maximum number of nodes in each, this will help to set the threshold if we want to enable this feature.
Notice that in this diff, I am not tracking the accurate number of nodes in the logical plan, but a rough estimate. This is because tracking the accurate number is not necessary, we are only using this number as a safety guard to those nonsense huge plans.
In this diff, I am tracking the number of table scan and values nodes. Ideally I should also include other operators, for example join, projection, union etc. However, as the code which adds these operators scattered among many functions, I am not sure if it makes sense to do this exhaustively to cover all possible nodes. The current approach is actually counting the number of leaf nodes in the plan. This is enough to cover the example mentioned in the problem section. And it's unlikely that we will have huge plans with tens of thousands levels in query plan tree.
Test plan
Add unit test. Also run test query locally.
Nice addition of SqlPlannerContext! Should help with future enhancements.
Added another pull request https://github.com/prestodb/presto/pull/18150, which adds timeout to query planning. It also solves the same problem here, although from a different way.
If you want to track # nodes instead of # leaves, override process(node, context)
in RelationPlanner. This would be called once for every node(maybe a bit more than that in case of subqueries, but will be more "accurate" in terms of total plan size). Leaves is a good heuristic for CTEs nonetheless.
@rschlussel @kaikalur @pranjalssh The PR is ready for review. The failed test test native / build-and-test (pull_request)
is a known problem and not related to this PR.
LGTM, just add a release note. Accepting in advance
LGTM, just add a release note. Accepting in advance
Added release notes.