optd
optd copied to clipboard
The Devil of MergeGroup
Why we have MergeGroup in the first place?
For both testing the optd cascades core and shrinking the search space, we were trying to add heuristic rules in cascades. However, some simple rules like eliminateFilter which eliminate filter node when the filter predicates are always true lead to the situation that we have to add a MergeGroup in optd core.
// memo before eliminateFilter
Group(n): Filter(true) -> child(Group(n+1))
Group(n+1): expr
// memo after eliminateFilter
Group(n): Filter(true) -> child(Group(n+1)), placeholder(Group(n+1))
Group(n+1): expr
// The result returned by eliminateFilter means that Group(n) and Group(n+1) are actually logically equivalent. So we add MergeGroup in optd to merge these two groups.
The devil of MergeGroup
There's no MergeGroup definition in any cascades paper nor columbia paper and corresponding implementation. The adding of mergeGroup is actually pretty experimental. Besides, in real world optimizer, there are always rewrite phase/prepare phase before the cascades framework to make the expressions normalized. There's actually no need to do such heuristic rules in cascades.
The introduction of MergeGroup leads to two problems: loops in operator tree.
loops in operator tree
Take Select * from t1 where true
as an example. The initial groups are:
// memo
G1: filter(true) -> child(G2)
G2: Scan (t1)
After eliminateFilter in cascades rule, it becomes:
// memo
G1: filter(true)->child(G2), Scan(t1)
G2: merged to G1, reduceGroupId=G1
It is ok when we execute the query as it only cares the winner. But when we wants to traverse the operator tree somehow, it goes into loops for expressions like filter(true)->child(G2)
as they are expressions points to themselves as a child.
It makes tricky bugs for some logic other than execution, for example, join order enumeration, tree traversal, debug printing and so on.
A tricky case is that will it make firing tasks out of order?
In optd, the order of the tasks fired are carefully designed following the columbia paper.
We first fire OptimizeGroup
task for the root/top/first group, it then traverse all the expressions in the task and fire OptimizeExpr
task for all the logical exprs in the group, and OptimizeInput
task for all the physical exprs in the group.
OptimizeExpr
first call ExploreGroup
tasks for the expr's children and then traverse the rules and fire ApplyRule
task for the matched rules for expr.
Newly generated logical exprs are fired with new OptimizeExpr
as they are not originally in the group when optimizeGroup
or ExploreGroup
is called.
OptimizeInput
is responsible to calculate the costs for the expr and update winner for the group, it fires optimizeGroup
for the children group when the children group haven't have a winner. But except for the root/top/first group, other groups have to be the children node of the physical expression in OptimizeInput
can it be called with optimizeGroup
.
In abstract, the cascades framework expect all the groups and their children are fully explored (apply all matching rules) from top group to the leaf group, and the costs and winners are updated then from the top group's physical expression to the children group.
If we merge two groups, these groups exist before merge (as rules can not create new group themselves, they can only return existing group). They are either being called with exploreGroup/OptimizeGroup or not. If they are called with exploreGroup before, we can not explore the group again.(exploreGroup skip groups following the groupId, though they are merged, the group Id remains the same). And optimizeGroup can only be called for root group or children group of physical expressions in parent group.
So if the merge destination group is explored before, the merge source group is not explored and both groups are not being called with optimizeGroup(it is either not the children group of physical expressions or pruned if we adds pruning), it will miss newly added expressions from merge source group. 🤯🤯🤯 It is a very trivial case and hard to be found.
Future task
MergeGroup will lead to debugging nightmare and tricky problems. We may need to remove it. Besides, if we don't add heuristic rules into cascades logic, we really don't need it.