datafusion
datafusion copied to clipboard
Fix TreeNode transform and rewrite APIs
Which issue does this PR close?
Part of https://github.com/apache/arrow-datafusion/issues/8913
Rationale for this change
-
The
TreeNode.rewrite()/TreeNodeRewriteris very inconsistent with otherTreeNodeapply / visit functions:TreeNodeRewriter.pre_visit()currently returnsRewriteRecursionso as to control the recursion but that enum is very inconsistent with theVisitRecursionenum that visit functions use.RewriteRecursion::Stopfunctionality is equal toVisitRecursion::Skip, that is to skip (prune) the subtree (don't recurse into children and don't run post-order functions).RewriteRecursion::Skipdoesn't prevent recursing into children. This is different toVisitRecursion::Skipwhere recursion into children is not executed.VisitRecursion::Stopfully stops recursion which functionality isn't available withRewriteRecursion.- The transformation defined in
TreeNodeRewriter.mutate()is sometimes a top-down (pre-order) but other times a bottom-up (post-order) function depending onTreeNodeRewriter.pre_visit()result.
This PR removes
RewriteRecursionand renamesVisitRecursionto a commonTreeNodeRecursionto control all (apply / visit / rewrite / transform) APIs. PreviousVisitRecursion::Skipgot renamed toJumpand its functionality is now extended to bottom-up and combined traversals as follows:pub enum TreeNodeRecursion { /// Continue recursion with the next node. Continue, /// In top-down traversals skip recursing into children but continue with the next /// node, which actually means pruning of the subtree. /// In bottom-up traversals bypass calling bottom-up closures till the next leaf node. /// In combined traversals bypass calling bottom-up closures till the next top-down /// closure. Jump, /// Stop recursion. Stop, }and changes
TreeNodeRewriterto incorporate anf_downand a nf_upmethods that both can change the node and both can use the commonTransformedstruct (containing aTreeNodeRecursionenum) to control how to proceed with recursion:pub trait TreeNodeRewriter: Sized { /// The node type which is rewritable. type Node: TreeNode; /// Invoked while traversing down the tree before any children are rewritten / /// visited. /// Default implementation returns the node unmodified and continues recursion. fn f_down(&mut self, node: Self::Node) -> Result<Transformed<Self::Node>> { Ok(Transformed::no(node) } /// Invoked while traversing up the tree after all children have been rewritten / /// visited. /// Default implementation returns the node unmodified. fn f_up(&mut self, node: Self::Node) -> Result<Transformed<Self::Node>> { Ok(Transformed::no(node) } }This solution is capable to replace all
RewriteRecursionfunctionalities and make theTreeNodeRewriters not just cleaner but more versatile. -
This PR modifies the
fn transform<F>(self, op: &F) -> Result<Self>, that is currently an alias totransform_up(), to a more versatile:fn transform<FD, FU>( self, f_down: &mut FD, f_up: &mut FU, ) -> Result<Transformed<Self>> where FD: FnMut(Self) -> Result<Transformed<Self>>, FU: FnMut(Self) -> Result<Transformed<Self>>,- Sometimes we don't need a full blown
TreeNodeRewriterbut defining 2 closures for top-down and bottom-up transformations is just enough. - I think that that the missing direction definition of current
transform()makes is a bit confusing and explicitely usingtransform_down()/transform_up()is more robust. So we could usetransform()for the 2 closure transformation.
- Sometimes we don't need a full blown
-
This PR changes the
Transformedenum the following struct:pub struct Transformed<T> { pub data: T, pub transformed: bool, pub tnr: TreeNodeRecursion, } -
This PR modifies the
transform_down()andtransform_up()methods to use the newTransformedstruct.
Please note that this PR implements 2. from https://github.com/apache/arrow-datafusion/pull/7942 and fixes the Transformed enum issue: https://github.com/apache/arrow-datafusion/pull/8835
What changes are included in this PR?
This PR:
- Removes
RewriteRecursion, - Renames
VisitRecursiontoTreeNodeRecursion. The previousSkipelement is renamed toJumpand bottom-up functionality is added. - Modifies
TreeNodeRewriterto incorporate anf_downandf_upmethods and useTransformed/TreeNodeRecursion, - Modifies
TreeNode.transform()to accept anf_downandf_upclosures and useTransformed/TreeNodeRecursion. - Modifies
Transformedenum to a new struct. - Modifies
TreeNode.transform_down()andTreeNode.transform_up()to use theTransformedstruct. - Alligns all
apply/visit/transform/rewriteAPIs to usef,f_down,f_upclosure parameter names.
Are these changes tested?
Existng tests.
Are there any user-facing changes?
Yes.
cc @alamb, @andygrove, @yahoNanJing, @berkaysynnada this PR is another step to cleaner TreeNode APIs.
This PR:
Removes RewriteRecursion, Renames VisitRecursion to TreeNodeRecursion, Modifies TreeNodeRewriter to incorporate an f_down and f_up methods and use TreeNodeRecursion
These changes seem reasonable to me and, since they propose a similar approach to other recursive flows, they make understanding easier.
However, I am not sure about this change:
Modifies TreeNode.transform() to accept an f_down and f_up closures and use TreeNodeRecursion.
transform() is a simple method for traversing the graph with given self-concerned rules, being not interested in the type of traversal. This new version forces awareness of the traversal order, and there is also an implicit convention of traversing first bottom-up, then top-down.
Thanks for the feedback @berkaysynnada.
transform()is a simple method for traversing the graph with given self-concerned rules, being not interested in the type of traversal. This new version forces awareness of the traversal order, and there is also an implicit convention of traversing first bottom-up, then top-down.
I feel that a simple (simpler than than TreeNodeRewriter) API that is also capable to do both direction transformation in 1 traversal would be useful but I agree, it doesn't necessarily need to replace the existing transform().
Also, just to add a bit more details to this part of the PR: While I was local testing this PR I accidentally changed all transform() to transform_down() (so not to its current aliased transform_up() implementation) and run into stack overflow errors in some tests. This suggested me that despite transform() should be used only with closures that don't require any specific direction, some invocation do require transform to do its currently defined bottom-up traversal. I'm sure that most of the transform() usescases are direction agnostic, but I feel it would be less error prone if we required explicit direction from API users.
and there is also an implicit convention of traversing first bottom-up, then top-down.
I'm not sure I get this part.
I'm not sure I get this part.
The docstring you have updated for transform() states that the order is:
1) f_down(ParentNode)
2) f_down(ChildNode1)
3) f_up(ChildNode1)
4) f_down(ChildNode2)
5) f_up(ChildNode2)
6) f_up(ParentNode)
I meant what if we intend to do such:
1) f_up(ChildNode1)
2) f_up(ChildNode2)
3) f_up(ParentNode)
4) f_down(ParentNode)
5) f_down(ChildNode1)
6) f_down(ChildNode2)
If we want to offer such a use, I think we may need to consider this way as well.
I'm not sure I get this part.
The docstring you have updated for
transform()states that the order is:1) f_down(ParentNode) 2) f_down(ChildNode1) 3) f_up(ChildNode1) 4) f_down(ChildNode2) 5) f_up(ChildNode2) 6) f_up(ParentNode)
This is correct. This is because the proposed new transform() does only one traversal. It is exactly the same order how the old and new TreeNode::rewrite / TreeNodeRewriter worked and works.
If someone needs f_down to process all nodes atop-down nd then f_up to process all nodes bottom-up (or vice versa) then they need 2 traversals with subsequent explicit transform_down() / transform_up() calls.
This is correct. This is because the proposed new
transform()does only one traversal. It is exactly the same order how the old and newTreeNode::rewrite/TreeNodeRewriterworked and works.
I checked again. Yes, the order is correct. What I really meant is that when we use transform(), it is assumed that we first pass from top to bottom and then from bottom to top. But if we want to do the opposite, i.e. first from bottom to top and then from top to bottom, we cannot use this transform(), am I right?.
Since the nodes access their children, first top-down and then bottom-up traversals seem to be a single pass, doing it in reverse order is like 2 passes, but as a behavior, it may be desirable to have reverse order and should we provide a way for this?
This is correct. This is because the proposed new
transform()does only one traversal. It is exactly the same order how the old and newTreeNode::rewrite/TreeNodeRewriterworked and works.I checked again. Yes, the order is correct. What I really meant is that when we use
transform(), it is assumed that we first pass from top to bottom and then from bottom to top. But if we want to do the opposite, i.e. first from bottom to top and then from top to bottom, we cannot use thistransform(), am I right?.
That's correct. I don't think that botom-up and then top-down can't be achieved in one run.
Since the nodes access their children, first top-down and then bottom-up traversals seem to be a single pass, doing it in reverse order is like 2 passes, but as a behavior, it may be desirable to have reverse order and should we provide a way for this?
API users can explicitly call transform_up() and then transform_down() (2 pass) if this is needed. Do you have a real use case for this?
API users can explicitly call
transform_up()and thentransform_down()(2 pass) if this is needed. Do you have a real use case for this?
Since we set out to simplify and provide easy to use TreeNode API and its related implementations as much as possible, IMO we need to reach a simple state at the first step. As far as I observed, there is no use for this version of transform(), and it mixes the VisitRecursion logic with transform logic (early return option is a newly introduced feature for transform() methods). It may make sense to add such things when they are needed.
Since we set out to simplify and provide easy to use TreeNode API and its related implementations as much as possible, IMO we need to reach a simple state at the first step. As far as I observed, there is no use for this version of
transform(), and it mixes theVisitRecursionlogic with transform logic (early return option is a newly introduced feature fortransform()methods). It may make sense to add such things when they are needed.
Although I used the new transform only once in my PR to replace and simplify a rewrite() (https://github.com/apache/arrow-datafusion/pull/8891/files#diff-f69e720234d9da6cb3a4a178d3a5575fcd34f68191335a8c2465a172e8c4e6f1R653) I plan to use it in the future. But I'm ok with droping 2. from this PR. Let's hear some more feedback on this.
Hi @peter-toth -- thank you for this PR and the various others that are trying to improve the quality of code in DataFusion
I think one thing that is challenging for me is to understand how to evaluate the benefits of this and related PRs in an objective manner.
This PR, for example, may improve the consistency of APIs (which I happen to think it does) but making this change will also come at a cost for DataFusion users:
- If they use the
TreeNodeAPI they will have to change their code on upgrade - If they were familiar with the existing TreeNode API, they will have to learn the new API
For other breaking API changes, there are often other benefits that come with the cost, such as new features or improved performance.
I wonder if you can help us articulate how this (and the related refactoring PRs) help DataFusion users -- for example do they
- permit us to implement Expression or plan node rewrites more efficiently?
- Or will they allow us to consolidate / reduce code to reduce the maintenance burden to DataFusion maintainers
If we can connect this work to a benefit I suspect we would be able to drive it forward more quickly
Thanks for the question @alamb! I'm new to DataFusion so this discussion could help me a lot to understand what ways are viable in terms of improving the existing code. I came from a different open-source query engine but I think TreeNode APIs are fundamental for each engine. While I was getting familiar with DataFusion I stumbled upon quite a few inconsistencies and missing features regarding these APIs. I was trying to throw in some ideas with my PRs that could help improving the situation. Some of these PRs are simply trying to improve consistency some others are performance related.
Here are a few things I found weird with the current APIs. These can cause a newcommer (like me) some confusion:
-
The
apply/visitfunctions are capable to prune (skip) subtrees or return immediately (stop) but thetransformfunctions are not. To make it more confusionrewriteis also capable to to prune, but instead of using a common way to control recursion,visitandrewriteuse different enums with conflicting semantics. See this PR (https://github.com/apache/arrow-datafusion/pull/8891) for details onVisitRecursionvsRewriteRecursion. -
There is this
Transformedenum whose purpose is to explicitely define if any change was made to a node. This enum is used intransformclousres but for some reason it is missing fromrewrite. Moreover this explicit information is neither propogatged up to API callee nor it is used for detecting changes in nodes. See details in https://github.com/apache/arrow-datafusion/pull/8835. I believe the current state simply doesn't make sense and just causes confusion. -
rewritealso seems to be inconsistent withtransform_upandtransform_down.rewriteprobably organically ended up in its current state and capable to do its current job, but what I would expect from such a function is that it simply incorporates 2 closures fromtransform_upandtransform_downto define:- what transformation should be done before (pre-order) and after (post-order) transformation of the node's children
- and how the recursion should continue .
See this PR (https://github.com/apache/arrow-datafusion/pull/8891) for the details. IMO the current
TreeNodeRewriter.pre_visit()seems like a mess and its logic is hard to grasp.
What features I miss from the current API:
- Pruning capability of
transformfunctions: Pruning would be very important as many of the tranformations wouldn't require traversing on the whole tree and could improve performance a lot. - Payload propagation:
I think the the reason why there are so many (4 base + 9 derived) tree implementations (with many code duplications) in DataFusion is that there is no API to describe a transformation while also propagating state. In https://github.com/apache/arrow-datafusion/pull/8664 with the help of
transform_with_payloadI not just eleminated 7 derived trees but also improved some of the algorightms to require only one traversal (also a perf improvement).
So to recap my ideal TreeNode transformation API would look like this:
/// Transforms the tree using `f_down` and `f_up` closures. `f_down` is applied on a
/// node while traversing the tree top-down (pre-order, before the node's children are
/// visited) while `f_up` is applied on a node while traversing the tree bottom-up
/// (post-order, after the the node's children are visited).
///
/// The `f_down` closure takes
/// - a `PD` type payload from its parent
/// and returns a tuple made of:
/// - a possibly modified node,
/// - a `PC` type payload to pass to `f_up`,
/// - a `Vec<FD>` type payload to propagate down to its children
/// (one `FD` element is propagated down to each child),
/// - a `TreeNodeRecursion` enum element to control recursion flow.
///
/// The `f_up` closure takes
/// - a `PC` type payload from `f_down` and
/// - a `Vec<PU>` type payload collected from its children
/// and returns a tuple made of:
/// - a possibly modified node,
/// - a `FU` type payload to propagate up to its parent,
/// - a `TreeNodeRecursion` enum element to control recursion flow.
fn transform_with_payload<FD, PD, PC, FU, PU>(
self,
f_down: &mut FD,
payload_down: PD,
f_up: &mut FU,
) -> Result<(Self, PU)>
where
FD: FnMut(Self, PD) -> Result<(Self, Vec<PD>, PC, TreeNodeRecursion)>,
FU: FnMut(Self, PC, Vec<PU>) -> Result<(Self, PU, TreeNodeRecursion)>,
Obviously the closure return types can be extracted to a type alias or strucs like in the case of f_down could be:
pub struct TransformDownResult<N, PD, PC> {
pub transformed_node: N,
pub payload_to_children: Vec<PD>,
pub payload_to_f_up: PC,
pub recursion_control: TreeNodeRecursion,
}
All the other functions should be just a specialization of the above:
apply(visit_down): Onlyf_downclosure that doesn't return a modified node and payload (only retrurnsTreeNodeRecursion)visit/TreeNodeVisitor: Bothf_downandf_upclosures needed. The closures can be incorporated into aTreeNodeVisitorobject but they don't return a modified node and payloadtransform_down: Onlyf_downclosure that doesn't return payloadtransform_up: Onlyf_upclosure that doesn't return payloadtransform: Bothf_downandf_upclosures needed, but they don't return payloadsrewrite/TreeNodeRewriter: Bothf_downandf_upare incorporated into aTreeNodeRewriterobject, but they don't return a payloadtransform_down_with_payload: Onlyf_downclosuretransform_up_with_payload: Onlyf_upclosure
I believe all these PRs would take us forward to a simple but still versarile APIs that have/are:
- performance benefits (e.g. pruning during transformation + improved algorithms in https://github.com/apache/arrow-datafusion/pull/8664),
- reduced code maintenance costs (e.g. see less dereived trees in https://github.com/apache/arrow-datafusion/pull/8664)
- consistent and easy to grasp (e.g. this PR + https://github.com/apache/arrow-datafusion/pull/8835)
Now, as you mentioned some of these proposed changes have conflicting semantics to current APIs. Honestly I can't decide how much turbulance they would cause and so I'm seeking feedback from you and the community. But please note that I don't insist on any of the above functionaly or namings. These are just ideas and I can imagine implementations where:
- we don't touch any existing API (maybe deprecate old ones)
- or modify existing APIs to make them consistent, but also keep old API alternatives to ease the transition.
I see -- thank you @peter-toth. Your explanation makes sense.
As a historical note, the current TreeNode API came out of unifying different in inconsistent APIs across the Exprs, LogicalPlan , ExectionPlan and PhysicalExprs. It was a large step forward at the time, but also just getting a single API was a major step forward.
Here are a few things I found weird with the current APIs. These can cause a newcommer (like me) some confusion:
I think this is a good insight -- basically that the rationale for this change is to make it easier for new contributors to contribute to DataFusion as they APIs are more uniform.
Now, as you mentioned some of these proposed changes have conflicting semantics to current APIs. Honestly I can't decide how much turbulance they would cause and so I'm seeking feedback from you and the community. But please note that I don't insist on any of the above functionaly or namings. These are just ideas and I can imagine implementations where:
we don't touch any existing API (maybe deprecate old ones) or modify existing APIs to make them consistent, but also keep old API alternatives to ease the transition.
I think implementing the "one true API" as you have proposed above and then migrating the rest of the codebase to use it (possibly deprecating the old APIs or changing them one at at time) would be the least disruptive.
To that end, what I will do is to create a new ticket describing this work so it is not lost in this ticket, and then we can use that ticket to make the overall plan visible
I filed https://github.com/apache/arrow-datafusion/issues/8913 -- let me know what you think. What do you think about creating a PR with transform_with_payload and then a PR showing how to use it to improve one of the existing passes (or implement TreeNode::rewrite in terms of it or something?)
I filed #8913 -- let me know what you think. What do you think about creating a PR with
transform_with_payloadand then a PR showing how to use it to improve one of the existing passes (or implement TreeNode::rewrite in terms of it or something?)
Thanks @alamb! I will share my thoughts regarding transform_with_payload there. Actually I wanted to wait with transform_with_payload (https://github.com/apache/arrow-datafusion/pull/8664) until @ozankabak finishes https://github.com/apache/arrow-datafusion/pull/8817 as those 2 PRs are conflicting (https://github.com/apache/arrow-datafusion/pull/8664#issuecomment-1884878223) so it is good that we have dedicated ticket to discuss the 2 solutions.
The API inconsitency of the rewrite API (see 1. in PR description) and the poposed new transform semantics (see 2.) don't requite transform_with_payload as a base API and this PR doesn't conflict with https://github.com/apache/arrow-datafusion/pull/8664 or https://github.com/apache/arrow-datafusion/pull/8817 so I think this PR is ready for review and feedback is welcome.
So basically I trust you two to come up with something good. I would be happy to merge this PR, or wait for #8817 and evaluate first.
Yes -- let's first get #8817 in and then I and @berkaysynnada will review and go through this in detail and the others in detail
So basically I trust you two to come up with something good. I would be happy to merge this PR, or wait for #8817 and evaluate first.
Yes -- let's first get #8817 in and then I and @berkaysynnada will review and go through this in detail and the others in detail
Sounds good to me. I left a comment https://github.com/apache/arrow-datafusion/pull/8817#issuecomment-1902666804 on https://github.com/apache/arrow-datafusion/pull/8817. It can help with further steps of the epic: https://github.com/apache/arrow-datafusion/issues/8913 Thanks @alamb, @ozankabak, @berkaysynnada.
Thanks everyone. I will carefully review this PR after #8817 is merged and the related updates are done. Thank you again @peter-toth for your efforts!
@alamb, @ozankabak, @berkaysynnada I rebased this PR after https://github.com/apache/arrow-datafusion/pull/8817 got merged. Please share your thoughts.
One thing that raises my "why is it like that?" instinct is the distinction between return types of down and up functions. Is there a specific reason we'd have such an asymmetry?
Oh yes, there is. The only enum element that would make sense in f_up is TreeNodeRecursion::Stop to fully stop recursion immediately (Continue is the default behaviour, and Skip doesn't make sense as we already visited/transformed childrens).
Currently we don't support TreeNodeRecursion::Stop in f_down either. (Maybe I forgot to document that.) This is because to support Stop we would need TreeNode.rewrite() / TreeNode.map_children() to be able to propagate Stop up. (TreeNode::rewrite() / TreeNode::map_children() would need to return TreeNodeRecursion besides the transformed node similarly to TreeNode::visit() / TreeNode::apply_children(), that returns TreeNodeRecursion, and so Stop is supported there.)
To sum up, I think we can add TreeNodeRecursion to f_up in the future so as to make f_down and f_up symmetric, but I would rather do that in a follow-up PR and I'm not sure we need that feature now.
I see. Since we are doing a major refactor of the API, it may make sense to do certain things now instead of having yet another API change at some point.
If there is no technical roadblock to trying it out, it'd be great to see how the code looks when symmetrized. I'd be happy to review it if you want to try it, or I can try myself as well
I see. Since we are doing a major refactor of the API, it may make sense to do certain things now instead of having yet another API change at some point.
If there is no technical roadblock to trying it out, it'd be great to see how the code looks when symmetrized. I'd be happy to review it if you want to try it, or I can try myself as well
I'm happy to work on this but I think it would require significant change in Expr::map_children and since supporting Stop in rewrite is a completely new feature that we don't know if we ever need please consider adding it in a separete follow-up PR.
Anyways, let me know if symmetry is a must have for this PR and I can add it next week.
Actually, one more thing came into my mind. There is this issue with Transformed enum: https://github.com/apache/arrow-datafusion/pull/8835 and if we want to keep and fix it (see conversation on the ticket) it would also require change in TreeNode::rewrite() / TreeNodeRewriter signature.
If for some reason we don't want to make API changes incrementally then we might need incorporate that one as well.
IMO there is one more option we could do: because both supporting Stop in transform and making Transformed useful are new features we could also change the APIs to its final form in this PR (with documenting the limitations) and add implementations in follow-up PRs.
@ozankabak, please let me know your thoughts on the above. Do you think it is ok to add TreeNodeRecursion::Stop support and make TreeNodeRewriter symmetric in a follow-up PR?
I don't think we should make two API changes. @berkaysynnada will do some exploring and collaborate with you to get this PR to the finish line
I don't think we should make two API changes. @berkaysynnada will do some exploring and collaborate with you to get this PR to the finish line
Actually, I plan to make more fixes/changes to the TreeNode API in a few more PRs, just didn't want to overwhelm you with PRs to review. Is it necessary to have all changes in one PR or is it ok to land these PRs in one DataFusion release?
I think there are good natural boundaries here. We can think about the Transformed usage and the TreeNodeRecursion enum in one PR (this one), and then think about the remaining steps in a subsequent PR.
I think there are good natural boundaries here. We can think about the
Transformedusage and theTreeNodeRecursionenum in one PR (this one), and then think about the remaining steps in a subsequent PR.
Ok, then let me update the PR with those.
I would like to review and collaborate on this PR. Could you please notify me when it is ready?
I would like to review and collaborate on this PR. Could you please notify me when it is ready?
Sure, let me get back to you today or tomorrow. And thanks for the help!