datafusion
datafusion copied to clipboard
Fix TreeNode transform and rewrite APIs
Which issue does this PR close?
Part of https://github.com/apache/arrow-datafusion/issues/8913
Rationale for this change
-
The
TreeNode.rewrite()
/TreeNodeRewriter
is very inconsistent with otherTreeNode
apply / visit functions:-
TreeNodeRewriter.pre_visit()
currently returnsRewriteRecursion
so as to control the recursion but that enum is very inconsistent with theVisitRecursion
enum that visit functions use. -
RewriteRecursion::Stop
functionality is equal toVisitRecursion::Skip
, that is to skip (prune) the subtree (don't recurse into children and don't run post-order functions). -
RewriteRecursion::Skip
doesn't prevent recursing into children. This is different toVisitRecursion::Skip
where recursion into children is not executed.VisitRecursion::Stop
fully stops recursion which functionality isn't available withRewriteRecursion
. - The transformation defined in
TreeNodeRewriter.mutate()
is sometimes a top-down (pre-order) but other times a bottom-up (post-order) function depending onTreeNodeRewriter.pre_visit()
result.
This PR removes
RewriteRecursion
and renamesVisitRecursion
to a commonTreeNodeRecursion
to control all (apply / visit / rewrite / transform) APIs. PreviousVisitRecursion::Skip
got renamed toJump
and its functionality is now extended to bottom-up and combined traversals as follows:pub enum TreeNodeRecursion { /// Continue recursion with the next node. Continue, /// In top-down traversals skip recursing into children but continue with the next /// node, which actually means pruning of the subtree. /// In bottom-up traversals bypass calling bottom-up closures till the next leaf node. /// In combined traversals bypass calling bottom-up closures till the next top-down /// closure. Jump, /// Stop recursion. Stop, }
and changes
TreeNodeRewriter
to incorporate anf_down
and a nf_up
methods that both can change the node and both can use the commonTransformed
struct (containing aTreeNodeRecursion
enum) to control how to proceed with recursion:pub trait TreeNodeRewriter: Sized { /// The node type which is rewritable. type Node: TreeNode; /// Invoked while traversing down the tree before any children are rewritten / /// visited. /// Default implementation returns the node unmodified and continues recursion. fn f_down(&mut self, node: Self::Node) -> Result<Transformed<Self::Node>> { Ok(Transformed::no(node) } /// Invoked while traversing up the tree after all children have been rewritten / /// visited. /// Default implementation returns the node unmodified. fn f_up(&mut self, node: Self::Node) -> Result<Transformed<Self::Node>> { Ok(Transformed::no(node) } }
This solution is capable to replace all
RewriteRecursion
functionalities and make theTreeNodeRewriter
s not just cleaner but more versatile. -
-
This PR modifies the
fn transform<F>(self, op: &F) -> Result<Self>
, that is currently an alias totransform_up()
, to a more versatile:fn transform<FD, FU>( self, f_down: &mut FD, f_up: &mut FU, ) -> Result<Transformed<Self>> where FD: FnMut(Self) -> Result<Transformed<Self>>, FU: FnMut(Self) -> Result<Transformed<Self>>,
- Sometimes we don't need a full blown
TreeNodeRewriter
but defining 2 closures for top-down and bottom-up transformations is just enough. - I think that that the missing direction definition of current
transform()
makes is a bit confusing and explicitely usingtransform_down()
/transform_up()
is more robust. So we could usetransform()
for the 2 closure transformation.
- Sometimes we don't need a full blown
-
This PR changes the
Transformed
enum the following struct:pub struct Transformed<T> { pub data: T, pub transformed: bool, pub tnr: TreeNodeRecursion, }
-
This PR modifies the
transform_down()
andtransform_up()
methods to use the newTransformed
struct.
Please note that this PR implements 2. from https://github.com/apache/arrow-datafusion/pull/7942 and fixes the Transformed
enum issue: https://github.com/apache/arrow-datafusion/pull/8835
What changes are included in this PR?
This PR:
- Removes
RewriteRecursion
, - Renames
VisitRecursion
toTreeNodeRecursion
. The previousSkip
element is renamed toJump
and bottom-up functionality is added. - Modifies
TreeNodeRewriter
to incorporate anf_down
andf_up
methods and useTransformed
/TreeNodeRecursion
, - Modifies
TreeNode.transform()
to accept anf_down
andf_up
closures and useTransformed
/TreeNodeRecursion
. - Modifies
Transformed
enum to a new struct. - Modifies
TreeNode.transform_down()
andTreeNode.transform_up()
to use theTransformed
struct. - Alligns all
apply
/visit
/transform
/rewrite
APIs to usef
,f_down
,f_up
closure parameter names.
Are these changes tested?
Existng tests.
Are there any user-facing changes?
Yes.
cc @alamb, @andygrove, @yahoNanJing, @berkaysynnada this PR is another step to cleaner TreeNode
APIs.
This PR:
Removes RewriteRecursion, Renames VisitRecursion to TreeNodeRecursion, Modifies TreeNodeRewriter to incorporate an f_down and f_up methods and use TreeNodeRecursion
These changes seem reasonable to me and, since they propose a similar approach to other recursive flows, they make understanding easier.
However, I am not sure about this change:
Modifies TreeNode.transform() to accept an f_down and f_up closures and use TreeNodeRecursion.
transform()
is a simple method for traversing the graph with given self-concerned rules, being not interested in the type of traversal. This new version forces awareness of the traversal order, and there is also an implicit convention of traversing first bottom-up, then top-down.
Thanks for the feedback @berkaysynnada.
transform()
is a simple method for traversing the graph with given self-concerned rules, being not interested in the type of traversal. This new version forces awareness of the traversal order, and there is also an implicit convention of traversing first bottom-up, then top-down.
I feel that a simple (simpler than than TreeNodeRewriter
) API that is also capable to do both direction transformation in 1 traversal would be useful but I agree, it doesn't necessarily need to replace the existing transform()
.
Also, just to add a bit more details to this part of the PR: While I was local testing this PR I accidentally changed all transform()
to transform_down()
(so not to its current aliased transform_up()
implementation) and run into stack overflow errors in some tests. This suggested me that despite transform()
should be used only with closures that don't require any specific direction, some invocation do require transform
to do its currently defined bottom-up traversal. I'm sure that most of the transform()
usescases are direction agnostic, but I feel it would be less error prone if we required explicit direction from API users.
and there is also an implicit convention of traversing first bottom-up, then top-down.
I'm not sure I get this part.
I'm not sure I get this part.
The docstring you have updated for transform()
states that the order is:
1) f_down(ParentNode)
2) f_down(ChildNode1)
3) f_up(ChildNode1)
4) f_down(ChildNode2)
5) f_up(ChildNode2)
6) f_up(ParentNode)
I meant what if we intend to do such:
1) f_up(ChildNode1)
2) f_up(ChildNode2)
3) f_up(ParentNode)
4) f_down(ParentNode)
5) f_down(ChildNode1)
6) f_down(ChildNode2)
If we want to offer such a use, I think we may need to consider this way as well.
I'm not sure I get this part.
The docstring you have updated for
transform()
states that the order is:1) f_down(ParentNode) 2) f_down(ChildNode1) 3) f_up(ChildNode1) 4) f_down(ChildNode2) 5) f_up(ChildNode2) 6) f_up(ParentNode)
This is correct. This is because the proposed new transform()
does only one traversal. It is exactly the same order how the old and new TreeNode::rewrite
/ TreeNodeRewriter
worked and works.
If someone needs f_down
to process all nodes atop-down nd then f_up
to process all nodes bottom-up (or vice versa) then they need 2 traversals with subsequent explicit transform_down()
/ transform_up()
calls.
This is correct. This is because the proposed new
transform()
does only one traversal. It is exactly the same order how the old and newTreeNode::rewrite
/TreeNodeRewriter
worked and works.
I checked again. Yes, the order is correct. What I really meant is that when we use transform()
, it is assumed that we first pass from top to bottom and then from bottom to top. But if we want to do the opposite, i.e. first from bottom to top and then from top to bottom, we cannot use this transform()
, am I right?.
Since the nodes access their children, first top-down and then bottom-up traversals seem to be a single pass, doing it in reverse order is like 2 passes, but as a behavior, it may be desirable to have reverse order and should we provide a way for this?
This is correct. This is because the proposed new
transform()
does only one traversal. It is exactly the same order how the old and newTreeNode::rewrite
/TreeNodeRewriter
worked and works.I checked again. Yes, the order is correct. What I really meant is that when we use
transform()
, it is assumed that we first pass from top to bottom and then from bottom to top. But if we want to do the opposite, i.e. first from bottom to top and then from top to bottom, we cannot use thistransform()
, am I right?.
That's correct. I don't think that botom-up and then top-down can't be achieved in one run.
Since the nodes access their children, first top-down and then bottom-up traversals seem to be a single pass, doing it in reverse order is like 2 passes, but as a behavior, it may be desirable to have reverse order and should we provide a way for this?
API users can explicitly call transform_up()
and then transform_down()
(2 pass) if this is needed. Do you have a real use case for this?
API users can explicitly call
transform_up()
and thentransform_down()
(2 pass) if this is needed. Do you have a real use case for this?
Since we set out to simplify and provide easy to use TreeNode API and its related implementations as much as possible, IMO we need to reach a simple state at the first step. As far as I observed, there is no use for this version of transform()
, and it mixes the VisitRecursion
logic with transform logic (early return option is a newly introduced feature for transform()
methods). It may make sense to add such things when they are needed.
Since we set out to simplify and provide easy to use TreeNode API and its related implementations as much as possible, IMO we need to reach a simple state at the first step. As far as I observed, there is no use for this version of
transform()
, and it mixes theVisitRecursion
logic with transform logic (early return option is a newly introduced feature fortransform()
methods). It may make sense to add such things when they are needed.
Although I used the new transform
only once in my PR to replace and simplify a rewrite()
(https://github.com/apache/arrow-datafusion/pull/8891/files#diff-f69e720234d9da6cb3a4a178d3a5575fcd34f68191335a8c2465a172e8c4e6f1R653) I plan to use it in the future. But I'm ok with droping 2. from this PR. Let's hear some more feedback on this.
Hi @peter-toth -- thank you for this PR and the various others that are trying to improve the quality of code in DataFusion
I think one thing that is challenging for me is to understand how to evaluate the benefits of this and related PRs in an objective manner.
This PR, for example, may improve the consistency of APIs (which I happen to think it does) but making this change will also come at a cost for DataFusion users:
- If they use the
TreeNode
API they will have to change their code on upgrade - If they were familiar with the existing TreeNode API, they will have to learn the new API
For other breaking API changes, there are often other benefits that come with the cost, such as new features or improved performance.
I wonder if you can help us articulate how this (and the related refactoring PRs) help DataFusion users -- for example do they
- permit us to implement Expression or plan node rewrites more efficiently?
- Or will they allow us to consolidate / reduce code to reduce the maintenance burden to DataFusion maintainers
If we can connect this work to a benefit I suspect we would be able to drive it forward more quickly
Thanks for the question @alamb! I'm new to DataFusion so this discussion could help me a lot to understand what ways are viable in terms of improving the existing code. I came from a different open-source query engine but I think TreeNode
APIs are fundamental for each engine. While I was getting familiar with DataFusion I stumbled upon quite a few inconsistencies and missing features regarding these APIs. I was trying to throw in some ideas with my PRs that could help improving the situation. Some of these PRs are simply trying to improve consistency some others are performance related.
Here are a few things I found weird with the current APIs. These can cause a newcommer (like me) some confusion:
-
The
apply
/visit
functions are capable to prune (skip) subtrees or return immediately (stop) but thetransform
functions are not. To make it more confusionrewrite
is also capable to to prune, but instead of using a common way to control recursion,visit
andrewrite
use different enums with conflicting semantics. See this PR (https://github.com/apache/arrow-datafusion/pull/8891) for details onVisitRecursion
vsRewriteRecursion
. -
There is this
Transformed
enum whose purpose is to explicitely define if any change was made to a node. This enum is used intransform
clousres but for some reason it is missing fromrewrite
. Moreover this explicit information is neither propogatged up to API callee nor it is used for detecting changes in nodes. See details in https://github.com/apache/arrow-datafusion/pull/8835. I believe the current state simply doesn't make sense and just causes confusion. -
rewrite
also seems to be inconsistent withtransform_up
andtransform_down
.rewrite
probably organically ended up in its current state and capable to do its current job, but what I would expect from such a function is that it simply incorporates 2 closures fromtransform_up
andtransform_down
to define:- what transformation should be done before (pre-order) and after (post-order) transformation of the node's children
- and how the recursion should continue .
See this PR (https://github.com/apache/arrow-datafusion/pull/8891) for the details. IMO the current
TreeNodeRewriter.pre_visit()
seems like a mess and its logic is hard to grasp.
What features I miss from the current API:
- Pruning capability of
transform
functions: Pruning would be very important as many of the tranformations wouldn't require traversing on the whole tree and could improve performance a lot. - Payload propagation:
I think the the reason why there are so many (4 base + 9 derived) tree implementations (with many code duplications) in DataFusion is that there is no API to describe a transformation while also propagating state. In https://github.com/apache/arrow-datafusion/pull/8664 with the help of
transform_with_payload
I not just eleminated 7 derived trees but also improved some of the algorightms to require only one traversal (also a perf improvement).
So to recap my ideal TreeNode
transformation API would look like this:
/// Transforms the tree using `f_down` and `f_up` closures. `f_down` is applied on a
/// node while traversing the tree top-down (pre-order, before the node's children are
/// visited) while `f_up` is applied on a node while traversing the tree bottom-up
/// (post-order, after the the node's children are visited).
///
/// The `f_down` closure takes
/// - a `PD` type payload from its parent
/// and returns a tuple made of:
/// - a possibly modified node,
/// - a `PC` type payload to pass to `f_up`,
/// - a `Vec<FD>` type payload to propagate down to its children
/// (one `FD` element is propagated down to each child),
/// - a `TreeNodeRecursion` enum element to control recursion flow.
///
/// The `f_up` closure takes
/// - a `PC` type payload from `f_down` and
/// - a `Vec<PU>` type payload collected from its children
/// and returns a tuple made of:
/// - a possibly modified node,
/// - a `FU` type payload to propagate up to its parent,
/// - a `TreeNodeRecursion` enum element to control recursion flow.
fn transform_with_payload<FD, PD, PC, FU, PU>(
self,
f_down: &mut FD,
payload_down: PD,
f_up: &mut FU,
) -> Result<(Self, PU)>
where
FD: FnMut(Self, PD) -> Result<(Self, Vec<PD>, PC, TreeNodeRecursion)>,
FU: FnMut(Self, PC, Vec<PU>) -> Result<(Self, PU, TreeNodeRecursion)>,
Obviously the closure return types can be extracted to a type alias or struc
s like in the case of f_down
could be:
pub struct TransformDownResult<N, PD, PC> {
pub transformed_node: N,
pub payload_to_children: Vec<PD>,
pub payload_to_f_up: PC,
pub recursion_control: TreeNodeRecursion,
}
All the other functions should be just a specialization of the above:
-
apply
(visit_down
): Onlyf_down
closure that doesn't return a modified node and payload (only retrurnsTreeNodeRecursion
) -
visit
/TreeNodeVisitor
: Bothf_down
andf_up
closures needed. The closures can be incorporated into aTreeNodeVisitor
object but they don't return a modified node and payload -
transform_down
: Onlyf_down
closure that doesn't return payload -
transform_up
: Onlyf_up
closure that doesn't return payload -
transform
: Bothf_down
andf_up
closures needed, but they don't return payloads -
rewrite
/TreeNodeRewriter
: Bothf_down
andf_up
are incorporated into aTreeNodeRewriter
object, but they don't return a payload -
transform_down_with_payload
: Onlyf_down
closure -
transform_up_with_payload
: Onlyf_up
closure
I believe all these PRs would take us forward to a simple but still versarile APIs that have/are:
- performance benefits (e.g. pruning during transformation + improved algorithms in https://github.com/apache/arrow-datafusion/pull/8664),
- reduced code maintenance costs (e.g. see less dereived trees in https://github.com/apache/arrow-datafusion/pull/8664)
- consistent and easy to grasp (e.g. this PR + https://github.com/apache/arrow-datafusion/pull/8835)
Now, as you mentioned some of these proposed changes have conflicting semantics to current APIs. Honestly I can't decide how much turbulance they would cause and so I'm seeking feedback from you and the community. But please note that I don't insist on any of the above functionaly or namings. These are just ideas and I can imagine implementations where:
- we don't touch any existing API (maybe deprecate old ones)
- or modify existing APIs to make them consistent, but also keep old API alternatives to ease the transition.
I see -- thank you @peter-toth. Your explanation makes sense.
As a historical note, the current TreeNode API came out of unifying different in inconsistent APIs across the Exprs, LogicalPlan , ExectionPlan and PhysicalExprs. It was a large step forward at the time, but also just getting a single API was a major step forward.
Here are a few things I found weird with the current APIs. These can cause a newcommer (like me) some confusion:
I think this is a good insight -- basically that the rationale for this change is to make it easier for new contributors to contribute to DataFusion as they APIs are more uniform.
Now, as you mentioned some of these proposed changes have conflicting semantics to current APIs. Honestly I can't decide how much turbulance they would cause and so I'm seeking feedback from you and the community. But please note that I don't insist on any of the above functionaly or namings. These are just ideas and I can imagine implementations where:
we don't touch any existing API (maybe deprecate old ones) or modify existing APIs to make them consistent, but also keep old API alternatives to ease the transition.
I think implementing the "one true API" as you have proposed above and then migrating the rest of the codebase to use it (possibly deprecating the old APIs or changing them one at at time) would be the least disruptive.
To that end, what I will do is to create a new ticket describing this work so it is not lost in this ticket, and then we can use that ticket to make the overall plan visible
I filed https://github.com/apache/arrow-datafusion/issues/8913 -- let me know what you think. What do you think about creating a PR with transform_with_payload
and then a PR showing how to use it to improve one of the existing passes (or implement TreeNode::rewrite in terms of it or something?)
I filed #8913 -- let me know what you think. What do you think about creating a PR with
transform_with_payload
and then a PR showing how to use it to improve one of the existing passes (or implement TreeNode::rewrite in terms of it or something?)
Thanks @alamb! I will share my thoughts regarding transform_with_payload
there. Actually I wanted to wait with transform_with_payload
(https://github.com/apache/arrow-datafusion/pull/8664) until @ozankabak finishes https://github.com/apache/arrow-datafusion/pull/8817 as those 2 PRs are conflicting (https://github.com/apache/arrow-datafusion/pull/8664#issuecomment-1884878223) so it is good that we have dedicated ticket to discuss the 2 solutions.
The API inconsitency of the rewrite
API (see 1. in PR description) and the poposed new transform
semantics (see 2.) don't requite transform_with_payload
as a base API and this PR doesn't conflict with https://github.com/apache/arrow-datafusion/pull/8664 or https://github.com/apache/arrow-datafusion/pull/8817 so I think this PR is ready for review and feedback is welcome.
So basically I trust you two to come up with something good. I would be happy to merge this PR, or wait for #8817 and evaluate first.
Yes -- let's first get #8817 in and then I and @berkaysynnada will review and go through this in detail and the others in detail
So basically I trust you two to come up with something good. I would be happy to merge this PR, or wait for #8817 and evaluate first.
Yes -- let's first get #8817 in and then I and @berkaysynnada will review and go through this in detail and the others in detail
Sounds good to me. I left a comment https://github.com/apache/arrow-datafusion/pull/8817#issuecomment-1902666804 on https://github.com/apache/arrow-datafusion/pull/8817. It can help with further steps of the epic: https://github.com/apache/arrow-datafusion/issues/8913 Thanks @alamb, @ozankabak, @berkaysynnada.
Thanks everyone. I will carefully review this PR after #8817 is merged and the related updates are done. Thank you again @peter-toth for your efforts!
@alamb, @ozankabak, @berkaysynnada I rebased this PR after https://github.com/apache/arrow-datafusion/pull/8817 got merged. Please share your thoughts.
One thing that raises my "why is it like that?" instinct is the distinction between return types of down and up functions. Is there a specific reason we'd have such an asymmetry?
Oh yes, there is. The only enum element that would make sense in f_up
is TreeNodeRecursion::Stop
to fully stop recursion immediately (Continue
is the default behaviour, and Skip
doesn't make sense as we already visited/transformed childrens).
Currently we don't support TreeNodeRecursion::Stop
in f_down
either. (Maybe I forgot to document that.) This is because to support Stop
we would need TreeNode.rewrite()
/ TreeNode.map_children()
to be able to propagate Stop
up. (TreeNode::rewrite()
/ TreeNode::map_children()
would need to return TreeNodeRecursion
besides the transformed node similarly to TreeNode::visit()
/ TreeNode::apply_children()
, that returns TreeNodeRecursion
, and so Stop
is supported there.)
To sum up, I think we can add TreeNodeRecursion
to f_up
in the future so as to make f_down
and f_up
symmetric, but I would rather do that in a follow-up PR and I'm not sure we need that feature now.
I see. Since we are doing a major refactor of the API, it may make sense to do certain things now instead of having yet another API change at some point.
If there is no technical roadblock to trying it out, it'd be great to see how the code looks when symmetrized. I'd be happy to review it if you want to try it, or I can try myself as well
I see. Since we are doing a major refactor of the API, it may make sense to do certain things now instead of having yet another API change at some point.
If there is no technical roadblock to trying it out, it'd be great to see how the code looks when symmetrized. I'd be happy to review it if you want to try it, or I can try myself as well
I'm happy to work on this but I think it would require significant change in Expr::map_children
and since supporting Stop
in rewrite is a completely new feature that we don't know if we ever need please consider adding it in a separete follow-up PR.
Anyways, let me know if symmetry is a must have for this PR and I can add it next week.
Actually, one more thing came into my mind. There is this issue with Transformed
enum: https://github.com/apache/arrow-datafusion/pull/8835 and if we want to keep and fix it (see conversation on the ticket) it would also require change in TreeNode::rewrite()
/ TreeNodeRewriter
signature.
If for some reason we don't want to make API changes incrementally then we might need incorporate that one as well.
IMO there is one more option we could do: because both supporting Stop
in transform and making Transformed
useful are new features we could also change the APIs to its final form in this PR (with documenting the limitations) and add implementations in follow-up PRs.
@ozankabak, please let me know your thoughts on the above. Do you think it is ok to add TreeNodeRecursion::Stop
support and make TreeNodeRewriter
symmetric in a follow-up PR?
I don't think we should make two API changes. @berkaysynnada will do some exploring and collaborate with you to get this PR to the finish line
I don't think we should make two API changes. @berkaysynnada will do some exploring and collaborate with you to get this PR to the finish line
Actually, I plan to make more fixes/changes to the TreeNode API in a few more PRs, just didn't want to overwhelm you with PRs to review. Is it necessary to have all changes in one PR or is it ok to land these PRs in one DataFusion release?
I think there are good natural boundaries here. We can think about the Transformed
usage and the TreeNodeRecursion
enum in one PR (this one), and then think about the remaining steps in a subsequent PR.
I think there are good natural boundaries here. We can think about the
Transformed
usage and theTreeNodeRecursion
enum in one PR (this one), and then think about the remaining steps in a subsequent PR.
Ok, then let me update the PR with those.
I would like to review and collaborate on this PR. Could you please notify me when it is ready?
I would like to review and collaborate on this PR. Could you please notify me when it is ready?
Sure, let me get back to you today or tomorrow. And thanks for the help!