datafusion Fix TreeNode transform and rewrite APIs

Which issue does this PR close?

Part of https://github.com/apache/arrow-datafusion/issues/8913

Rationale for this change

The TreeNode.rewrite() / TreeNodeRewriter is very inconsistent with other TreeNode apply / visit functions:
- TreeNodeRewriter.pre_visit() currently returns RewriteRecursion so as to control the recursion but that enum is very inconsistent with the VisitRecursion enum that visit functions use.
- RewriteRecursion::Stop functionality is equal to VisitRecursion::Skip, that is to skip (prune) the subtree (don't recurse into children and don't run post-order functions).
- RewriteRecursion::Skip doesn't prevent recursing into children. This is different to VisitRecursion::Skip where recursion into children is not executed. VisitRecursion::Stop fully stops recursion which functionality isn't available with RewriteRecursion.
- The transformation defined in TreeNodeRewriter.mutate() is sometimes a top-down (pre-order) but other times a bottom-up (post-order) function depending on TreeNodeRewriter.pre_visit() result.
This PR removes RewriteRecursion and renames VisitRecursion to a common TreeNodeRecursion to control all (apply / visit / rewrite / transform) APIs. Previous VisitRecursion::Skip got renamed to Jump and its functionality is now extended to bottom-up and combined traversals as follows:
```
pub enum TreeNodeRecursion {
    /// Continue recursion with the next node.
    Continue,

    /// In top-down traversals skip recursing into children but continue with the next
    /// node, which actually means pruning of the subtree.
    /// In bottom-up traversals bypass calling bottom-up closures till the next leaf node.
    /// In combined traversals bypass calling bottom-up closures till the next top-down
    /// closure.
    Jump,

    /// Stop recursion.
    Stop,
}
```
and changes TreeNodeRewriter to incorporate an f_down and a nf_up methods that both can change the node and both can use the common Transformed struct (containing a TreeNodeRecursion enum) to control how to proceed with recursion:
```
pub trait TreeNodeRewriter: Sized {
    /// The node type which is rewritable.
    type Node: TreeNode;

    /// Invoked while traversing down the tree before any children are rewritten /
    /// visited.
    /// Default implementation returns the node unmodified and continues recursion.
    fn f_down(&mut self, node: Self::Node) -> Result<Transformed<Self::Node>> {
        Ok(Transformed::no(node)
    }

    /// Invoked while traversing up the tree after all children have been rewritten /
    /// visited.
    /// Default implementation returns the node unmodified.
    fn f_up(&mut self, node: Self::Node) -> Result<Transformed<Self::Node>> {
        Ok(Transformed::no(node)
    }
}
```
This solution is capable to replace all RewriteRecursion functionalities and make the TreeNodeRewriters not just cleaner but more versatile.
This PR modifies the fn transform<F>(self, op: &F) -> Result<Self>, that is currently an alias to transform_up(), to a more versatile:
```
 fn transform<FD, FU>(
     self,
     f_down: &mut FD,
     f_up: &mut FU,
 ) -> Result<Transformed<Self>>
 where
     FD: FnMut(Self) -> Result<Transformed<Self>>,
     FU: FnMut(Self) -> Result<Transformed<Self>>,
```
- Sometimes we don't need a full blown TreeNodeRewriter but defining 2 closures for top-down and bottom-up transformations is just enough.
- I think that that the missing direction definition of current transform() makes is a bit confusing and explicitely using transform_down() / transform_up() is more robust. So we could use transform() for the 2 closure transformation.

This PR changes the Transformed enum the following struct:

pub struct Transformed<T> {
    pub data: T,
    pub transformed: bool,
    pub tnr: TreeNodeRecursion,
}

This PR modifies the transform_down() and transform_up() methods to use the new Transformed struct.

Please note that this PR implements 2. from https://github.com/apache/arrow-datafusion/pull/7942 and fixes the Transformed enum issue: https://github.com/apache/arrow-datafusion/pull/8835

What changes are included in this PR?

This PR:

Removes RewriteRecursion,
Renames VisitRecursion to TreeNodeRecursion. The previous Skip element is renamed to Jump and bottom-up functionality is added.
Modifies TreeNodeRewriter to incorporate an f_down and f_up methods and use Transformed / TreeNodeRecursion,
Modifies TreeNode.transform() to accept an f_down and f_up closures and use Transformed / TreeNodeRecursion.
Modifies Transformed enum to a new struct.
Modifies TreeNode.transform_down() and TreeNode.transform_up() to use the Transformed struct.
Alligns all apply / visit / transform / rewrite APIs to use f, f_down, f_up closure parameter names.

Are these changes tested?

Existng tests.

Are there any user-facing changes?

Yes.

Jan 17 '24 08:01 peter-toth

cc @alamb, @andygrove, @yahoNanJing, @berkaysynnada this PR is another step to cleaner TreeNode APIs.

Jan 17 '24 12:01 peter-toth

This PR:

Removes RewriteRecursion, Renames VisitRecursion to TreeNodeRecursion, Modifies TreeNodeRewriter to incorporate an f_down and f_up methods and use TreeNodeRecursion

These changes seem reasonable to me and, since they propose a similar approach to other recursive flows, they make understanding easier.

However, I am not sure about this change:

Modifies TreeNode.transform() to accept an f_down and f_up closures and use TreeNodeRecursion.

transform() is a simple method for traversing the graph with given self-concerned rules, being not interested in the type of traversal. This new version forces awareness of the traversal order, and there is also an implicit convention of traversing first bottom-up, then top-down.

Jan 17 '24 13:01 berkaysynnada

Thanks for the feedback @berkaysynnada.

transform() is a simple method for traversing the graph with given self-concerned rules, being not interested in the type of traversal. This new version forces awareness of the traversal order, and there is also an implicit convention of traversing first bottom-up, then top-down.

I feel that a simple (simpler than than TreeNodeRewriter) API that is also capable to do both direction transformation in 1 traversal would be useful but I agree, it doesn't necessarily need to replace the existing transform().

Also, just to add a bit more details to this part of the PR: While I was local testing this PR I accidentally changed all transform() to transform_down() (so not to its current aliased transform_up() implementation) and run into stack overflow errors in some tests. This suggested me that despite transform() should be used only with closures that don't require any specific direction, some invocation do require transform to do its currently defined bottom-up traversal. I'm sure that most of the transform() usescases are direction agnostic, but I feel it would be less error prone if we required explicit direction from API users.

and there is also an implicit convention of traversing first bottom-up, then top-down.

I'm not sure I get this part.

Jan 17 '24 14:01 peter-toth

I'm not sure I get this part.

The docstring you have updated for transform() states that the order is:

1) f_down(ParentNode)
2) f_down(ChildNode1)
3) f_up(ChildNode1)
4) f_down(ChildNode2)
5) f_up(ChildNode2)
6) f_up(ParentNode)

I meant what if we intend to do such:

1) f_up(ChildNode1)
2) f_up(ChildNode2)
3) f_up(ParentNode)
4) f_down(ParentNode)
5) f_down(ChildNode1)
6) f_down(ChildNode2)

If we want to offer such a use, I think we may need to consider this way as well.

Jan 17 '24 14:01 berkaysynnada

I'm not sure I get this part.

The docstring you have updated for transform() states that the order is:
1) f_down(ParentNode)
2) f_down(ChildNode1)
3) f_up(ChildNode1)
4) f_down(ChildNode2)
5) f_up(ChildNode2)
6) f_up(ParentNode)

This is correct. This is because the proposed new transform() does only one traversal. It is exactly the same order how the old and new TreeNode::rewrite / TreeNodeRewriter worked and works.

If someone needs f_down to process all nodes atop-down nd then f_up to process all nodes bottom-up (or vice versa) then they need 2 traversals with subsequent explicit transform_down() / transform_up() calls.

Jan 17 '24 14:01 peter-toth

This is correct. This is because the proposed new transform() does only one traversal. It is exactly the same order how the old and new TreeNode::rewrite / TreeNodeRewriter worked and works.

I checked again. Yes, the order is correct. What I really meant is that when we use transform(), it is assumed that we first pass from top to bottom and then from bottom to top. But if we want to do the opposite, i.e. first from bottom to top and then from top to bottom, we cannot use this transform(), am I right?.

Since the nodes access their children, first top-down and then bottom-up traversals seem to be a single pass, doing it in reverse order is like 2 passes, but as a behavior, it may be desirable to have reverse order and should we provide a way for this?

Jan 17 '24 14:01 berkaysynnada

This is correct. This is because the proposed new transform() does only one traversal. It is exactly the same order how the old and new TreeNode::rewrite / TreeNodeRewriter worked and works.

I checked again. Yes, the order is correct. What I really meant is that when we use transform(), it is assumed that we first pass from top to bottom and then from bottom to top. But if we want to do the opposite, i.e. first from bottom to top and then from top to bottom, we cannot use this transform(), am I right?.

That's correct. I don't think that botom-up and then top-down can't be achieved in one run.

Since the nodes access their children, first top-down and then bottom-up traversals seem to be a single pass, doing it in reverse order is like 2 passes, but as a behavior, it may be desirable to have reverse order and should we provide a way for this?

API users can explicitly call transform_up() and then transform_down() (2 pass) if this is needed. Do you have a real use case for this?

Jan 17 '24 15:01 peter-toth

API users can explicitly call transform_up() and then transform_down() (2 pass) if this is needed. Do you have a real use case for this?

Since we set out to simplify and provide easy to use TreeNode API and its related implementations as much as possible, IMO we need to reach a simple state at the first step. As far as I observed, there is no use for this version of transform(), and it mixes the VisitRecursion logic with transform logic (early return option is a newly introduced feature for transform() methods). It may make sense to add such things when they are needed.

Jan 17 '24 16:01 berkaysynnada

Since we set out to simplify and provide easy to use TreeNode API and its related implementations as much as possible, IMO we need to reach a simple state at the first step. As far as I observed, there is no use for this version of transform(), and it mixes the VisitRecursion logic with transform logic (early return option is a newly introduced feature for transform() methods). It may make sense to add such things when they are needed.

Although I used the new transform only once in my PR to replace and simplify a rewrite() (https://github.com/apache/arrow-datafusion/pull/8891/files#diff-f69e720234d9da6cb3a4a178d3a5575fcd34f68191335a8c2465a172e8c4e6f1R653) I plan to use it in the future. But I'm ok with droping 2. from this PR. Let's hear some more feedback on this.

Jan 17 '24 16:01 peter-toth

Hi @peter-toth -- thank you for this PR and the various others that are trying to improve the quality of code in DataFusion

I think one thing that is challenging for me is to understand how to evaluate the benefits of this and related PRs in an objective manner.

This PR, for example, may improve the consistency of APIs (which I happen to think it does) but making this change will also come at a cost for DataFusion users:

If they use the TreeNode API they will have to change their code on upgrade
If they were familiar with the existing TreeNode API, they will have to learn the new API

For other breaking API changes, there are often other benefits that come with the cost, such as new features or improved performance.

I wonder if you can help us articulate how this (and the related refactoring PRs) help DataFusion users -- for example do they

permit us to implement Expression or plan node rewrites more efficiently?
Or will they allow us to consolidate / reduce code to reduce the maintenance burden to DataFusion maintainers

If we can connect this work to a benefit I suspect we would be able to drive it forward more quickly

Jan 18 '24 18:01 alamb

Thanks for the question @alamb! I'm new to DataFusion so this discussion could help me a lot to understand what ways are viable in terms of improving the existing code. I came from a different open-source query engine but I think TreeNode APIs are fundamental for each engine. While I was getting familiar with DataFusion I stumbled upon quite a few inconsistencies and missing features regarding these APIs. I was trying to throw in some ideas with my PRs that could help improving the situation. Some of these PRs are simply trying to improve consistency some others are performance related.

Here are a few things I found weird with the current APIs. These can cause a newcommer (like me) some confusion:

The apply / visit functions are capable to prune (skip) subtrees or return immediately (stop) but the transform functions are not. To make it more confusion rewrite is also capable to to prune, but instead of using a common way to control recursion, visit and rewrite use different enums with conflicting semantics. See this PR (https://github.com/apache/arrow-datafusion/pull/8891) for details on VisitRecursion vs RewriteRecursion.
There is this Transformed enum whose purpose is to explicitely define if any change was made to a node. This enum is used in transform clousres but for some reason it is missing from rewrite. Moreover this explicit information is neither propogatged up to API callee nor it is used for detecting changes in nodes. See details in https://github.com/apache/arrow-datafusion/pull/8835. I believe the current state simply doesn't make sense and just causes confusion.
rewrite also seems to be inconsistent with transform_up and transform_down. rewrite probably organically ended up in its current state and capable to do its current job, but what I would expect from such a function is that it simply incorporates 2 closures from transform_up and transform_down to define:
- what transformation should be done before (pre-order) and after (post-order) transformation of the node's children
- and how the recursion should continue .
See this PR (https://github.com/apache/arrow-datafusion/pull/8891) for the details. IMO the current TreeNodeRewriter.pre_visit() seems like a mess and its logic is hard to grasp.

What features I miss from the current API:

Pruning capability of transform functions: Pruning would be very important as many of the tranformations wouldn't require traversing on the whole tree and could improve performance a lot.
Payload propagation: I think the the reason why there are so many (4 base + 9 derived) tree implementations (with many code duplications) in DataFusion is that there is no API to describe a transformation while also propagating state. In https://github.com/apache/arrow-datafusion/pull/8664 with the help of transform_with_payload I not just eleminated 7 derived trees but also improved some of the algorightms to require only one traversal (also a perf improvement).

So to recap my ideal TreeNode transformation API would look like this:

    /// Transforms the tree using `f_down` and `f_up` closures. `f_down` is applied on a
    /// node while traversing the tree top-down (pre-order, before the node's children are
    /// visited) while `f_up` is applied on a node while traversing the tree bottom-up
    /// (post-order, after the the node's children are visited).
    ///
    /// The `f_down` closure takes
    /// - a `PD` type payload from its parent
    /// and returns a tuple made of:
    /// - a possibly modified node,
    /// - a `PC` type payload to pass to `f_up`,
    /// - a `Vec<FD>` type payload to propagate down to its children
    ///    (one `FD` element is propagated down to each child),
    /// - a `TreeNodeRecursion` enum element to control recursion flow.
    ///
    /// The `f_up` closure takes
    /// - a `PC` type payload from `f_down` and
    /// - a `Vec<PU>` type payload collected from its children
    /// and returns a tuple made of:
    /// - a possibly modified node,
    /// - a `FU` type payload to propagate up to its parent,
    /// - a `TreeNodeRecursion` enum element to control recursion flow.
    fn transform_with_payload<FD, PD, PC, FU, PU>(
        self,
        f_down: &mut FD,
        payload_down: PD,
        f_up: &mut FU,
    ) -> Result<(Self, PU)>
    where
        FD: FnMut(Self, PD) -> Result<(Self, Vec<PD>, PC, TreeNodeRecursion)>,
        FU: FnMut(Self, PC, Vec<PU>) -> Result<(Self, PU, TreeNodeRecursion)>,

Obviously the closure return types can be extracted to a type alias or strucs like in the case of f_down could be:

pub struct TransformDownResult<N, PD, PC> {
    pub transformed_node: N,
    pub payload_to_children: Vec<PD>,
    pub payload_to_f_up: PC,
    pub recursion_control: TreeNodeRecursion,
}

All the other functions should be just a specialization of the above:

apply (visit_down): Only f_down closure that doesn't return a modified node and payload (only retrurns TreeNodeRecursion)
visit / TreeNodeVisitor: Both f_down and f_up closures needed. The closures can be incorporated into a TreeNodeVisitor object but they don't return a modified node and payload
transform_down: Only f_down closure that doesn't return payload
transform_up: Only f_up closure that doesn't return payload
transform: Both f_down and f_up closures needed, but they don't return payloads
rewrite / TreeNodeRewriter: Both f_down and f_up are incorporated into a TreeNodeRewriter object, but they don't return a payload
transform_down_with_payload: Only f_down closure
transform_up_with_payload: Only f_up closure

I believe all these PRs would take us forward to a simple but still versarile APIs that have/are:

performance benefits (e.g. pruning during transformation + improved algorithms in https://github.com/apache/arrow-datafusion/pull/8664),
reduced code maintenance costs (e.g. see less dereived trees in https://github.com/apache/arrow-datafusion/pull/8664)
consistent and easy to grasp (e.g. this PR + https://github.com/apache/arrow-datafusion/pull/8835)

Now, as you mentioned some of these proposed changes have conflicting semantics to current APIs. Honestly I can't decide how much turbulance they would cause and so I'm seeking feedback from you and the community. But please note that I don't insist on any of the above functionaly or namings. These are just ideas and I can imagine implementations where:

we don't touch any existing API (maybe deprecate old ones)
or modify existing APIs to make them consistent, but also keep old API alternatives to ease the transition.

Jan 19 '24 10:01 peter-toth

I see -- thank you @peter-toth. Your explanation makes sense.

As a historical note, the current TreeNode API came out of unifying different in inconsistent APIs across the Exprs, LogicalPlan , ExectionPlan and PhysicalExprs. It was a large step forward at the time, but also just getting a single API was a major step forward.

Here are a few things I found weird with the current APIs. These can cause a newcommer (like me) some confusion:

I think this is a good insight -- basically that the rationale for this change is to make it easier for new contributors to contribute to DataFusion as they APIs are more uniform.

Now, as you mentioned some of these proposed changes have conflicting semantics to current APIs. Honestly I can't decide how much turbulance they would cause and so I'm seeking feedback from you and the community. But please note that I don't insist on any of the above functionaly or namings. These are just ideas and I can imagine implementations where:

we don't touch any existing API (maybe deprecate old ones) or modify existing APIs to make them consistent, but also keep old API alternatives to ease the transition.

I think implementing the "one true API" as you have proposed above and then migrating the rest of the codebase to use it (possibly deprecating the old APIs or changing them one at at time) would be the least disruptive.

To that end, what I will do is to create a new ticket describing this work so it is not lost in this ticket, and then we can use that ticket to make the overall plan visible

Jan 19 '24 20:01 alamb

I filed https://github.com/apache/arrow-datafusion/issues/8913 -- let me know what you think. What do you think about creating a PR with transform_with_payload and then a PR showing how to use it to improve one of the existing passes (or implement TreeNode::rewrite in terms of it or something?)

Jan 19 '24 20:01 alamb

I filed #8913 -- let me know what you think. What do you think about creating a PR with transform_with_payload and then a PR showing how to use it to improve one of the existing passes (or implement TreeNode::rewrite in terms of it or something?)

Thanks @alamb! I will share my thoughts regarding transform_with_payload there. Actually I wanted to wait with transform_with_payload (https://github.com/apache/arrow-datafusion/pull/8664) until @ozankabak finishes https://github.com/apache/arrow-datafusion/pull/8817 as those 2 PRs are conflicting (https://github.com/apache/arrow-datafusion/pull/8664#issuecomment-1884878223) so it is good that we have dedicated ticket to discuss the 2 solutions.

The API inconsitency of the rewrite API (see 1. in PR description) and the poposed new transform semantics (see 2.) don't requite transform_with_payload as a base API and this PR doesn't conflict with https://github.com/apache/arrow-datafusion/pull/8664 or https://github.com/apache/arrow-datafusion/pull/8817 so I think this PR is ready for review and feedback is welcome.

Jan 20 '24 07:01 peter-toth

So basically I trust you two to come up with something good. I would be happy to merge this PR, or wait for #8817 and evaluate first.

Yes -- let's first get #8817 in and then I and @berkaysynnada will review and go through this in detail and the others in detail

Jan 21 '24 13:01 ozankabak

So basically I trust you two to come up with something good. I would be happy to merge this PR, or wait for #8817 and evaluate first.

Yes -- let's first get #8817 in and then I and @berkaysynnada will review and go through this in detail and the others in detail

Sounds good to me. I left a comment https://github.com/apache/arrow-datafusion/pull/8817#issuecomment-1902666804 on https://github.com/apache/arrow-datafusion/pull/8817. It can help with further steps of the epic: https://github.com/apache/arrow-datafusion/issues/8913 Thanks @alamb, @ozankabak, @berkaysynnada.

Jan 21 '24 15:01 peter-toth

Thanks everyone. I will carefully review this PR after #8817 is merged and the related updates are done. Thank you again @peter-toth for your efforts!

Jan 22 '24 07:01 berkaysynnada

@alamb, @ozankabak, @berkaysynnada I rebased this PR after https://github.com/apache/arrow-datafusion/pull/8817 got merged. Please share your thoughts.

Jan 28 '24 09:01 peter-toth

One thing that raises my "why is it like that?" instinct is the distinction between return types of down and up functions. Is there a specific reason we'd have such an asymmetry?

Oh yes, there is. The only enum element that would make sense in f_up is TreeNodeRecursion::Stop to fully stop recursion immediately (Continue is the default behaviour, and Skip doesn't make sense as we already visited/transformed childrens). Currently we don't support TreeNodeRecursion::Stop in f_down either. (Maybe I forgot to document that.) This is because to support Stop we would need TreeNode.rewrite() / TreeNode.map_children() to be able to propagate Stop up. (TreeNode::rewrite() / TreeNode::map_children() would need to return TreeNodeRecursion besides the transformed node similarly to TreeNode::visit() / TreeNode::apply_children(), that returns TreeNodeRecursion, and so Stop is supported there.)

To sum up, I think we can add TreeNodeRecursion to f_up in the future so as to make f_down and f_up symmetric, but I would rather do that in a follow-up PR and I'm not sure we need that feature now.

Jan 28 '24 13:01 peter-toth

I see. Since we are doing a major refactor of the API, it may make sense to do certain things now instead of having yet another API change at some point.

If there is no technical roadblock to trying it out, it'd be great to see how the code looks when symmetrized. I'd be happy to review it if you want to try it, or I can try myself as well

Jan 28 '24 13:01 ozankabak

I see. Since we are doing a major refactor of the API, it may make sense to do certain things now instead of having yet another API change at some point.

If there is no technical roadblock to trying it out, it'd be great to see how the code looks when symmetrized. I'd be happy to review it if you want to try it, or I can try myself as well

I'm happy to work on this but I think it would require significant change in Expr::map_children and since supporting Stop in rewrite is a completely new feature that we don't know if we ever need please consider adding it in a separete follow-up PR.

Jan 28 '24 13:01 peter-toth

Anyways, let me know if symmetry is a must have for this PR and I can add it next week.

Jan 28 '24 14:01 peter-toth

Actually, one more thing came into my mind. There is this issue with Transformed enum: https://github.com/apache/arrow-datafusion/pull/8835 and if we want to keep and fix it (see conversation on the ticket) it would also require change in TreeNode::rewrite() / TreeNodeRewriter signature.

If for some reason we don't want to make API changes incrementally then we might need incorporate that one as well.

IMO there is one more option we could do: because both supporting Stop in transform and making Transformed useful are new features we could also change the APIs to its final form in this PR (with documenting the limitations) and add implementations in follow-up PRs.

Jan 28 '24 14:01 peter-toth

@ozankabak, please let me know your thoughts on the above. Do you think it is ok to add TreeNodeRecursion::Stop support and make TreeNodeRewriter symmetric in a follow-up PR?

Jan 30 '24 16:01 peter-toth

I don't think we should make two API changes. @berkaysynnada will do some exploring and collaborate with you to get this PR to the finish line

Jan 30 '24 16:01 ozankabak

I don't think we should make two API changes. @berkaysynnada will do some exploring and collaborate with you to get this PR to the finish line

Actually, I plan to make more fixes/changes to the TreeNode API in a few more PRs, just didn't want to overwhelm you with PRs to review. Is it necessary to have all changes in one PR or is it ok to land these PRs in one DataFusion release?

Jan 30 '24 16:01 peter-toth

I think there are good natural boundaries here. We can think about the Transformed usage and the TreeNodeRecursion enum in one PR (this one), and then think about the remaining steps in a subsequent PR.

Jan 30 '24 17:01 ozankabak

I think there are good natural boundaries here. We can think about the Transformed usage and the TreeNodeRecursion enum in one PR (this one), and then think about the remaining steps in a subsequent PR.

Ok, then let me update the PR with those.

Jan 30 '24 17:01 peter-toth

I would like to review and collaborate on this PR. Could you please notify me when it is ready?

Feb 01 '24 14:02 berkaysynnada

I would like to review and collaborate on this PR. Could you please notify me when it is ready?

Sure, let me get back to you today or tomorrow. And thanks for the help!

Feb 01 '24 14:02 peter-toth

datafusion datafusion copied to clipboard

Fix TreeNode transform and rewrite APIs

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

datafusion
datafusion copied to clipboard