datafusion
datafusion copied to clipboard
Add reference visitor `TreeNode` APIs
Which issue does this PR close?
Part of https://github.com/apache/datafusion/issues/10121, required for https://github.com/apache/datafusion/issues/10505 and https://github.com/apache/datafusion/issues/10426.
Rationale for this change
The current TreeNode visitor APIs (TreeNode::visit() and TreeNode::apply()) have a limiation due to the lifetimes of the TreeNode references passed to TreeNodeVisitor::f_down(), TreeNodeVisitor::f_up() and the f closure of apply() don't match the lifetime of the root TreeNode reference on which the APIs are called.
This restriction means that we can't build up data structures that contain references to descendant treenodes.
E.g. the following code snippet to collect how many times subexpressions occur in an expression tree doesn't work:
let e = sum((col("a") * (lit(1) - col("b"))) * (lit(1) + col("c")));
let mut m: HashMap<&Expr, usize> = HashMap::new();
e.apply(|e| {
*m.entry(e).or_insert(0) += 1;
Ok(TreeNodeRecursion::Continue)
});
println!("m: {:#?}", m);
This PR changes the TreeNode visitor APIs to make sure the lifetime of references match the lifetime of the root TreeNode reference so the above example will work.
Please note:
The LogicalPlan::apply_with_subqueries() and LogicalPlan::visit_with_subqueries() APIs, that are similar to TreeNode's base APIs but provide subquery support, can't be made stricter easily. This is because in LogicalPlan::apply_expressions() and LogicalPlan::apply_subqueries() we create temporary Expr::eq, Expr::Column and LogicalPlan::Subquery objects that are not compatible with the root treenode's lifetime.
What changes are included in this PR?
- This PR modifies
TreeNode::apply(),TreeNode::visit()andTreeNode::apply_children()APIs. TreeNodeimplementations (Expr,LogicalPlan,ConcreteTreeNodeandDynTreeNode) are amended to be able to implement the stricter APIs.
Are these changes tested?
Yes, with new UTs.
Are there any user-facing changes?
No.
cc @alamb, @berkaysynnada, @ozankabak
cc @alamb, @berkaysynnada, @ozankabak
Thanks @peter-toth. After a quick look, I started thinking it might be better to use this new ref_visitor API instead of keeping also the original one. I'll take a closer look tomorrow to see if that makes sense.
Thanks @peter-toth. After a quick look, I started thinking it might be better to use this new ref_visitor API instead of keeping also the original one. I'll take a closer look tomorrow to see if that makes sense.
Sure. We can go that way, but the changes needed will be bigger, see the description of the PR for the details.
I agree with @berkaysynnada in https://github.com/apache/datafusion/pull/10543#issuecomment-2115389289 that in an ideal world we woul change TreeNode::visit and TreeNode::apply, however as @peter-toth notes this would:
- Be a much larger change (and we would have to sort out how it works for
DynTreeNodes) - be a breaking API change
Thus, I suggest we merge this PR as is, and file a follow on PR to potentially unify the API. While it in an ideal world we wouldn't have both, I think this is a step in the right direction
Is that ok with you @berkaysynnada ?
Breaking the full migration into two PRs (with the latter one removing the old usage and migrating to the new one) sounds reasonable to me. It will probably make sense to do the follow-on quickly because we already will have one major API change (format options) in the next version. Instead of having two versions with two major API changes, it'd probably be better to have one version with these API changes lumped in together.
@peter-toth, how big do you think this PR will be if we do all the changes at once?
@ozankabak, let me check the changes required for the alternative today or tomorrow and come back to you.
I'm still working on an alternative to this PR and will need a couple of more days to test a few different ideas...
No worries. Will be happy to review and help iterate once you are ready
What do we think about merging this PR and filing a follow on ticket to unify the APIs?
What do we think about merging this PR and filing a follow on ticket to unify the APIs?
I'm ok with merging the current state of the PR. But I was also thinking about how to improve it:
As far as I see we have 3 options here:
- Adjust the current
TreeNodeimplementations to be compatible with the more strictapply_ref()/visit_ref()and then replace the currentapply()andvisit()implementation to their..._ref()versions. This change would require:- adding a new method to all
DynTreeNodes to retrun references to their children, - fixing
LogicalPlan::Joinas it only contains the left and right expressions in itson: Vec<(Expr, Expr)>field andLogicalPlan:apply_expressions()creates a temporaryExpr::eqinstance on the top of the pairs. But creating such temporary objects during visit is not compatible with the stricter API so we should changeLogicalPlan::Jointo containExpr::eqs directly. - fixing
Expr::Exists,Expr::InSubqueryandExpr::ScalarSubqueryas they only contain theSubquerystrcuts andLogicalPlan::apply_subqueries()creates a temporaryLogicalPlan::Subqueryinstance. We should change the 3 expressions to containLogicalPlan::Subquerys directly.
- adding a new method to all
- Keep the current
apply()/visit()and their..._ref()variants separate like this PR does. Maybe we could adjust the the implementation to not offerapply_ref()/visit_ref()(instead of their current throwing an error behaviour) onTreeNodesthat doesn't support it (DynTreeNodes). - Implement
apply()/visit()in a way that the references used in the API closures andf_down()/f_up()methods encode if the reference is a strict reference (whose lifetime match the root node's lifetime) or a "loose" reference whose reference doesn't match (used for returning references to temporary node instances). I.e. we could introduce a new tree node reference type:
This change would require to introduce apub enum TreeNodeRefImpl<'n, 'a, N: TreeNode> { Permanent(&'n N), // reference matches the root node's lifetime: 'n Temporary(&'a N), // reference doesn't match the root node's lifetime }TreeNodeReftrait and moving theapply()/visit()functionality to there. Both theTreeNode'sapply_children()implementations and the API user's closures andf_down()/f_up()methods would need to take into account if the currentTreeNodeRefis permanent or temporary, which would make the whole thing quite complex...
So far I've been playing with 3. but it became very complex and still doesn't fully work. Also, I'm no longer sure that such an API makes sense as the API user can't specify what kind of references they want. E.g. in CSE (https://github.com/apache/datafusion/pull/10473) I need permanent references and can't do anything with other references whose lifetime doesn't match the root node's lifetime.
So now I'm working on implementing 1., but it will be a pervasive change as there are 56 (ExecutionPlan) + 15 (PhysicalExpr) DynTreeNode implementations in the current codebase. They currently offer fn children(&self) -> Vec<Arc<dyn ExecutionPlan>> and fn children(&self) -> Vec<Arc<dyn PhysicalExpr>> methods to get their children and now I'm changing those to return Vec<&Arc<dyn ...>>.
But if you have a better idea please share.
I've rebased the PR on the latest main. The first commit is exactly the same as the previous state of this PR was, the second commit changes TreeNode:apply() and TreeNode:visit() APIs to use references whose lifetime matches the root treenode's lifetime.
The only TreeNode that was not compatible with this stricter APIs is DynTreeNode. I had to change it and its implementations to return children as Vec<&Arc<Self>> (instead of Vec<Arc<Self>>).
This PR doesn't modify the LogicalPlan::apply_with_subqueries() and LogicalPlan:visit_with_subqueries() APIs. This is becaususe in LogicalPlan::apply_expressions() and LogicalPlan::apply_subqueries() we create temporary Expr::eq, Expr::Column and LogicalPlan::Subquery objects that are not compatible with the root treenode's lifetime.
If we wanted to make the LogicalPlan visitor APIs stricter we would need further changes in LogicalPlan::Join, LogicalPlan::Unnest, LogicalPlan::Extension, Expr::Exists, Expr::InSubquery, Expr::ScalarSubquery enums, but I don't think we need those stricter APIs currently...
I've also updated the PR description.
Thanks a lot, @peter-toth! This looks great to me. The new children(&self) -> Vec<&Arc<dyn ExecutionPlan>> signature also uncovers implicit clones, and many of them seem to be redundant. Our TreeNode API functionality has reached a very good level. You've done a great job.
Perhaps we could add documentation with code snippets to exemplify the usage of methods and their purposes, to ease the experience for less experienced users.
I hope to review this PR later today or tomorrow
Awesome!
I plan to merge this PR tomorrow unless anyone else would like time to review.
I filed https://github.com/apache/datafusion/issues/10678 to handle visit_with_subqueries
I am waiting to merge this PR until the CI is fixed on main https://github.com/apache/datafusion/pull/10677
Thank you again @peter-toth @ozankabak and @berkaysynnada -- I am woking on some doc examples for this as well
I made these two PRs to add examples of using TreeNode:
- https://github.com/apache/datafusion/pull/10686
- https://github.com/apache/datafusion/pull/10687
Also a bonus to improve: https://github.com/apache/datafusion/pull/10685
Thanks all for the review!
@alamb, those API example PRs look great!