datafusion
datafusion copied to clipboard
Examples of using `TreeNode` APIs to walk and manipulate LogicalPlans
Is your feature request related to a problem or challenge?
The usecase of walking LogicalPlan to either analyze the plan or rewrite it has come up multiple times recently
Specifically I think both @goldmedal and @cisaacson have asked about this recently and so I think the documentation is not clear enough
Describe the solution you'd like
In order to help people do this, I would like to add some examples of how to walk the tree
I suggest first adding simple doc example in the docs of LogicalPlan itself
The example would do somethig like create a plan using LogicalPlanBuilder
Project
Filter
Scan
And then show how to:
- Use
TreeNode::applyto recursively find something about the scan (maybe the table name?) - Use the
TreeNode::rewriteto recursively rewrite and remove the Filter (or something)
I think those would be relatively short examples to write and would illustrate the key apis
Describe alternatives you've considered
Bonus points for a more full featured example (perhaps linked from the LogicalPlan docs) of how use / walk / manipulate LogicalPlans
Similarly to https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/expr_api.rs, perahaps datafusion-examples/examples/plan_walk.rs or something
Additional context
No response
@alamb This is a great idea. As I learn more perhaps I can help.
Thank you @cisaacson
Moving out of slack into Github so it might be more easily found
If your usecase is to to get the list of filters and tables that appear in a query, one way to do this is:
- Get the list of tables by finding all
LogicalPlan::TableScan(any Join will have an input from a TableScan) - Get the list of filters/predicates, look in
LogicalPlan::Filterand potentially the filters on theLogicalPlan::TableScan
So I think it could look something like (untested)
let mut referenced_tables = HashSet::new();
let mut filters = vec![];
// recursively visit all nodes in the tree (including subqueries)
logical_plan.apply_with_subqueries(|plan| {
match plan {
// record table names
LogicalPlan::TableScan(table_scan) => {
referenced_tables.insert(table_scan.table_name);
},
// record filters
LogicalPlan::Filter(filter) => {
filters.push(filter.predicate);
}
// ignore other nodes
_ => {}
Ok(TreeNodeRecursion::Continue)
}
})?;
This looks very good, pretty much what I implemented. The only question I have remaining is:
If TableScan has filters why would that not catch all filters? What does LogicalPlan::Filter catch that the TableScan.filters would not? For my specific use case I only need filters on the tables that are present in the query.
If TableScan has filters why would that not catch all filters?
Depending on the value of https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.TableProviderFilterPushDown.html sometimes filters would not be pushed down into the provider
Also filters can be duplicated in the table scan and filter (again depending on the value of the FilterPushDown
Got it, that makes total sense. I will only care about the accepted pushdown filters, so the TableScan will work for what we need.