datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Examples of using `TreeNode` APIs to walk and manipulate LogicalPlans

Open alamb opened this issue 1 year ago • 2 comments

Is your feature request related to a problem or challenge?

The usecase of walking LogicalPlan to either analyze the plan or rewrite it has come up multiple times recently

Specifically I think both @goldmedal and @cisaacson have asked about this recently and so I think the documentation is not clear enough

Describe the solution you'd like

In order to help people do this, I would like to add some examples of how to walk the tree

I suggest first adding simple doc example in the docs of LogicalPlan itself

The example would do somethig like create a plan using LogicalPlanBuilder

  Project
    Filter
     Scan

And then show how to:

  1. Use TreeNode::apply to recursively find something about the scan (maybe the table name?)
  2. Use the TreeNode::rewrite to recursively rewrite and remove the Filter (or something)

I think those would be relatively short examples to write and would illustrate the key apis

Describe alternatives you've considered

Bonus points for a more full featured example (perhaps linked from the LogicalPlan docs) of how use / walk / manipulate LogicalPlans

Similarly to https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/expr_api.rs, perahaps datafusion-examples/examples/plan_walk.rs or something

Additional context

No response

alamb avatar May 22 '24 20:05 alamb

@alamb This is a great idea. As I learn more perhaps I can help.

cisaacson avatar May 23 '24 16:05 cisaacson

Thank you @cisaacson

alamb avatar May 23 '24 17:05 alamb

Moving out of slack into Github so it might be more easily found

If your usecase is to to get the list of filters and tables that appear in a query, one way to do this is:

  • Get the list of tables by finding all LogicalPlan::TableScan (any Join will have an input from a TableScan)
  • Get the list of filters/predicates, look in LogicalPlan::Filter and potentially the filters on the LogicalPlan::TableScan

So I think it could look something like (untested)

let mut referenced_tables = HashSet::new();
let mut filters = vec![];

// recursively visit all nodes in the tree (including subqueries)
logical_plan.apply_with_subqueries(|plan| {
  match plan {
   // record table names
    LogicalPlan::TableScan(table_scan) => { 
      referenced_tables.insert(table_scan.table_name);
   },
    // record filters
    LogicalPlan::Filter(filter) => {
      filters.push(filter.predicate);
     }
    // ignore other nodes
    _ => {}
     
      Ok(TreeNodeRecursion::Continue)
  }
 })?;

alamb avatar May 24 '24 10:05 alamb

This looks very good, pretty much what I implemented. The only question I have remaining is:

If TableScan has filters why would that not catch all filters? What does LogicalPlan::Filter catch that the TableScan.filters would not? For my specific use case I only need filters on the tables that are present in the query.

cisaacson avatar May 24 '24 14:05 cisaacson

If TableScan has filters why would that not catch all filters?

Depending on the value of https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.TableProviderFilterPushDown.html sometimes filters would not be pushed down into the provider

Also filters can be duplicated in the table scan and filter (again depending on the value of the FilterPushDown

alamb avatar May 24 '24 14:05 alamb

Got it, that makes total sense. I will only care about the accepted pushdown filters, so the TableScan will work for what we need.

cisaacson avatar May 24 '24 15:05 cisaacson