datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

RFC / Prototype user defined sql planner might look like

Open alamb opened this issue 1 year ago • 9 comments

I don't intend to merge this PR.

Which issue does this PR close?

Prototype of what the API described in https://github.com/apache/datafusion/issues/10534 might look like

Inspired by https://github.com/apache/datafusion/pull/11155 from @jayzhan211 and @samuelcolvin on https://github.com/apache/datafusion/pull/11137 there

Rationale for this change

I feel like our sql planner needs some way for it to be extended so that we can plan sql queries with operators that are not hard coded into the core engine. There are currently two different approaches

  • https://github.com/apache/datafusion/pull/11155 from @jayzhan211 for operators --> functions
  • https://github.com/apache/datafusion/pull/11137 from @samuelcolvin for custom operations --> functions

What changes are included in this PR?

Prototype out what a "user defined planner" might look like

Are these changes tested?

Are there any user-facing changes?

alamb avatar Jun 28 '24 17:06 alamb

I think this seems like a fair approach.

It's worth noting that we could do some of this with https://github.com/apache/datafusion/pull/11137 - e.g. we could move the array stuff into a ParseCustomOperator, I don't really mind which approach we go with provided the flexibility is there.

Indeed -- thanks. Let's see if @jayzhan211 has any thoughts (I think it may be their night)

alamb avatar Jun 28 '24 19:06 alamb

It is a great idea overall! There are only two things that I think we could improve on

  1. Avoid clone like what you have mentioned, I don't have a good solution for now too.
  2. Extend it for more kinds of rewrite like FieldAccess plan_field_access

Given the current design, if we want to support plan_field_access we need to add the function in trait, and doesn't look flexible enough if there are many kinds of custom planning

        for planner in self.planners.iter() {
            if let Some(expr) =
                planner.plan_binary_op(op.clone(), left.clone(), right.clone(), schema)?
            {
                return Ok(expr);
            }
            if let Some(expr) =
                planner.plan_indices(expr....)?
            {
                return Ok(expr);
            }
            ...

        }

We could define a more general function and enum that includes all the possible ast node. I'm not sure if this is general enough since these are the only two things that I know are rewritten to function in datafusion right now.

// Node from sqlparser that are possible for rewrite
pub enum AstNode (or Operators) {
    BinaryOperator(sqlparser::ast::BinaryOperator),
    Subscript(sqlparser::ast::Subscript), // For FieldAccess rewrite
 OR GetFieldAccess(GetFieldAccess) //  We can use GetFieldAccess directly, if there is little possible that they need to customize Subscript to GetFieldAccess logic.
}

/// This trait allows users to customize the behavior of the SQL planner
pub trait UserDefinedPlanner {
    fn plan(&self, ast_node: AstNode, exprs: Vec<Expr>, _schema: &DFSchema) -> Result<Option<Expr>> {
        Ok(None)
    }
}

jayzhan211 avatar Jun 29 '24 01:06 jayzhan211

I find a way to avoid clone, returns tuple Result<(Option<PlannedNode>, Option<Expr>)> original_expr and new_expr mutually exclusive

pub struct PlannedNode {
    pub op: sqlparser::ast::BinaryOperator,
    pub left: Expr,
    pub right: Expr,
}

/// This trait allows users to customize the behavior of the SQL planner
pub trait UserDefinedPlanner {
    /// Plan the binary operation between two expressions, return None if not possible
    fn plan_binary_op(
        &self,
        planned_node: PlannedNode,
        _schema: &DFSchema,
    ) -> Result<(Option<PlannedNode>, Option<Expr>)> {
        Ok((Some(planned_node), None))
    }
}
fn build_logical_expr(
        &self,
        op: sqlparser::ast::BinaryOperator,
        left: Expr,
        right: Expr,
        schema: &DFSchema,
    ) -> Result<Expr> {
        let num_planners = self.planners.len();
        // try extension planers
        let mut planned_node = PlannedNode{op, left, right};
        for (i, planner) in self.planners.iter().enumerate() {
            match planner.plan_binary_op(planned_node, schema)? {
                (None, Some(expr)) => return Ok(expr),
                (Some(n), None) if i + 1 < num_planners => planned_node = n,
                (Some(n), None) if i + 1 == num_planners => {
                    return Ok(Expr::BinaryExpr(BinaryExpr::new(
                        Box::new(n.left),
                        self.parse_sql_binary_op(n.op)?,
                        Box::new(n.right),
                    )))
                }
                _ => unreachable!(""),
            }
        }
        unreachable!("")
    }

jayzhan211 avatar Jun 29 '24 05:06 jayzhan211

I find a way to avoid clone, returns tuple Result<(Option<PlannedNode>, Option<Expr>)> original_expr and new_expr mutually exclusive

That is a good idea. I think we might be able to name the tuple with an explicitly enum, similar to ExprSimplifyResult which might be easier to understand

https://github.com/apache/datafusion/blob/c80da91e0fc6f0817b65cdf99fe2a50be9680fab/datafusion/expr/src/simplify.rs#L113-L119

alamb avatar Jun 29 '24 10:06 alamb

Extend it for more kinds of rewrite like FieldAccess plan_field_access

Yes, good idea

Given the current design, if we want to support plan_field_access we need to add the function in trait, and doesn't look flexible enough if there are many kinds of custom planning

I agree there is a tradeoff between the number of functions in the trait vs how general this custom planning could be

We could define a more general function and enum that includes all the possible ast node. I'm not sure if this is general enough since these are the only two things that I know are rewritten to function in datafusion right now.

Yeah, I am not sure how much more custom planning would be needed. I think part of most special syntaxes have embedded sub expressions, so SqlToRel will still need code to know how to plan such exprs.

Thus, while less flexible, I think adding a specific function for each type of SQL ast node that we permit customization of would be the easiest to understand and use

alamb avatar Jun 29 '24 11:06 alamb

Here is a proposal for how to proceed:

  1. Merge https://github.com/apache/datafusion/pull/11155 (so we don't have a third pattern to deal with)
  2. Make a PR with a production ready version of the API in this PR (sql planner extension)

What do you think @jayzhan211 and @samuelcolvin ? If you think this is a reasonable idea, do either if you have the time to do 2? If not, I will find the time to bash it out

alamb avatar Jun 29 '24 11:06 alamb

Thus, while less flexible, I think adding a specific function for each type of SQL ast node that we permit customization of would be the easiest to understand and use

I agree, we can start from this, it is easy to extend it in the future if there are more kinds of plan

Make a PR with a production ready version of the API in this PR (sql planner extension)

I can work on this

jayzhan211 avatar Jun 29 '24 11:06 jayzhan211

This sounds like a good plan.

We (I) should also adapt/restart https://github.com/datafusion-contrib/datafusion-functions-json/pull/22 to work with this API to prove it does what we need before merging.

samuelcolvin avatar Jun 29 '24 11:06 samuelcolvin

This sounds like a good plan.

We (I) should also adapt/restart datafusion-contrib/datafusion-functions-json#22 to work with this API to prove it does what we need before merging.

I agree -- sounds like a plan. Thank you both! I will make sure to review the code from @jayzhan211 as soon as possible so we can get this done quickly

This is going to be great!

alamb avatar Jun 29 '24 17:06 alamb