datafusion
datafusion copied to clipboard
RFC / Prototype user defined sql planner might look like
I don't intend to merge this PR.
Which issue does this PR close?
Prototype of what the API described in https://github.com/apache/datafusion/issues/10534 might look like
Inspired by https://github.com/apache/datafusion/pull/11155 from @jayzhan211 and @samuelcolvin on https://github.com/apache/datafusion/pull/11137 there
Rationale for this change
I feel like our sql planner needs some way for it to be extended so that we can plan sql queries with operators that are not hard coded into the core engine. There are currently two different approaches
- https://github.com/apache/datafusion/pull/11155 from @jayzhan211 for operators --> functions
- https://github.com/apache/datafusion/pull/11137 from @samuelcolvin for custom operations --> functions
What changes are included in this PR?
Prototype out what a "user defined planner" might look like
Are these changes tested?
Are there any user-facing changes?
I think this seems like a fair approach.
It's worth noting that we could do some of this with https://github.com/apache/datafusion/pull/11137 - e.g. we could move the array stuff into a ParseCustomOperator, I don't really mind which approach we go with provided the flexibility is there.
Indeed -- thanks. Let's see if @jayzhan211 has any thoughts (I think it may be their night)
It is a great idea overall! There are only two things that I think we could improve on
- Avoid clone like what you have mentioned, I don't have a good solution for now too.
- Extend it for more kinds of rewrite like FieldAccess
plan_field_access
Given the current design, if we want to support plan_field_access we need to add the function in trait, and doesn't look flexible enough if there are many kinds of custom planning
for planner in self.planners.iter() {
if let Some(expr) =
planner.plan_binary_op(op.clone(), left.clone(), right.clone(), schema)?
{
return Ok(expr);
}
if let Some(expr) =
planner.plan_indices(expr....)?
{
return Ok(expr);
}
...
}
We could define a more general function and enum that includes all the possible ast node. I'm not sure if this is general enough since these are the only two things that I know are rewritten to function in datafusion right now.
// Node from sqlparser that are possible for rewrite
pub enum AstNode (or Operators) {
BinaryOperator(sqlparser::ast::BinaryOperator),
Subscript(sqlparser::ast::Subscript), // For FieldAccess rewrite
OR GetFieldAccess(GetFieldAccess) // We can use GetFieldAccess directly, if there is little possible that they need to customize Subscript to GetFieldAccess logic.
}
/// This trait allows users to customize the behavior of the SQL planner
pub trait UserDefinedPlanner {
fn plan(&self, ast_node: AstNode, exprs: Vec<Expr>, _schema: &DFSchema) -> Result<Option<Expr>> {
Ok(None)
}
}
I find a way to avoid clone, returns tuple Result<(Option<PlannedNode>, Option<Expr>)> original_expr and new_expr mutually exclusive
pub struct PlannedNode {
pub op: sqlparser::ast::BinaryOperator,
pub left: Expr,
pub right: Expr,
}
/// This trait allows users to customize the behavior of the SQL planner
pub trait UserDefinedPlanner {
/// Plan the binary operation between two expressions, return None if not possible
fn plan_binary_op(
&self,
planned_node: PlannedNode,
_schema: &DFSchema,
) -> Result<(Option<PlannedNode>, Option<Expr>)> {
Ok((Some(planned_node), None))
}
}
fn build_logical_expr(
&self,
op: sqlparser::ast::BinaryOperator,
left: Expr,
right: Expr,
schema: &DFSchema,
) -> Result<Expr> {
let num_planners = self.planners.len();
// try extension planers
let mut planned_node = PlannedNode{op, left, right};
for (i, planner) in self.planners.iter().enumerate() {
match planner.plan_binary_op(planned_node, schema)? {
(None, Some(expr)) => return Ok(expr),
(Some(n), None) if i + 1 < num_planners => planned_node = n,
(Some(n), None) if i + 1 == num_planners => {
return Ok(Expr::BinaryExpr(BinaryExpr::new(
Box::new(n.left),
self.parse_sql_binary_op(n.op)?,
Box::new(n.right),
)))
}
_ => unreachable!(""),
}
}
unreachable!("")
}
I find a way to avoid clone, returns tuple Result<(Option<PlannedNode>, Option<Expr>)> original_expr and new_expr mutually exclusive
That is a good idea. I think we might be able to name the tuple with an explicitly enum, similar to ExprSimplifyResult which might be easier to understand
https://github.com/apache/datafusion/blob/c80da91e0fc6f0817b65cdf99fe2a50be9680fab/datafusion/expr/src/simplify.rs#L113-L119
Extend it for more kinds of rewrite like FieldAccess plan_field_access
Yes, good idea
Given the current design, if we want to support plan_field_access we need to add the function in trait, and doesn't look flexible enough if there are many kinds of custom planning
I agree there is a tradeoff between the number of functions in the trait vs how general this custom planning could be
We could define a more general function and enum that includes all the possible ast node. I'm not sure if this is general enough since these are the only two things that I know are rewritten to function in datafusion right now.
Yeah, I am not sure how much more custom planning would be needed. I think part of most special syntaxes have embedded sub expressions, so SqlToRel will still need code to know how to plan such exprs.
Thus, while less flexible, I think adding a specific function for each type of SQL ast node that we permit customization of would be the easiest to understand and use
Here is a proposal for how to proceed:
- Merge https://github.com/apache/datafusion/pull/11155 (so we don't have a third pattern to deal with)
- Make a PR with a production ready version of the API in this PR (sql planner extension)
What do you think @jayzhan211 and @samuelcolvin ? If you think this is a reasonable idea, do either if you have the time to do 2? If not, I will find the time to bash it out
Thus, while less flexible, I think adding a specific function for each type of SQL ast node that we permit customization of would be the easiest to understand and use
I agree, we can start from this, it is easy to extend it in the future if there are more kinds of plan
Make a PR with a production ready version of the API in this PR (sql planner extension)
I can work on this
This sounds like a good plan.
We (I) should also adapt/restart https://github.com/datafusion-contrib/datafusion-functions-json/pull/22 to work with this API to prove it does what we need before merging.
This sounds like a good plan.
We (I) should also adapt/restart datafusion-contrib/datafusion-functions-json#22 to work with this API to prove it does what we need before merging.
I agree -- sounds like a plan. Thank you both! I will make sure to review the code from @jayzhan211 as soon as possible so we can get this done quickly
This is going to be great!