datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

Discussion: make it easier for specify SQL --> function translation

Open alamb opened this issue 1 year ago • 5 comments

I extracted this from a conversation between @jayzhan211 and myself

Basically the usecase is that the sql planner converts things like [1, 2, 3] to make_array(1, 2, 3) and the name make_array is hard coded

          > @alamb Do you think we should also rewrite the array operator to function in parser step? It is currently in optimizer step. I think the downside of moving array rewrite in parser step is that if any user expects different array function with the same syntax, then they can't do it since we don't have "user-defined" parser mechanism now. But the benefit is that we can eliminate intermediate binary expression.

I agree that changing the parser to insert a call to get field access directly is a good idea (and would be consistent and allow us to remove Expr::GetFieldAccess

Have you thought about "user-defined" parser idea before, the way that user can define their own expression to get from the syntax? Is it useful in production? 🤔

One thing I have thought about is changing the hard coded lookup of function names from a pattern like this

https://github.com/apache/datafusion/blob/fc34dacdb9842cde4d056d5a659796ede4ae5e74/datafusion/sql/src/expr/value.rs#L144-L150

To be something more structured

pub trait PlannerFunctions {
  /// return the UDF to use for creating arrays ("make_array") by default:
  fn make_array(&self) -> Result<ScalarUDF>;
...
  // similar functions for other built in functions
}

And then instead of

 if let Some(udf) = self.context_provider.get_function_meta("make_array") { 
     Ok(Expr::ScalarFunction(ScalarFunction::new_udf(udf, values))) 
 } else { 
     not_impl_err!( 
         "array_expression featrue is disable, So should implement make_array UDF by yourself" 
     ) 
 } 

The planner might look like

  let udf = self.planner_functions.make_array()?;
  Ok(Expr::ScalarFunction(ScalarFunction::new_udf(udf, values))) 

But I haven't had a usecase to do that myself yet

Originally posted by @alamb in https://github.com/apache/datafusion/issues/10374#issuecomment-2111006004

alamb avatar May 15 '24 18:05 alamb

@jayzhan211 asks https://github.com/apache/datafusion/issues/10374#issuecomment-2111340945

@alamb Do you think we should also rewrite the array operator to function in parser step? It is currently in optimizer step. I think the downside of moving array rewrite in parser step is that if any user expects different array function with the same syntax, then they can't do it since we don't have "user-defined" parser mechanism now. But the benefit is that we can eliminate intermediate binary expression.

The array operator to function is syntax like array1 || array2 -> array_concat, which is in ArrayFunctionRewriter now, so I'm thinking about whether we should move this to the parser or not.

I personally think moving the translation of array1 || array --> array_concat to the parser is a better idea as it will be more efficient than trying to rewrite an expr after the fact

alamb avatar May 15 '24 18:05 alamb

I rethought the issue in #10102, and I found it is strongly related to the idea of the user-defined parser mentioned here, that we can define the returned Expr given the registered function.

jayzhan211 avatar May 16 '24 11:05 jayzhan211

In general, I think the idea of allowing users to customize the behavior of the sql planner is reasonable. However I am not entirely sure if we need to modify the planner itself, or if it would be a better approach for users to implement rewrite passes after the existing planner runs 🤔

alamb avatar May 16 '24 18:05 alamb

For anyone following along, @samuelcolvin @jayzhan211 and myself are discussing proposals for this API in

https://github.com/apache/datafusion/pull/11137 and https://github.com/apache/datafusion/pull/11168

alamb avatar Jun 29 '24 11:06 alamb

https://github.com/apache/datafusion/pull/11180 -- looking very nice

alamb avatar Jun 30 '24 11:06 alamb

To rewrite with sql planner

  • [ ] date_part
  • [ ] create_struct
  • [ ] create_named_struct
  • [ ] sql_overlay_to_expr
  • [ ] sql_position_to_expr
  • [ ] sql_substring_to_expr
  • [ ] sql_compound_identifier_to_expr

jayzhan211 avatar Jul 02 '24 00:07 jayzhan211

I filed https://github.com/apache/datafusion/issues/11207 to track the work to move the remaining functions to the user defined extension planner so closing this one

alamb avatar Jul 02 '24 10:07 alamb