Support "User defined coercion" rules
Is your feature request related to a problem or challenge?
DataFusion automatically "coerces" (see docs here) input argument types to match the types required of operations or functions.
For functions, this is described by a desired TypeSignature
pub enum TypeSignature {
Variadic(Vec<DataType>),
VariadicEqual,
VariadicAny,
Uniform(usize, Vec<DataType>),
Exact(Vec<DataType>),
Any(usize),
OneOf(Vec<TypeSignature>),
ArraySignature(ArrayFunctionSignature),
}
However, some functions have special hard coded coercion logic such as sum and count (TODO link) as well as some Array functions like make_array. We started down the path of encoding the special array semantics into TypeSignature (see ArrayFunctionSignature))
However, as we continue to find other examples of different desired rules (most recently in sum and count), TypeSignature will grow and become more and more specialized
Describe the solution you'd like
@jayzhan211 had a great suggestion https://github.com/apache/datafusion/pull/10268#discussion_r1593511241 that in addition to encoding common coercion behaviors in TypeSignature, we can also add a variant of TypeSignature that permits user defined coercion rules
Describe alternatives you've considered
No response
Additional context
No response
I found that to have coerce_types for both scalarUDF and aggregateUDF, I would need to introduce a more general trait for both function
impl UDFImpl for T {
fn name(&self) -> &str {
&self.name
}
fn coerce_types(&self, data_types: &[DataType]) -> Result<Vec<DataType>> {
not_impl_err!("Function {} does not implement coerce_types", self.name)
}
}
impl ScalarUDFImpl: UDFImpl
impl AggregateUDFImpl: UDFImpl
Then we can have
fn coerce_arguments_for_signature(
expressions: Vec<Expr>,
schema: &DFSchema,
signature: &Signature,
func: Arc<dyn UDFImpl>,
) -> Result<Vec<Expr>> {}
Another alternative is having a duplicate function for scalar and aggregate for a related function
fn coerce_arguments_for_signature(
expressions: Vec<Expr>,
schema: &DFSchema,
signature: &Signature,
func: &ScalarUDF,
) -> Result<Vec<Expr>> {}
fn coerce_arguments_for_signature(
expressions: Vec<Expr>,
schema: &DFSchema,
signature: &Signature,
func: &AggregateUDF,
) -> Result<Vec<Expr>> {}
I think the first option is potentially beneficial in the long run(?) but the user now needs to define two traits. The second option only increases the maintenance cost.
What do you think about this @alamb I also track if there is rust solution for this in https://users.rust-lang.org/t/inheritance-like-with-rust-trait/111102
It is a good observation that ScalarUDFImpl and AggregateUDFImpl don't share a common base trait and thus adding functionality that affects both requires duplication of code
I would need to introduce a more general trait for both function Another alternative is having a duplicate function for scalar and aggregate for a related function
I agree with your analysis of the tradeoffs: a common base trait would result in less duplication in DataFusion
However, I personally prefer duplicating coerce_arguments_for_signature in each trait rather than introducing a common base trait because:
- It is backwards compatible (not an API change for the existing library of functions)
- Makes it slightly easier to implement ScalarUDF and AggregateUDF (especially when new to rust) -- rather than two
impls for your function, you only need one