datafusion Support "User defined coercion" rules

Is your feature request related to a problem or challenge?

DataFusion automatically "coerces" (see docs here) input argument types to match the types required of operations or functions.

For functions, this is described by a desired TypeSignature

pub enum TypeSignature {
    Variadic(Vec<DataType>),
    VariadicEqual,
    VariadicAny,
    Uniform(usize, Vec<DataType>),
    Exact(Vec<DataType>),
    Any(usize),
    OneOf(Vec<TypeSignature>),
    ArraySignature(ArrayFunctionSignature),
}

However, some functions have special hard coded coercion logic such as sum and count (TODO link) as well as some Array functions like make_array. We started down the path of encoding the special array semantics into TypeSignature (see ArrayFunctionSignature))

However, as we continue to find other examples of different desired rules (most recently in sum and count), TypeSignature will grow and become more and more specialized

Describe the solution you'd like

@jayzhan211 had a great suggestion https://github.com/apache/datafusion/pull/10268#discussion_r1593511241 that in addition to encoding common coercion behaviors in TypeSignature, we can also add a variant of TypeSignature that permits user defined coercion rules

Describe alternatives you've considered

No response

Additional context

No response

May 08 '24 14:05 alamb

I found that to have coerce_types for both scalarUDF and aggregateUDF, I would need to introduce a more general trait for both function

impl UDFImpl for T {
    fn name(&self) -> &str {
       &self.name
    }

    fn coerce_types(&self, data_types: &[DataType]) -> Result<Vec<DataType>> {
        not_impl_err!("Function {} does not implement coerce_types", self.name)
    }
}

impl ScalarUDFImpl: UDFImpl
impl AggregateUDFImpl: UDFImpl

Then we can have

fn coerce_arguments_for_signature(
    expressions: Vec<Expr>,
    schema: &DFSchema,
    signature: &Signature,
    func: Arc<dyn UDFImpl>,
) -> Result<Vec<Expr>> {}

Another alternative is having a duplicate function for scalar and aggregate for a related function

fn coerce_arguments_for_signature(
    expressions: Vec<Expr>,
    schema: &DFSchema,
    signature: &Signature,
    func: &ScalarUDF,
) -> Result<Vec<Expr>> {}

fn coerce_arguments_for_signature(
    expressions: Vec<Expr>,
    schema: &DFSchema,
    signature: &Signature,
    func: &AggregateUDF,
) -> Result<Vec<Expr>> {}

I think the first option is potentially beneficial in the long run(?) but the user now needs to define two traits. The second option only increases the maintenance cost.

What do you think about this @alamb I also track if there is rust solution for this in https://users.rust-lang.org/t/inheritance-like-with-rust-trait/111102

May 09 '24 00:05 jayzhan211

It is a good observation that ScalarUDFImpl and AggregateUDFImpl don't share a common base trait and thus adding functionality that affects both requires duplication of code

I would need to introduce a more general trait for both function Another alternative is having a duplicate function for scalar and aggregate for a related function

I agree with your analysis of the tradeoffs: a common base trait would result in less duplication in DataFusion

However, I personally prefer duplicating coerce_arguments_for_signature in each trait rather than introducing a common base trait because:

It is backwards compatible (not an API change for the existing library of functions)
Makes it slightly easier to implement ScalarUDF and AggregateUDF (especially when new to rust) -- rather than two impls for your function, you only need one

May 09 '24 10:05 alamb