datafusion-ballista icon indicating copy to clipboard operation
datafusion-ballista copied to clipboard

starts_with function is serialised as UDF

Open mpurins-coralogix opened this issue 3 years ago • 0 comments

Describe the bug When physical plan includes datafusion builtin function starts_with then it is serialized as UDF.

To Reproduce

Following test (can be added in somewhere in https://github.com/apache/arrow-ballista/blob/master/ballista/core/src/serde/physical_plan/mod.rs) should pass but it fails with DataFusionError(Plan("There is no UDF named \"startswith\" in the registry"))

#[test]
    fn roundtrip_builtin_startswith_function() -> Result<()> {
        let field_a = Field::new("a", DataType::Utf8, false);
        let field_b = Field::new("b", DataType::Utf8, false);
        let schema = Arc::new(Schema::new(vec![field_a, field_b]));

        let input = Arc::new(EmptyExec::new(false, schema.clone()));

        let execution_props = ExecutionProps::new();

        let fun_expr = functions::create_physical_fun(
            &BuiltinScalarFunction::StartsWith,
            &execution_props,
        )?;

        let expr = ScalarFunctionExpr::new(
            &format!("{}", BuiltinScalarFunction::StartsWith), // Note that this is how ballista creates this name in https://github.com/apache/arrow-datafusion/blob/master/datafusion/physical-expr/src/functions.rs#L183
            fun_expr,
            vec![col("a", &schema)?, col("b", &schema)?],
            &DataType::Boolean,
        );

        let project =
            ProjectionExec::try_new(vec![(Arc::new(expr), "a".to_string())], input)?;

        roundtrip_test(Arc::new(project))
    }

Expected behavior Above test should pass

Additional context This is pretty much happening because UDFs and builtin functions are represented with the same ScalarFunctionExpr and ballista when serializing uses function name (which is startswith) and BuiltinScalarFunction::from_str (which matches only on starts_with) to differentiate between builtin and user defined functions.

While simplest fix probably would be in datafusion BuiltinScalarFunction::from_str to handle startswith as function name as well I believe that it would be better to fix it in some other way where builtin and udf functions are thrown in the same bag. Current approach means that in some given project code could unexpectedly break on datafusion updates if datafusion introduces new builtin function and project already had user defined function under the same name.

I wonder if it would be reasonable to separate builtin and user defined functions already in datafusion physical plan.

mpurins-coralogix avatar Dec 24 '22 02:12 mpurins-coralogix