ARROW-16212: [C++][Python] Register Multiple Kernels for a UDF
This PR adds support for registering multiple kernels for a single scalar UDF in Python.
cc @pitrou @jorisvandenbossche @westonpace
Could you please review this PR?
https://issues.apache.org/jira/browse/ARROW-16212
@vibhatha Can you please focus on Arrow 10.0.0 release blockers for now? Thank you.
Of course, thanks for the ping. This PR was picked up before that issue was assigned (yesterday), and I have just finalized it. I am now working on the issue I picked up yesterday.
cc @westonpace
A few nits.
My main concern is that, rather than having input_types be a dictionary, I think it would be best to have the argument names as a separate field. Otherwise it forces the user to grapple with questions like "why is it a dictionary? What am I supposed to use for the key? Can different rows have different keys?" when the API could be more intuitive:

def test_multi_kernel_registration():
    def unary_function(ctx, x):
        return pc.cast(pc.call_function("multiply", [x, 2],
                                        memory_pool=ctx.memory_pool), x.type)

    func_name = "y=x*2"
    unary_doc = {"summary": "multiply by two function",
                 "description": "test multiply function"}
    input_types = [
        pa.int8(),
        pa.int16(),
        pa.int32(),
        pa.int64(),
        pa.float32(),
        pa.float64()
    ]
    input_names = ["array"]
    output_types = [
        pa.int8(),
        pa.int16(),
        pa.int32(),
        pa.int64(),
        pa.float32(),
        pa.float64()
    ]
    pc.register_scalar_function(unary_function, func_name, unary_doc,
                                input_types, input_names, output_types)
I see your point, but I have a counter-argument (just my opinion).
If we keep a separate list, it becomes unclear which name goes with which type; keeping them together makes it readable. We had a case with aggregates where keeping them separate made things confusing. I am not saying this is exactly the same situation, but it is close enough. I would rather propose clearly documenting what we do, with a few examples in the docstring.
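To make the two shapes concrete (a rough sketch based on the example above; the exact parameter names were still in flux in this PR):

# Dictionary form: argument names and input types travel together,
# so there is no separate length to validate.
input_types = {"array": pa.int64()}

# Separate-fields form: parallel sequences whose lengths must match.
input_types = [pa.int64()]
input_names = ["array"]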
We spoke about this briefly, but I will add my comment here as well. I get your point about the aggregates; having two vectors that must match in length is not an ideal API. However, I still think it is better than having to constantly repeat ourselves. I'm not 100% against it, so if we want to keep the status quo we can.
Yes that's a fair argument. I will work on it.
@westonpace I updated the PR and added a test case to validate that the lengths of input_types and input_names match.
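For reference, a minimal sketch of what such a validation test might look like (hypothetical: it assumes a mismatch raises ValueError and uses the separate-fields signature discussed above, neither of which is confirmed by the final implementation):

import pytest
import pyarrow as pa
import pyarrow.compute as pc

def test_input_type_name_length_mismatch():
    def unary_function(ctx, x):
        return x

    doc = {"summary": "identity", "description": "identity function"}
    # Two input types but only one name: registration should fail.
    with pytest.raises(ValueError):
        pc.register_scalar_function(unary_function, "identity", doc,
                                    [pa.int64(), pa.float64()], ["x"],
                                    [pa.int64(), pa.float64()])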
I have two issues here:
1. The API is ugly.
2. One cannot register different implementations for different input types. This might be an annoying limitation (and/or a performance issue if people work around it by checking types in Python).
> 2. One cannot register different implementations for different input types. This might be an annoying limitation (and/or a performance issue if people work around it by checking types in Python).
Assuming the actual UDF implementation is in C, I think the "python UDF" is just mapping to the appropriate external library call, and the external library will handle the type checking itself. For example, I know we had an example at one point where we registered numpy's gcd function as a UDF. Since numpy does its own type dispatch, we could use the same kernel for all integral input types.
If the actual UDF implementation is in Python, then I think performance doesn't matter in this context (e.g. prototyping), and the author is probably converting with something like to_pylist, so one callable would be able to handle all numeric types (e.g. in C we have int32_t and float, but in Python both would simply be numbers).
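As a concrete illustration of the numpy gcd case (a hedged sketch; the function names are hypothetical and the registration signature follows the dict-based form this PR started from, which was still under discussion):

import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

def gcd_udf(ctx, x, y):
    # numpy dispatches on dtype itself, so the same callable
    # handles every integral input type.
    return pa.array(np.gcd(x.to_numpy(zero_copy_only=False),
                           y.to_numpy(zero_copy_only=False)))

doc = {"summary": "greatest common divisor",
       "description": "gcd via numpy; one callable for all integer types"}

# Without multi-kernel support, one registration (and one name) per type;
# with it, all of these could live under a single "gcd" function.
for typ in [pa.int32(), pa.int64()]:
    pc.register_scalar_function(gcd_udf, f"gcd_{typ}", doc,
                                {"x": typ, "y": typ}, typ)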
> 1. The API is ugly.
I'm not sure how best to address this. If I were to start from scratch, I think I'd end up with something a bit more complex but hopefully more user-friendly (note, this proposal addresses your point 2 as well):
class Udf(object):
    """
    A user defined function that can be registered with pyarrow so that
    it can be used in expression evaluation.

    Parameters
    ----------
    name : str
        The name of the function; function names must be unique
    kernels : List[UdfKernel]
        One or more kernels which map combinations of input types
        to python callables that provide the function implementation
    doc : UdfDocumentation
        Optional object describing the function. Can be used in
        interactive contexts to provide help to a user creating expressions
    """

class UdfDocumentation(object):
    """
    Documentation that describes a user defined function. This can be
    displayed as helpful information to users that are authoring compute
    expressions interactively.

    Parameters
    ----------
    summary : str, optional
        A brief summary of the function, defaults to the function name
    description : str, optional
        An extended description that can be used to describe function details.
        Defaults to an empty string.
    arg_names : List[str], optional
        Names to give the arguments. Defaults to ["arg0", "arg1", ...]
    """

class UdfKernel(object):
    """
    A mapping from input types to a python callable. The same python callable
    can be used in multiple `UdfKernel` instances if the callable can handle
    arrays of different types.

    During execution of an expression Arrow will pick the appropriate kernel
    based on the types of the arguments. A kernel will only be selected if the
    kernel's input types match the argument types exactly.

    All kernels within a Udf must have the same number of input types.

    Parameters
    ----------
    input_types : List[DataType]
        The input types expected by `exec`
    output_type : DataType
        The output type of the array returned by `exec`
    exec : Callable
        The python callable to execute when this kernel is invoked
        <description of how the callable will be invoked>
    """
The example would then become:
def test_multi_kernel_registration():
    def unary_function(ctx, x):
        return pc.cast(pc.call_function("multiply", [x, 2],
                                        memory_pool=ctx.memory_pool), x.type)

    func_name = "y=x*2"
    input_types = [
        pa.int8(),
        pa.int16(),
        pa.int32(),
        pa.int64(),
        pa.float32(),
        pa.float64()
    ]
    kernels = [UdfKernel([in_type], in_type, unary_function)
               for in_type in input_types]
    func_doc = UdfDocumentation("multiply by two function",
                                "test multiply function", ["array"])
    udf = Udf(func_name, kernels, doc=func_doc)
    pc.register_scalar_function(udf)
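For completeness, once registered, the function would presumably be invoked by name like any other compute function (a hypothetical usage line; pc.call_function is the existing mechanism for calling named functions):

result = pc.call_function("y=x*2", [pa.array([1, 2, 3], type=pa.int64())])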
@pitrou would that be a more favorable API, or could you describe in more detail what you would like to change?
One other possibility would be to first create the function and then register kernels one by one:
udf = pc.register_scalar_function(
    name="y=x*2",
    arg_names=["x"],
    doc={"summary": "multiply by two function",
         "description": "test multiply function"}
)

for in_type in [pa.int64(), pa.float64()]:
    udf.add_kernel([in_type], in_type, impl)
It could also open the way to a decorator-like syntax:
udf = pc.register_scalar_function(
    name="y=x*2",
    arg_names=["x"],
    doc={"summary": "multiply by two function",
         "description": "test multiply function"}
)

@udf.kernel([pa.float64()], pa.float64())
@udf.kernel([pa.int64()], pa.int64())
def impl(ctx, x):
    # ...
    pass
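For what it's worth, the decorator form could be implemented along these lines (a minimal hypothetical sketch of the proposed udf.kernel; each call returns a decorator that records the kernel and returns the callable unchanged, which is what lets the decorators stack):

class Udf:
    def __init__(self):
        self.kernels = []  # (input_types, output_type, callable) triples

    def kernel(self, input_types, output_type):
        # Hypothetical decorator factory: record the kernel, then hand
        # the callable back unchanged so multiple decorators can stack.
        def decorator(func):
            self.kernels.append((input_types, output_type, func))
            return func
        return decorator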
@jorisvandenbossche Opinions on which kind of API would be desirable here?
@pitrou what approach should we take here? I completely understand your concern and will make the necessary changes to improve this.
Also, one question: since this API is still experimental, do you think it is best to improve it right now, or in a follow-up PR?
Closing because it has been untouched for a while; in case it's still relevant, feel free to reopen and move it forward 👍