ARROW-16212: [C++][Python] Register Multiple Kernels for a UDF
This PR adds support for registering multiple kernels for a single scalar UDF in Python.
cc @pitrou @jorisvandenbossche @westonpace
Could you please review this PR?
https://issues.apache.org/jira/browse/ARROW-16212
@vibhatha Can you please focus on Arrow 10.0.0 release blockers for now? Thank you.
Of course, thanks for the ping. This PR was picked up before that issue was assigned (yesterday), and I have just finalized it. I am now working on the issue I picked up yesterday.
cc @westonpace
A few nits.
My main concern is that, rather than having input_types be a dictionary, I think it would be best to have the argument names as a separate field. Otherwise it forces the user to grapple with questions like "why is it a dictionary? What am I supposed to use for the key? Can different rows have different keys?" when the API could be more intuitive:

def test_multi_kernel_registration():
    def unary_function(ctx, x):
        return pc.cast(pc.call_function("multiply", [x, 2],
                                        memory_pool=ctx.memory_pool), x.type)

    func_name = "y=x*2"
    unary_doc = {"summary": "multiply by two function",
                 "description": "test multiply function"}
    input_types = [
        pa.int8(),
        pa.int16(),
        pa.int32(),
        pa.int64(),
        pa.float32(),
        pa.float64()
    ]
    input_names = ["array"]
    output_types = [
        pa.int8(),
        pa.int16(),
        pa.int32(),
        pa.int64(),
        pa.float32(),
        pa.float64()
    ]
    pc.register_scalar_function(unary_function, func_name, unary_doc,
                                input_types, input_names, output_types)
I see your point, but I have a counter-argument (just my opinion).
If we keep a separate list, it becomes unclear which name goes with which type; keeping them together makes it readable. We had a case with aggregates where keeping them separate made things confusing. I am not saying this is exactly the same situation, but it is close enough. I would rather propose clearly documenting what we do, with a few examples in the docstring.
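To make the two shapes concrete (a rough sketch based on the example above; the exact parameter names were still in flux in this PR):

# Dictionary form: argument names and input types travel together,
# so there is no separate length to validate.
input_types = {"array": pa.int64()}

# Separate-fields form: parallel sequences whose lengths must match.
input_types = [pa.int64()]
input_names = ["array"]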
We spoke about this briefly, but I will add my comment here as well. I get your point about the aggregates; having two vectors that must match in length is not an ideal API. However, I still think it is better than having to constantly repeat ourselves. I'm not 100% against it, so if we want to keep the status quo we can.
Yes that's a fair argument. I will work on it.
@westonpace I updated the PR and added a test case to validate that the lengths of input_types and input_names match.
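For reference, a minimal sketch of what such a validation test might look like (hypothetical: it assumes a mismatch raises ValueError and uses the separate-fields signature discussed above, neither of which is confirmed by the final implementation):

import pytest
import pyarrow as pa
import pyarrow.compute as pc

def test_input_type_name_length_mismatch():
    def unary_function(ctx, x):
        return x

    doc = {"summary": "identity", "description": "identity function"}
    # Two input types but only one name: registration should fail.
    with pytest.raises(ValueError):
        pc.register_scalar_function(unary_function, "identity", doc,
                                    [pa.int64(), pa.float64()], ["x"],
                                    [pa.int64(), pa.float64()])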
I have two issues here:
1. The API is ugly.
2. One cannot register different implementations for different input types. This might be an annoying limitation (and/or a performance issue if people work around it by checking types in Python).
> 2. One cannot register different implementations for different input types. This might be an annoying limitation (and/or a performance issue if people work around it by checking types in Python).
Assuming the actual UDF implementation is in C, I think the "python UDF" is just mapping to the appropriate external library call, and the external library will handle the type checking itself. For example, I know we had an example at one point where we registered numpy's gcd function as a UDF. Since numpy does its own type dispatch, we could use the same kernel for all integral input types.
If the actual UDF implementation is in Python, then I think performance doesn't matter in this context (e.g. prototyping), and the author is probably converting with something like to_pylist, so one callable would be able to handle all numeric types (e.g. in C we have int32_t and float, but in Python both would simply be numbers).
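As a concrete illustration of the numpy gcd case (a hedged sketch; the function names are hypothetical and the registration signature follows the dict-based form this PR started from, which was still under discussion):

import numpy as np
import pyarrow as pa
import pyarrow.compute as pc

def gcd_udf(ctx, x, y):
    # numpy dispatches on dtype itself, so the same callable
    # handles every integral input type.
    return pa.array(np.gcd(x.to_numpy(zero_copy_only=False),
                           y.to_numpy(zero_copy_only=False)))

doc = {"summary": "greatest common divisor",
       "description": "gcd via numpy; one callable for all integer types"}

# Without multi-kernel support, one registration (and one name) per type;
# with it, all of these could live under a single "gcd" function.
for typ in [pa.int32(), pa.int64()]:
    pc.register_scalar_function(gcd_udf, f"gcd_{typ}", doc,
                                {"x": typ, "y": typ}, typ)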
> 1. The API is ugly.
I'm not sure how best to address this. If I were to start from scratch, I think I'd end up with something a bit more complex but hopefully more user-friendly (note, this proposal addresses your point 2 as well):
class Udf(object):
    """
    A user defined function that can be registered with pyarrow so that
    it can be used in expression evaluation.

    Parameters
    ----------
    name : str
        The name of the function; function names must be unique
    kernels : List[UdfKernel]
        One or more kernels which map combinations of input types
        to python callables that provide the function implementation
    doc : UdfDocumentation
        Optional object describing the function. Can be used in
        interactive contexts to provide help to a user creating expressions
    """

class UdfDocumentation(object):
    """
    Documentation that describes a user defined function. This can be
    displayed as helpful information to users that are authoring compute
    expressions interactively.

    Parameters
    ----------
    summary : str, optional
        A brief summary of the function, defaults to the function name
    description : str, optional
        An extended description that can be used to describe function details.
        Defaults to an empty string.
    arg_names : List[str], optional
        Names to give the arguments. Defaults to ["arg0", "arg1", ...]
    """

class UdfKernel(object):
    """
    A mapping from input types to a python callable. The same python callable
    can be used in multiple `UdfKernel` instances if the callable can handle
    arrays of different types.

    During execution of an expression Arrow will pick the appropriate kernel
    based on the types of the arguments. A kernel will only be selected if the
    kernel's input types match the argument types exactly.

    All kernels within a Udf must have the same number of input types.

    Parameters
    ----------
    input_types : List[DataType]
        The input types expected by `exec`
    output_type : DataType
        The output type of the array returned by `exec`
    exec : Callable
        The python callable to execute when this kernel is invoked
        <description of how the callable will be invoked>
    """
The example would then become:
def test_multi_kernel_registration():
    def unary_function(ctx, x):
        return pc.cast(pc.call_function("multiply", [x, 2],
                                        memory_pool=ctx.memory_pool), x.type)

    func_name = "y=x*2"
    input_types = [
        pa.int8(),
        pa.int16(),
        pa.int32(),
        pa.int64(),
        pa.float32(),
        pa.float64()
    ]
    kernels = [UdfKernel([in_type], in_type, unary_function)
               for in_type in input_types]
    func_doc = UdfDocumentation("multiply by two function",
                                "test multiply function", ["array"])
    udf = Udf(func_name, kernels, doc=func_doc)
    pc.register_scalar_function(udf)
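For completeness, once registered, the function would presumably be invoked by name like any other compute function (a hypothetical usage line; pc.call_function is the existing mechanism for calling named functions):

result = pc.call_function("y=x*2", [pa.array([1, 2, 3], type=pa.int64())])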
@pitrou would that be a more favorable API, or could you describe in more detail what you would like to change?
One other possibility would be to first create the function and then register kernels one by one:
udf = pc.register_scalar_function(
    name="y=x*2",
    arg_names=["x"],
    doc={"summary": "multiply by two function",
         "description": "test multiply function"}
)

for in_type in [pa.int64(), pa.float64()]:
    udf.add_kernel([in_type], in_type, impl)
It could also open the way to a decorator-like syntax:
udf = pc.register_scalar_function(
    name="y=x*2",
    arg_names=["x"],
    doc={"summary": "multiply by two function",
         "description": "test multiply function"}
)

@udf.kernel([pa.float64()], pa.float64())
@udf.kernel([pa.int64()], pa.int64())
def impl(ctx, x):
    # ...
    pass
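For what it's worth, the decorator form could be implemented along these lines (a minimal hypothetical sketch of the proposed udf.kernel; each call returns a decorator that records the kernel and returns the callable unchanged, which is what lets the decorators stack):

class Udf:
    def __init__(self):
        self.kernels = []  # (input_types, output_type, callable) triples

    def kernel(self, input_types, output_type):
        # Hypothetical decorator factory: record the kernel, then hand
        # the callable back unchanged so multiple decorators can stack.
        def decorator(func):
            self.kernels.append((input_types, output_type, func))
            return func
        return decorator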
@jorisvandenbossche Opinions on which kind of API would be desirable here?
@pitrou what approach should we take here? I completely understand your concern and will make the necessary changes to improve this.
Also, one question: since this API is still experimental, do you think it is best to improve it right now, or in a follow-up PR?
Closing because it has been untouched for a while; in case it's still relevant, feel free to reopen and move it forward 👍