feat: support pyarrow UDFs for pyspark backend
Is your feature request related to a problem?
PySpark now supports Arrow-optimized Python UDFs, which keep row-at-a-time semantics but use Arrow for efficient data exchange, e.g.:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()

@udf(returnType="int", useArrow=True)
def add_one(x: int) -> int:
    return x + 1

# Create a column using the Arrow-optimized UDF
df = pd.DataFrame({"a": [1, 2, 3]})
dfs = spark.createDataFrame(df)
dfs.withColumn("b", add_one("a")).show()
```
However, the equivalent expression written with ibis raises a `NotImplementedError`, because only pandas-based vectorized UDFs are currently supported for the PySpark backend:

```python
import ibis
from ibis import _

@ibis.udf.scalar.pyarrow
def add_one_pyarrow(x: int) -> int:
    return x + 1

@ibis.udf.scalar.pandas
def add_one_pandas(x: int) -> int:
    return x + 1

con = ibis.pyspark.connect(spark)
con.create_table("df", df, format="delta", overwrite=True)
table = con.table("df")
table.mutate(b=add_one_pandas(_.a)).execute()   # works
table.mutate(b=add_one_pyarrow(_.a)).execute()  # raises NotImplementedError
```
What is the motivation behind your request?
Pandas-based UDFs are not supported by the DuckDB backend, but Arrow-based ones are. For my use case, I would like to keep as much parity as possible between the two backends, so being able to use Arrow-based UDFs on a PySpark table would be very useful.
Describe the solution you'd like
I'd like a solution that allows an Arrow-based UDF to be used on a PySpark table.
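For illustration only, below is a minimal sketch of one conceivable way the pyarrow UDF above could be executed on Spark: wrap the pyarrow-based function in a Spark `pandas_udf` and convert between pandas Series and pyarrow arrays at the boundary. The helper name `wrap_pyarrow_udf` and the `"int"` return type are assumptions of mine, not existing ibis or PySpark APIs, and this is not a proposal for how the backend should actually implement it.

```python
# Illustrative sketch only: execute a pyarrow-array-based scalar function on
# Spark by wrapping it in a pandas_udf and converting pandas <-> pyarrow at
# the boundary. `wrap_pyarrow_udf` is a hypothetical helper, not a real API.
import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc
from pyspark.sql.functions import pandas_udf

def wrap_pyarrow_udf(func, spark_return_type: str):
    @pandas_udf(spark_return_type)
    def wrapper(s: pd.Series) -> pd.Series:
        # Hand the batch to the user's function as a pyarrow Array...
        result = func(pa.Array.from_pandas(s))
        # ...and convert the pyarrow result back to a pandas Series for Spark.
        return pd.Series(result.to_pandas())
    return wrapper

# Hypothetical usage (the lambda plays the role of add_one_pyarrow):
# dfs.withColumn("b", wrap_pyarrow_udf(lambda a: pc.add(a, 1), "int")("a")).show()
```

Whether the backend would go through `pandas_udf`, Spark's `useArrow=True` UDFs, or something else entirely is of course an implementation detail for the maintainers; the request is only that `@ibis.udf.scalar.pyarrow` functions work on PySpark tables.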
What version of ibis are you running?
9.0.0.dev686
What backend(s) are you using, if any?
PySpark
Code of Conduct
- [X] I agree to follow this project's Code of Conduct