ibis icon indicating copy to clipboard operation
ibis copied to clipboard

feat(api): Vectorized UDAFs

Open ogrisel opened this issue 2 years ago • 12 comments

duckdb does not support scalar User Defined Functions written in Python (to be applied one record at a time) but it does expose a vector Python UDF via the map method method:

>>> import pandas as pd
>>> import duckdb
>>> df = pd.DataFrame({"x": range(int(1e4))})
>>> def process_chunk(df_chunk):
...     print(f"processing chunk of size {df_chunk.shape[0]}")
...     return df_chunk * 2
... 
>>> duckdb.from_df(df).map(process_chunk).to_df()
processing chunk of size 0
processing chunk of size 0
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 1024
processing chunk of size 784
          x
0         0
1         2
2         4
3         6
4         8
...     ...
9995  19990
9996  19992
9997  19994
9998  19996
9999  19998

[10000 rows x 1 columns]

The main motivation for this vector Python UDF API is probably to hide the per-record Python function call overhead. I think it's a pragmatic API and it would allow to efficiently deploy trained machine learning models for batch scoring in out-of-core manner for instance.

Any chance to expose such vector Python UDFs via the Ibis API?

Also if some backends include or add support Python UDAFs (especially with in parallel via combiners in addtion to mappers and reducers), this would open the possibility to train machine learning models (e.g. with scikit-learn or Pytorch) directly via Ibis. As far as I know, duckdb does not expose parallel Python UDAFs unfortunately.

Final side-request: for backends who only support scalar UDFs, would it be possible for Ibis to generate the SQL required to do the chunking itself and expose a vector UDF API to hide the Python function call overhead similarly to what duckdb is doing internally with map?

ogrisel avatar Oct 23 '22 14:10 ogrisel