datafusion
datafusion copied to clipboard
[Epic] Port BuiltInFunctons to `datafusion-functions-*` crates
Is your feature request related to a problem or challenge?
As part of making DataFusion even more customizable (https://github.com/apache/arrow-datafusion/issues/8045), it is valuable to let system designers mix and match different packages of functions to get the precise behavior they want (e.g. postgres style to_date
or spark style to_date
).
To support this functionality as well as to ensure the ScalarUDF
API exposes the full power of DataFusion, we are in the process of extracting the "built in" functions out of the core and into separate crates.
This epic tracks the work to actually move the functions out of the core datafusion crate (spread through datafusion_expr
and datafusion-physical-expr
and into the new datafusion-functions
/ datafusion-functions-array
crates
Tasks:
Here is list of many of the items necessary to complete this transition. Eventually there should be tickets for all tasks, and there are tickets for some already, but I don't want to make 100s of tickets all at once. I plan to make more as we make it through more of this project.
Anyone should feel free to make other tickets if they want to help with items below.
math_expressions
These should be located in the datafusion-functions
crate (source link)
Code location: https://github.com/apache/arrow-datafusion/blob/main/datafusion/functions/src/math/mod.rs
- [x] https://github.com/apache/arrow-datafusion/pull/9216
- [ ] https://github.com/apache/arrow-datafusion/issues/9286
- [ ] Abs, Acos, Asin,
- [ ] Atan, Atan2, Acosh, Asinh, Atanh,
- [ ] Cbrt, Ceil, Cos, Cosh, Degrees, Exp, Factorial,
- [ ] Floor, Gcd, Lcm, Ln, Log, Log10, Log2, Pi, Power,
- [ ] Radians, Signum, Sin, Sinh, Sqrt,
- [ ] Tan, Tanh, Trunc, Cot, Round, iszero
array_expressions
Note that given the size and specialization of these functions are put in their own subcrate, datafusion-functions-array
- [x] ArrayToString https://github.com/apache/arrow-datafusion/pull/9113
- [ ] Move the
make_array
function into thedatafusion-array-expressions
crate: https://github.com/apache/arrow-datafusion/issues/9288 - [ ] ArrayAppend, ArraySort, ArrayConcat, ArrayHas, ArrayHasAll, ArrayHasAny,
- [ ] ArrayPopFront, ArrayPopBack, ArrayDims, ArrayDistinct, ArrayElement,
- [ ] ArrayEmpty, ArrayLength, ArrayNdims, ArrayPosition, ArrayPositions,
- [ ] ArrayPrepend, ArrayRemove, ArrayRemoveN, ArrayRemoveAll, ArrayRepeat,
- [ ] ArrayReplace, ArrayReplaceN, ArrayReplaceAll, ArraySlice,
- [ ] ArrayIntersect, ArrayUnion, ArrayExcept,
- [ ] Cardinality, ArrayResize, Flatten, Range, StringToArray,
- [ ]
MakeArray
: construct an array from columns (union/except depends on this)
Core functions
These should be located in the datafusion-functions
crate (source link)
Code location: https://github.com/apache/arrow-datafusion/blob/main/datafusion/functions/src/core/mod.rs
- [x] Create
core
module, extractnullif
: https://github.com/apache/arrow-datafusion/pull/9216 - [ ] https://github.com/apache/arrow-datafusion/issues/9287
- [ ] Move
ArrowTypeOf
: return the arrow type of a value - [ ]
Coalesce
: return the first non-null value - [ ]
Struct
: Create a struct - [ ]
NullIf
: return null if the two values are equal - [ ]
Random
: return a random number - [ ]
Nanvl
: return the first non-NaN value
crypto_expressions
These should be located in the datafusion-functions
crate (source link)
Code location: https://github.com/apache/arrow-datafusion/blob/main/datafusion/functions/src/crypto/mod.rs
- [ ] Create
crypto
module indatafusion/functions/src/crypto
andcrypto_expressions
feature flag, movedigest
function - [ ] Digest, MD5, SHA224, SHA256, SHA384, SHA512
string_expressions
These should be located in the datafusion-functions
crate (source link)
Code location: https://github.com/apache/arrow-datafusion/blob/main/datafusion/functions/src/string/mod.rs
- [ ] Create
string
module indatafusion/functions/src/string
andstring_expressions
feature flag, moveascii
function - [ ] ascii, bit_length, btrim, chr,
- [ ] concat, concat_ws, ends_with, initcap,
- [ ] instr, lower, ltrim, octet_length,
- [ ] repeat, replace, rtrim, split_part,
- [ ] starts_with, to_hex, trim, upper,
- [ ] levenshtein, uuid, overlay
unicode_expressions
These should be located in the datafusion-functions
crate (source link)
Code location: https://github.com/apache/arrow-datafusion/blob/main/datafusion/functions/src/unicode/mod.rs
- [ ] Create
unicode
module indatafusion/functions/src/unicode
andunicode_expressions
feature flag, movecharlength
function - [ ] CharLength,
- [ ] Left, Lpad, Reverse, Right, Rpad,
- [ ] Strpos, Substr,
- [ ] Translate, SubstrIndex, FindInSet
regex_expressions
These should be located in the datafusion-functions
crate (source link)
Code location: https://github.com/apache/arrow-datafusion/blob/main/datafusion/functions/src/regexp/mod.rs
- [ ] Create
regex
module indatafusion/functions/src/regex
andregex_expressions
feature flag, move regexp_match - [ ] RegexpMatch, RegexpReplace
- [ ] RegexpLike
datetime_expressions
These should be located in the datafusion-functions
crate (source link)
Code location: https://github.com/apache/arrow-datafusion/blob/main/datafusion/functions/src/datetime/mod.rs
- [ ] Create
datetime
module indatafusion/functions/src/datetime
anddatetime_expressions
feature flag, movedate_part
- [ ] port benchmarks to datafusion-functions crate
- [ ] date_part, date_trunc, date_bin,
- [ ] to_timestamp, to_timestamp_millis, to_timestamp_micros, to_timestamp_nanos, to_timestamp_seconds,
- [ ] from_unixtime, now, current_date, current_time
Infrastructure
- [ ] https://github.com/apache/arrow-datafusion/issues/9289
Describe alternatives you've considered
No response
Additional context
The organization was discussed in https://github.com/apache/arrow-datafusion/issues/9100
Let me give this a try:
math_expressions
[ ] Abs, Acos, Asin,
Update: Abs
has been taken in #9286, I will handle the remaining 2 fns
I'll try to do regex_expressions parts.
I'll try to do array_expressions parts.
I filed #9336 which I think may be necessary before we port functions.
work on array_concat
work on
array_concat
hi @jayzhan211 I already work on array_concat
#9343
Oh, sorry. I will work on other
Take ArrayHas, ArrayHasAll, ArrayHasAny
Take Atan, Atan2, Acosh.
Take ArrayPopFront, ArrayPopBack, ArrayDistinct, ArrayElement
Take ArrayDims, ArrayNdims, Cardinality, ArrayNdims #9425
Take ArrayHas, ArrayHasAll, ArrayHasAny
I think this needs to wait until make_array
is done and also move to rewrite to the optimizer.
Take ArrayIntersect, ArrayUnion, ArrayExcept
Edit: general_set_op needs make_array too
Take: ArraySlice + ArrayElement
While I worked on ArrayElement, I found that GetFieldAccessExpr::ListIndex
uses array_element
in physical-expr https://github.com/apache/arrow-datafusion/blob/581fd98270e69221689ddd4c37566ef620a67167/datafusion/physical-expr/src/expressions/get_indexed_field.rs#L252-L257 ListRange likewise, which is kernel function in function-arrays
crate, instead of converting ArrayRef to Expr, I'm thinking of making array function kernel public. @alamb Do you think we can make those array functions public? And, is importing function-arrays
to crate physical-expr
a concern like crate optimize
is?
If neither is a good idea, we may need to rewrite ListIndex and ListRange to array function early.
ListRange likewise, which is kernel function in
function-arrays
crate, instead of converting ArrayRef to Expr, I'm thinking of making array function kernel public. @alamb Do you think we can make those array functions public? And, is importingfunction-arrays
to cratephysical-expr
a concern like crateoptimize
is?
I think making physical-exprs depend on datafusion-functions would be unfortunate (as it would still mean there could be dependencies / special cases from built in functions to "built in" user defined functions)
If neither is a good idea, we may need to rewrite ListIndex and ListRange to array function early.
I agree if this is possible it would be the best approach from my perspective
I also though about about potentially adding the common code to the new datafusion_common_runtime
crate (that @mustafasrepo added recently)
https://github.com/apache/arrow-datafusion/tree/main/datafusion/common_runtime
But I think that would result in a tokio dependency on datafusion-functions which would be unwise.
I think making physical-exprs depend on datafusion-functions would be unfortunate (as it would still mean there could be dependencies / special cases from built in functions to "built in" user defined functions)
If neither is a good idea, we may need to rewrite ListIndex and ListRange to array function early.
I agree if this is possible it would be the best approach from my perspective
Hmm. I had the same issue with datebin being referenced in physical-expr - though only for a test which likely could be moved elsewhere or somehow adjusted in some other way. The draft pr in fact does just replace it with the moved datebin and thus does import the functions crate. I'll look at updating those tests in some way to remove that dependency.
I also encountered function references in the sql crate that I really can't see a way to work around unless we somehow allow SQLExpr to be a part of a ScalarUdf or similar.
I also encountered function references in the sql crate that I really can't see a way to work around unless we somehow allow SQLExpr to be a part of a ScalarUdf or similar.
I was thinking in the sql parser we can just switch to lookung up functions by name (and the sql planner already has a function registry)
Hmm. I had the same issue with datebin being referenced in physical-expr
Perhaps it is time to make a new crate that contains common function implementation code that is shared by datafusion-functions and datafusion-physical-expr 🤔
datafusion-functions-common
? datafusion-common-functons
?
Maybe we are going too crate crazy to avoid dependencies 🤔
Take crypto parts
Take ArrayEmpty and ArrayLength #9510
Take Flatten
Take ArrowTypeOf
Take Tan, Tanh
take string module and ascii, bit_length, btrim, chr
take string module starts_with, to_hex, trim, upper
take concat, concat_ws, ends_with, initcap,
take Struct in Core
ArraySort
and ArrayDistinct
are also being addressed by following PRs: https://github.com/apache/arrow-datafusion/pull/9551 and https://github.com/apache/arrow-datafusion/pull/9549 so they can also be linked to above epic status table. Thanks.
@erenavsarogullari I think it is fine to split function to other file (even single func per file) to avoid crazy conflict.
@jayzhan211 Thanks for notification. Yes, makes sense. We can define new udf
definitions as separate file (single func per file). WDYT on datafusion/functions-array/src/kernels.rs
(Its size also increases)?
WDYT on
datafusion/functions-array/src/kernels.rs
(Its size also increases)?
let's move them out too!
Take pop_front
and pop_back
, they need to be moved together with slice