datafusion icon indicating copy to clipboard operation
datafusion copied to clipboard

[Epic] Port BuiltInFunctons to `datafusion-functions-*` crates

Open alamb opened this issue 1 year ago • 3 comments

Is your feature request related to a problem or challenge?

As part of making DataFusion even more customizable (https://github.com/apache/arrow-datafusion/issues/8045), it is valuable to let system designers mix and match different packages of functions to get the precise behavior they want (e.g. postgres style to_date or spark style to_date).

To support this functionality as well as to ensure the ScalarUDF API exposes the full power of DataFusion, we are in the process of extracting the "built in" functions out of the core and into separate crates.

This epic tracks the work to actually move the functions out of the core datafusion crate (spread through datafusion_expr and datafusion-physical-expr and into the new datafusion-functions / datafusion-functions-array crates

Tasks:

Here is list of many of the items necessary to complete this transition. Eventually there should be tickets for all tasks, and there are tickets for some already, but I don't want to make 100s of tickets all at once. I plan to make more as we make it through more of this project.

Anyone should feel free to make other tickets if they want to help with items below.

math_expressions

These should be located in the datafusion-functions crate (source link) Code location: https://github.com/apache/arrow-datafusion/blob/main/datafusion/functions/src/math/mod.rs

  • [x] https://github.com/apache/arrow-datafusion/pull/9216
  • [ ] https://github.com/apache/arrow-datafusion/issues/9286
  • [ ] Abs, Acos, Asin,
  • [ ] Atan, Atan2, Acosh, Asinh, Atanh,
  • [ ] Cbrt, Ceil, Cos, Cosh, Degrees, Exp, Factorial,
  • [ ] Floor, Gcd, Lcm, Ln, Log, Log10, Log2, Pi, Power,
  • [ ] Radians, Signum, Sin, Sinh, Sqrt,
  • [ ] Tan, Tanh, Trunc, Cot, Round, iszero

array_expressions

Note that given the size and specialization of these functions are put in their own subcrate, datafusion-functions-array

  • [x] ArrayToString https://github.com/apache/arrow-datafusion/pull/9113
  • [ ] Move the make_array function into the datafusion-array-expressions crate: https://github.com/apache/arrow-datafusion/issues/9288
  • [ ] ArrayAppend, ArraySort, ArrayConcat, ArrayHas, ArrayHasAll, ArrayHasAny,
  • [ ] ArrayPopFront, ArrayPopBack, ArrayDims, ArrayDistinct, ArrayElement,
  • [ ] ArrayEmpty, ArrayLength, ArrayNdims, ArrayPosition, ArrayPositions,
  • [ ] ArrayPrepend, ArrayRemove, ArrayRemoveN, ArrayRemoveAll, ArrayRepeat,
  • [ ] ArrayReplace, ArrayReplaceN, ArrayReplaceAll, ArraySlice,
  • [ ] ArrayIntersect, ArrayUnion, ArrayExcept,
  • [ ] Cardinality, ArrayResize, Flatten, Range, StringToArray,
  • [ ] MakeArray: construct an array from columns (union/except depends on this)

Core functions

These should be located in the datafusion-functions crate (source link) Code location: https://github.com/apache/arrow-datafusion/blob/main/datafusion/functions/src/core/mod.rs

  • [x] Create core module, extract nullif: https://github.com/apache/arrow-datafusion/pull/9216
  • [ ] https://github.com/apache/arrow-datafusion/issues/9287
  • [ ] Move ArrowTypeOf: return the arrow type of a value
  • [ ] Coalesce: return the first non-null value
  • [ ] Struct: Create a struct
  • [ ] NullIf: return null if the two values are equal
  • [ ] Random: return a random number
  • [ ] Nanvl: return the first non-NaN value

crypto_expressions

These should be located in the datafusion-functions crate (source link) Code location: https://github.com/apache/arrow-datafusion/blob/main/datafusion/functions/src/crypto/mod.rs

  • [ ] Create crypto module in datafusion/functions/src/crypto and crypto_expressions feature flag, move digest function
  • [ ] Digest, MD5, SHA224, SHA256, SHA384, SHA512

string_expressions

These should be located in the datafusion-functions crate (source link) Code location: https://github.com/apache/arrow-datafusion/blob/main/datafusion/functions/src/string/mod.rs

  • [ ] Create string module in datafusion/functions/src/string and string_expressions feature flag, move ascii function
  • [ ] ascii, bit_length, btrim, chr,
  • [ ] concat, concat_ws, ends_with, initcap,
  • [ ] instr, lower, ltrim, octet_length,
  • [ ] repeat, replace, rtrim, split_part,
  • [ ] starts_with, to_hex, trim, upper,
  • [ ] levenshtein, uuid, overlay

unicode_expressions

These should be located in the datafusion-functions crate (source link) Code location: https://github.com/apache/arrow-datafusion/blob/main/datafusion/functions/src/unicode/mod.rs

  • [ ] Create unicode module in datafusion/functions/src/unicode and unicode_expressions feature flag, move charlength function
  • [ ] CharLength,
  • [ ] Left, Lpad, Reverse, Right, Rpad,
  • [ ] Strpos, Substr,
  • [ ] Translate, SubstrIndex, FindInSet

regex_expressions

These should be located in the datafusion-functions crate (source link) Code location: https://github.com/apache/arrow-datafusion/blob/main/datafusion/functions/src/regexp/mod.rs

  • [ ] Create regex module in datafusion/functions/src/regex and regex_expressions feature flag, move regexp_match
  • [ ] RegexpMatch, RegexpReplace
  • [ ] RegexpLike

datetime_expressions

These should be located in the datafusion-functions crate (source link) Code location: https://github.com/apache/arrow-datafusion/blob/main/datafusion/functions/src/datetime/mod.rs

  • [ ] Create datetime module in datafusion/functions/src/datetime and datetime_expressions feature flag, move date_part
  • [ ] port benchmarks to datafusion-functions crate
  • [ ] date_part, date_trunc, date_bin,
  • [ ] to_timestamp, to_timestamp_millis, to_timestamp_micros, to_timestamp_nanos, to_timestamp_seconds,
  • [ ] from_unixtime, now, current_date, current_time

Infrastructure

  • [ ] https://github.com/apache/arrow-datafusion/issues/9289

Describe alternatives you've considered

No response

Additional context

The organization was discussed in https://github.com/apache/arrow-datafusion/issues/9100

alamb avatar Feb 20 '24 07:02 alamb

Let me give this a try:

math_expressions

[ ] Abs, Acos, Asin,

Update: Abs has been taken in #9286, I will handle the remaining 2 fns

SteveLauC avatar Feb 21 '24 00:02 SteveLauC

I'll try to do regex_expressions parts.

Lordworms avatar Feb 21 '24 16:02 Lordworms

I'll try to do array_expressions parts.

guojidan avatar Feb 23 '24 11:02 guojidan

I filed #9336 which I think may be necessary before we port functions.

yyy1000 avatar Feb 24 '24 20:02 yyy1000

work on array_concat

jayzhan211 avatar Feb 27 '24 11:02 jayzhan211

work on array_concat

hi @jayzhan211 I already work on array_concat #9343

guojidan avatar Feb 27 '24 11:02 guojidan

Oh, sorry. I will work on other

jayzhan211 avatar Feb 27 '24 11:02 jayzhan211

Take ArrayHas, ArrayHasAll, ArrayHasAny

jayzhan211 avatar Feb 27 '24 12:02 jayzhan211

Take Atan, Atan2, Acosh.

SteveLauC avatar Feb 28 '24 02:02 SteveLauC

Take ArrayPopFront, ArrayPopBack, ArrayDistinct, ArrayElement

Weijun-H avatar Feb 29 '24 10:02 Weijun-H

Take ArrayDims, ArrayNdims, Cardinality, ArrayNdims #9425

Weijun-H avatar Mar 02 '24 04:03 Weijun-H

Take ArrayHas, ArrayHasAll, ArrayHasAny

I think this needs to wait until make_array is done and also move to rewrite to the optimizer.

Take ArrayIntersect, ArrayUnion, ArrayExcept

Edit: general_set_op needs make_array too

Take: ArraySlice + ArrayElement

jayzhan211 avatar Mar 04 '24 11:03 jayzhan211

While I worked on ArrayElement, I found that GetFieldAccessExpr::ListIndex uses array_element in physical-expr https://github.com/apache/arrow-datafusion/blob/581fd98270e69221689ddd4c37566ef620a67167/datafusion/physical-expr/src/expressions/get_indexed_field.rs#L252-L257 ListRange likewise, which is kernel function in function-arrays crate, instead of converting ArrayRef to Expr, I'm thinking of making array function kernel public. @alamb Do you think we can make those array functions public? And, is importing function-arrays to crate physical-expr a concern like crate optimize is?

If neither is a good idea, we may need to rewrite ListIndex and ListRange to array function early.

jayzhan211 avatar Mar 04 '24 14:03 jayzhan211

ListRange likewise, which is kernel function in function-arrays crate, instead of converting ArrayRef to Expr, I'm thinking of making array function kernel public. @alamb Do you think we can make those array functions public? And, is importing function-arrays to crate physical-expr a concern like crate optimize is?

I think making physical-exprs depend on datafusion-functions would be unfortunate (as it would still mean there could be dependencies / special cases from built in functions to "built in" user defined functions)

If neither is a good idea, we may need to rewrite ListIndex and ListRange to array function early.

I agree if this is possible it would be the best approach from my perspective

I also though about about potentially adding the common code to the new datafusion_common_runtime crate (that @mustafasrepo added recently)

https://github.com/apache/arrow-datafusion/tree/main/datafusion/common_runtime

But I think that would result in a tokio dependency on datafusion-functions which would be unwise.

alamb avatar Mar 04 '24 22:03 alamb

I think making physical-exprs depend on datafusion-functions would be unfortunate (as it would still mean there could be dependencies / special cases from built in functions to "built in" user defined functions)

If neither is a good idea, we may need to rewrite ListIndex and ListRange to array function early.

I agree if this is possible it would be the best approach from my perspective

Hmm. I had the same issue with datebin being referenced in physical-expr - though only for a test which likely could be moved elsewhere or somehow adjusted in some other way. The draft pr in fact does just replace it with the moved datebin and thus does import the functions crate. I'll look at updating those tests in some way to remove that dependency.

I also encountered function references in the sql crate that I really can't see a way to work around unless we somehow allow SQLExpr to be a part of a ScalarUdf or similar.

Omega359 avatar Mar 04 '24 22:03 Omega359

I also encountered function references in the sql crate that I really can't see a way to work around unless we somehow allow SQLExpr to be a part of a ScalarUdf or similar.

I was thinking in the sql parser we can just switch to lookung up functions by name (and the sql planner already has a function registry)

Hmm. I had the same issue with datebin being referenced in physical-expr

Perhaps it is time to make a new crate that contains common function implementation code that is shared by datafusion-functions and datafusion-physical-expr 🤔

datafusion-functions-common ? datafusion-common-functons?

Maybe we are going too crate crazy to avoid dependencies 🤔

alamb avatar Mar 04 '24 23:03 alamb

Take crypto parts

Lordworms avatar Mar 09 '24 01:03 Lordworms

Take ArrayEmpty and ArrayLength #9510

Weijun-H avatar Mar 09 '24 03:03 Weijun-H

Take Flatten

Weijun-H avatar Mar 10 '24 03:03 Weijun-H

Take ArrowTypeOf

yyy1000 avatar Mar 10 '24 04:03 yyy1000

Take Tan, Tanh

ongchi avatar Mar 10 '24 11:03 ongchi

take string module and ascii, bit_length, btrim, chr

PsiACE avatar Mar 10 '24 12:03 PsiACE

take string module starts_with, to_hex, trim, upper

Tangruilin avatar Mar 10 '24 15:03 Tangruilin

take concat, concat_ws, ends_with, initcap,

Tangruilin avatar Mar 10 '24 15:03 Tangruilin

take Struct in Core

yyy1000 avatar Mar 10 '24 17:03 yyy1000

ArraySort and ArrayDistinct are also being addressed by following PRs: https://github.com/apache/arrow-datafusion/pull/9551 and https://github.com/apache/arrow-datafusion/pull/9549 so they can also be linked to above epic status table. Thanks.

erenavsarogullari avatar Mar 11 '24 06:03 erenavsarogullari

@erenavsarogullari I think it is fine to split function to other file (even single func per file) to avoid crazy conflict.

jayzhan211 avatar Mar 13 '24 03:03 jayzhan211

@jayzhan211 Thanks for notification. Yes, makes sense. We can define new udf definitions as separate file (single func per file). WDYT on datafusion/functions-array/src/kernels.rs (Its size also increases)?

erenavsarogullari avatar Mar 13 '24 04:03 erenavsarogullari

WDYT on datafusion/functions-array/src/kernels.rs (Its size also increases)?

let's move them out too!

jayzhan211 avatar Mar 13 '24 05:03 jayzhan211

Take pop_front and pop_back, they need to be moved together with slice

jayzhan211 avatar Mar 15 '24 00:03 jayzhan211