incubator-gluten [VL] Unsupported spark function list [please leave a comment if you plan to pick some]

Description

Listed here are the Spark functions not yet supported by the Gluten Velox backend. Please leave a comment if you'd like to pick some. In the list below, [x] means someone is already working on the corresponding function. You can find the support status of all functions in this Gluten doc.

To avoid duplicate work, before starting, please check whether a PR has already been submitted in the Velox community, or whether the function is already implemented in Velox, which holds most SQL functions in its sparksql and prestosql folders.

Reference:

[x] percentile_approx/approx_percentile (WIP, guangxin)
[x] concat_ws (PR ready, https://github.com/facebookincubator/velox/pull/8854)
[x] unix_timestamp: "Only supports string type, with session timezone considered, todo: support date type"
[x] locate
[x] parse_url (PR drafted, not merged)
[x] urldecoder: "UDF, supported by spark as a built-in function since 3.4.0."
[ ] normalizenanandzero
[x] arrayintersects
[ ] default.json_split (udf, no need to impl.): "external UDF"
[ ] parsejsonarray: "external UDF"
[x] struct
[x] percentile (@Yohahaha)
[x] first/first_value (@JkSelf)
[x] last/last_value (@JkSelf)
[x] posexplode (WIP, @marin-ma)
[x] trunc (WIP, HannanKan)
[x] months_between (PR ready)
[x] date_trunc (WIP, HannanKan)
[ ] stack
[ ] grouping_id
[x] printf (@Surbhi-Vijay)
[x] space (WIP, rhh777)
[x] inline (WIP, @marin-ma)
[x] to_unix_timestamp: "Only supports string type, with session timezone considered. todo: support date type"
[ ] from_csv
[ ] from_json
[ ] json_object_keys
[ ] json_tuple
[ ] schema_of_csv
[ ] schema_of_json
[ ] to_csv
[x] to_json (presumably workable with a folly function)
[x] make_ym_interval (WIP, @marin-ma)
[x] make_timestamp (WIP, @marin-ma)
[ ] make_interval
[ ] make_dt_interval
[ ] monotonically_increasing_id
[x] from_utc_timestamp (@acvictor)
[ ] extract
[ ] exists (@lyy-pineapple)
[ ] date_part
[ ] zip_with
[x] transform (@Yohahaha)
[ ] transform_keys
[ ] transform_values
[x] map_from_entries (WIP, MaYan)
[x] map_filter (WIP, MaYan)
[x] map_entries (Done, by MaYan)
[ ] map_concat
[x] forall (@lyy-pineapple)
[x] flatten (@ivoson)
[ ] filter
[x] filter (array) (@ivoson)
[ ] width_bucket
[x] array_sort (@boneanxs)
[ ] xpath
[ ] xpath_boolean
[ ] xpath_double
[ ] xpath_float
[ ] xpath_int
[ ] xpath_long
[ ] xpath_number
[ ] xpath_short
[ ] xpath_string
[ ] unbase64 (WIP, @fyp711)
[ ] decode (partially supported if translated to caseWhen. WIP Cody)
[ ] initcap (WIP, velox PR: 8676)
[x] unix_date (velox PR 8725, completed)
[ ] count_min_sketch
[x] bool_and/every (@mskapilks)
[x] bool_or/any/some (@mskapilks)
[x] shuffle (completed)
[x] bround (@xumingming)
[x] format_string (@gaoyangxiaozhu)
[x] format_number (@gaoyangxiaozhu)
[x] soundex (@zhli1142015)
[x] levenshtein (@zhli1142015)
[x] cot (@honeyhexin)
[x] expm1 (@Donvi)
[x] stack (generator function, @xumingming)
[x] randn (@Donvi)
[x] empty2null (internal function, @jinchengchenghh)
[x] toprettystring (internal function, @jinchengchenghh)
[x] AtLeastNNonNulls (internal function, @zhli1142015)

Since Spark-3.3 (related to ML, low priority)

[ ] regr_count
[ ] regr_avgx
[ ] regr_avgy
[x] regr_r2
[ ] regr_sxx
[x] regr_sxy
[ ] regr_syy
[ ] regr_slope
[ ] regr_intercept

Since Spark-3.3
Since Spark-3.4

[ ] mode
[x] get (@Yohahaha)
[x] array_append (@ivoson)
[x] array_insert (@ivoson)
[x] mode (@zhli1142015)

Dec 14 '23 03:12 PHILO-HE

I'd like to support hex and unhex.

Update: hex and unhex are already supported in Gluten.

Dec 29 '23 03:12 Yohahaha

Hi, I'd like to give the hour function a try.

Jan 03 '24 06:01 zwangsheng

Hi, I'd like to have a look into map_keys

Jan 04 '24 08:01 konjac

Hi, I'd like to support find_in_set in Velox.

Jan 11 '24 09:01 fyp711

Hi, I'd like to support date_trunc/trunc.

Jan 12 '24 18:01 HannanKan

Hi, I'd like to support dense_rank.

Jan 22 '24 03:01 JkSelf

dense_rank is already supported in Velox: https://github.com/facebookincubator/velox/pull/6289.

Jan 22 '24 06:01 JkSelf

[ ] percentile_approx

[ ] approx_percentile: "The third argument, accuracy, differs from Velox: it is a double in Velox but a long in Spark"

The two stand for the same function I assume? I'll take these two if nobody is working on it.

Jan 22 '24 06:01 zhztheplayer
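To illustrate the type mismatch noted in the comment above, here is a minimal sketch that turns a Spark-style integer accuracy into a fractional error bound. It assumes Spark's documented meaning (1.0/accuracy is the relative error of the approximation) and assumes that the Velox double argument is an error bound between 0 and 1. The helper name `spark_accuracy_to_velox` is hypothetical, and this is not necessarily how Gluten handles the argument.

```python
def spark_accuracy_to_velox(accuracy: int) -> float:
    """Hypothetical helper: map a Spark integer accuracy to a fractional error bound.

    Assumption: Spark treats 1.0 / accuracy as the relative error, while the
    Velox-side argument is taken to be that error bound expressed as a double.
    """
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(accuracy, bool) or not isinstance(accuracy, int):
        raise TypeError("Spark accuracy is expected to be an integer (long)")
    if accuracy <= 0:
        raise ValueError("Spark accuracy must be positive")
    return 1.0 / accuracy


# 10000 is Spark's documented default accuracy for percentile_approx.
default_bound = spark_accuracy_to_velox(10000)
```

Whether a bound of exactly 1.0 (Spark accuracy of 1) is acceptable on the Velox side is left open here and would need to be checked against the Velox implementation.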

[ ] percentile_approx

[ ] approx_percentile: "The third argument, accuracy, differs from Velox: it is a double in Velox but a long in Spark"

The two stand for the same function I assume? I'll take these two if nobody is working on it.

Yes, they are one thing. Just unify them into one checkbox. Thanks!

Jan 22 '24 07:01 PHILO-HE

I will take a look at the ntile window function.

Jan 22 '24 08:01 JkSelf

unbase64: https://github.com/oap-project/gluten/pull/4482

Jan 26 '24 03:01 zhouyuan

Is there any plan to support the from_json function?

Jan 26 '24 06:01 zjuwangg

I'd like to take map_entries and map_from_entries. There are already Presto implementations in Velox; I will need to check consistency.

Jan 29 '24 07:01 yma11

I'd like to give date_from_unix_date a shot

Jan 31 '24 13:01 acvictor

Just removed the below functions from the list, since they have been supported. Thanks! @acvictor, @Yohahaha, @fyp711, @zwangsheng, @JkSelf, etc.

to_date hour mod pow ifnull add_months next_day dense_rank find_in_set hex ntile
date_from_unix_date array_repeat array_position array_except array_distinct weekday
year month day

Feb 21 '24 02:02 PHILO-HE

@PHILO-HE I see support for year, month, day, last_day in Velox too. I can also give from_utc_timestamp a go.

Feb 21 '24 05:02 acvictor

nullif is supported out of the box. Spark sends the converted expression as an If expression, which is supported in Gluten.

Feb 21 '24 14:02 Surbhi-Vijay
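The rewrite described in the comment above can be sketched in plain Python. This is only a model of the semantics, assuming nullif(left, right) is replaced by an If expression of the form "if left = right then NULL else left" with SQL three-valued equality. The helper names `sql_equal` and `sql_if` are hypothetical and are not Spark or Gluten APIs.

```python
from typing import Any, Optional


def sql_equal(left: Any, right: Any) -> Optional[bool]:
    """SQL-style equality: comparing with NULL (None) yields NULL (None)."""
    if left is None or right is None:
        return None
    return left == right


def sql_if(predicate: Optional[bool], true_value: Any, false_value: Any) -> Any:
    """SQL-style If: a NULL predicate selects the false branch."""
    return true_value if predicate is True else false_value


def nullif(left: Any, right: Any) -> Any:
    """nullif modelled as the If expression it is assumed to be rewritten to."""
    return sql_if(sql_equal(left, right), None, left)
```

Because only the If and equality expressions appear after the rewrite, a backend that supports those two does not need a dedicated nullif implementation.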

nullif is supported out of the box. Spark sends the converted expression as an If expression, which is supported in Gluten.

Thanks so much for your feedback! Just removed it from the list.

Feb 22 '24 01:02 PHILO-HE

@PHILO-HE I see support for year, month, day, last_day in Velox too. I can also give from_utc_timestamp a go.

Will do minute as well.

Feb 22 '24 07:02 acvictor

I'd like to work on locate and arrayintersect.

Feb 26 '24 08:02 rui-mo

I would like to work on bool_and, bool_or

Feb 27 '24 06:02 mskapilks

[x] collect_list (velox supported, needs Gluten to enable array for project plan node)

[x] collect_set

@PHILO-HE Should we uncheck these two? I ran a test and both functions fall back (in 3.3).

Feb 29 '24 05:02 zhztheplayer

I would like to give printf a try.

Mar 06 '24 14:03 Surbhi-Vijay

I would like to work on bool_and, bool_or

These are already supported, it seems. bool_and, bool_or, every, and some all get converted to min/max of a bool column.

Mar 22 '24 04:03 mskapilks
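The min/max conversion mentioned in the comment above works because false orders before true. The following is a minimal sketch of that equivalence, assuming NULL inputs are skipped and an input with no non-NULL values yields NULL, as is usual for SQL aggregates. The function names mirror the SQL functions but are plain Python helpers, not Gluten or Velox code.

```python
from typing import Iterable, List, Optional


def _non_null(values: Iterable[Optional[bool]]) -> List[bool]:
    """Drop NULLs (None), as SQL aggregates are assumed to do."""
    return [v for v in values if v is not None]


def bool_and(values: Iterable[Optional[bool]]) -> Optional[bool]:
    """bool_and / every as min over booleans (False < True)."""
    kept = _non_null(values)
    return min(kept) if kept else None


def bool_or(values: Iterable[Optional[bool]]) -> Optional[bool]:
    """bool_or / any / some as max over booleans."""
    kept = _non_null(values)
    return max(kept) if kept else None


# Cross-check against Python's all/any on a non-empty, NULL-free input.
sample = [True, False, True]
matches_builtin = (bool_and(sample) == all(sample)) and (bool_or(sample) == any(sample))
```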

@PHILO-HE I would like to take map_filter. BTW, map_entries is completed by PR.

Mar 22 '24 05:03 yma11

@PHILO-HE, I'd like to pick up base64 and unbase64, please.

(FYI, looks like there was a PR above for unbase64, but it seems to have been closed without committing ~45-55 days ago, so hopefully I am not conflicting with any work).

Mar 26 '24 15:03 supermem613

@PHILO-HE, I'd like to pick up base64 and unbase64, please.

(FYI, looks like there was a PR above for unbase64, but it seems to have been closed without committing ~45-55 days ago, so hopefully I am not conflicting with any work).

Hi @supermem613, sorry for the late reply. I note Gluten PR https://github.com/apache/incubator-gluten/pull/5242 is trying to re-use Velox's existing from_base64 function (proposed for prestosql) for unbase64. Not sure whether we can map base64 to some other function. If there is no semantic difference, we can just re-use the existing Velox functions.

Apr 03 '24 09:04 PHILO-HE

Just removed the following supported functions from the list above. Thanks for the contribution! last_day, unhex, lead, lag, minute, second, map_keys

Apr 03 '24 09:04 PHILO-HE

Hi @PHILO-HE I'd like to take filter (array filter), thanks.

Apr 08 '24 07:04 ivoson

I'd like to take the percentile agg function.

Apr 11 '24 08:04 Yohahaha