
[VL] Unsupported spark function list [please leave a comment if you plan to pick some]

Open PHILO-HE opened this issue 2 years ago • 72 comments

Description

Listed here are Spark functions that are not yet supported by the Gluten Velox backend. Please leave a comment if you'd like to pick some. In the list below, a checked box means someone is already working on the corresponding function. You can find the support status of all functions in this gluten doc.

To avoid duplicate work, before starting, please check whether a PR has already been submitted in the Velox community, or whether the function is already implemented in Velox, which holds most SQL functions in its sparksql and prestosql folders.

Reference:


  • [x] percentile_approx/approx_percentile (WIP, guangxin)
  • [x] concat_ws (PR ready, https://github.com/facebookincubator/velox/pull/8854)
  • [x] unix_timestamp: "Only supports string type, with session timezone considered, todo: support date type"
  • [x] locate
  • [x] parse_url (PR drafted, not merged)
  • [x] urldecoder: "UDF, supported by spark as a built-in function since 3.4.0."
  • [ ] normalizenanandzero
  • [x] arrayintersects
  • [ ] default.json_split (udf, no need to impl.): "external UDF"
  • [ ] parsejsonarray: "external UDF"
  • [x] struct
  • [x] percentile (@Yohahaha)
  • [x] first/first_value (@JkSelf)
  • [x] last/last_value (@JkSelf)
  • [x] posexplode (WIP, @marin-ma)
  • [x] trunc (WIP, HannanKan)
  • [x] months_between (PR ready)
  • [x] date_trunc (WIP, HannanKan)
  • [ ] stack
  • [ ] grouping_id
  • [x] printf (@Surbhi-Vijay)
  • [x] space (WIP, rhh777)
  • [x] inline (WIP, @marin-ma)
  • [x] to_unix_timestamp: "Only supports string type, with session timezone considered. todo: support date type"
  • [ ] from_csv
  • [ ] from_json
  • [ ] json_object_keys
  • [ ] json_tuple
  • [ ] schema_of_csv
  • [ ] schema_of_json
  • [ ] to_csv
  • [x] to_json (should be workable using a folly function)
  • [x] make_ym_interval (WIP, @marin-ma)
  • [x] make_timestamp (WIP, @marin-ma)
  • [ ] make_interval
  • [ ] make_dt_interval
  • [ ] monotonically_increasing_id
  • [x] from_utc_timestamp (@acvictor)
  • [ ] extract
  • [ ] exists (@lyy-pineapple)
  • [ ] date_part
  • [ ] zip_with
  • [x] transform (@Yohahaha)
  • [ ] transform_keys
  • [ ] transform_values
  • [x] map_from_entries (WIP, MaYan)
  • [x] map_filter (WIP, MaYan)
  • [x] map_entries (Done, by MaYan)
  • [ ] map_concat
  • [x] forall (@lyy-pineapple)
  • [x] flatten (@ivoson)
  • [ ] filter
  • [x] filter (array) (@ivoson)
  • [ ] width_bucket
  • [x] array_sort (@boneanxs)
  • [ ] xpath
  • [ ] xpath_boolean
  • [ ] xpath_double
  • [ ] xpath_float
  • [ ] xpath_int
  • [ ] xpath_long
  • [ ] xpath_number
  • [ ] xpath_short
  • [ ] xpath_string
  • [ ] unbase64 (WIP, @fyp711)
  • [ ] decode (partially supported if translated to caseWhen. WIP Cody)
  • [ ] initcap (WIP, velox PR: 8676)
  • [x] unix_date (velox PR 8725, completed)
  • [ ] count_min_sketch
  • [x] bool_and/every (@mskapilks)
  • [x] bool_or/any/some (@mskapilks)
  • [x] shuffle (completed)
  • [x] bround (@xumingming)
  • [x] format_string (@gaoyangxiaozhu)
  • [x] format_number (@gaoyangxiaozhu)
  • [x] soundex (@zhli1142015)
  • [x] levenshtein (@zhli1142015)
  • [x] cot (@honeyhexin)
  • [x] expm1 (@Donvi)
  • [x] stack (generator function, @xumingming)
  • [x] randn (@Donvi)
  • [x] empty2null (internal function, @jinchengchenghh)
  • [x] toprettystring (internal function, @jinchengchenghh)
  • [x] AtLeastNNonNulls (internal function, @zhli1142015)
  • Since Spark-3.3 (related to ML, low priority)
  • [ ] regr_count
  • [ ] regr_avgx
  • [ ] regr_avgy
  • [x] regr_r2
  • [ ] regr_sxx
  • [x] regr_sxy
  • [ ] regr_syy
  • [ ] regr_slope
  • [ ] regr_intercept
  • Since Spark-3.3

  • Since Spark-3.4

  • [ ] mode
  • [x] get (@Yohahaha)
  • [x] array_append (@ivoson)
  • [x] array_insert (@ivoson)
  • [x] mode (@zhli1142015)
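
The unix_timestamp and to_unix_timestamp entries above note that the string input is interpreted "with session timezone considered". A minimal sketch of what that implies, assuming a fixed UTC offset as a stand-in for a named session time zone and a Python strptime pattern in place of Spark's format string. The helper name `unix_timestamp_sketch` is hypothetical and is not Gluten or Velox code:

```python
from datetime import datetime, timedelta, timezone


def unix_timestamp_sketch(ts: str, utc_offset_hours: int,
                          fmt: str = "%Y-%m-%d %H:%M:%S") -> int:
    """Interpret ts as wall-clock time at the given UTC offset; return epoch seconds.

    Assumption: a fixed offset stands in for the Spark session time zone.
    """
    session_tz = timezone(timedelta(hours=utc_offset_hours))
    local = datetime.strptime(ts, fmt).replace(tzinfo=session_tz)
    return int(local.timestamp())


# The same string maps to different epoch seconds under different session time zones.
print(unix_timestamp_sketch("2024-01-01 00:00:00", 0))  # 1704067200
print(unix_timestamp_sketch("2024-01-01 00:00:00", 8))  # 1704038400
```

This is why a native implementation has to receive the session time zone rather than assume UTC: the result shifts by the zone's offset.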

PHILO-HE avatar Dec 14 '23 03:12 PHILO-HE

I'd like to support hex and unhex.

Update: hex and unhex are already supported in Gluten.

Yohahaha avatar Dec 29 '23 03:12 Yohahaha

Hi, I'd like to give the hour function a try.

zwangsheng avatar Jan 03 '24 06:01 zwangsheng

Hi, I'd like to have a look into map_keys

konjac avatar Jan 04 '24 08:01 konjac

Hi, I'd like to support find_in_set in Velox.

fyp711 avatar Jan 11 '24 09:01 fyp711

Hi, I'd like to support date_trunc/trunc.

HannanKan avatar Jan 12 '24 18:01 HannanKan

Hi, I'd like to support dense_rank.

JkSelf avatar Jan 22 '24 03:01 JkSelf

dense_rank is already supported in Velox: https://github.com/facebookincubator/velox/pull/6289.

JkSelf avatar Jan 22 '24 06:01 JkSelf

  • [ ] percentile_approx
  • [ ] approx_percentile: "The third argument, accuracy, differs from Velox: Velox takes a double but Spark takes a long"

The two stand for the same function I assume? I'll take these two if nobody is working on it.

zhztheplayer avatar Jan 22 '24 06:01 zhztheplayer

  • [ ] percentile_approx
  • [ ] approx_percentile: "The third argument, accuracy, differs from Velox: Velox takes a double but Spark takes a long"

The two stand for the same function I assume? I'll take these two if nobody is working on it.

Yes, they are one thing. Just unify them into one checkbox. Thanks!

PHILO-HE avatar Jan 22 '24 07:01 PHILO-HE
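
For the accuracy mismatch noted above, Spark documents its integer accuracy so that 1.0/accuracy is the relative error of the approximation. A minimal sketch of one possible bridge, assuming the double on the Velox side is an error bound of that form. That mapping is an assumption, not something confirmed in this thread, and the helper name is hypothetical:

```python
def spark_accuracy_to_error_bound(accuracy: int) -> float:
    """Convert Spark's integer accuracy into a fractional error bound.

    Assumption: the native function expects the bound as a double equal
    to 1.0 / accuracy.
    """
    if accuracy <= 0:
        raise ValueError("accuracy must be a positive integer")
    return 1.0 / accuracy


# Spark's default accuracy of 10000 corresponds to a relative error of 1e-4.
print(spark_accuracy_to_error_bound(10000))
```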

I will take a look at the ntile window function.

JkSelf avatar Jan 22 '24 08:01 JkSelf

unbase64: https://github.com/oap-project/gluten/pull/4482

zhouyuan avatar Jan 26 '24 03:01 zhouyuan

Is there any plan to support the from_json function?

zjuwangg avatar Jan 26 '24 06:01 zjuwangg

I'd like to take map_entries and map_from_entries. There are already Presto implementations in Velox; I will need to check consistency.

yma11 avatar Jan 29 '24 07:01 yma11

I'd like to give date_from_unix_date a shot

acvictor avatar Jan 31 '24 13:01 acvictor

Just removed the below functions from the list, since they have been supported. Thanks! @acvictor, @Yohahaha, @fyp711, @zwangsheng, @JkSelf, etc.

to_date hour mod pow ifnull add_months next_day dense_rank find_in_set hex ntile
date_from_unix_date array_repeat array_position array_except array_distinct weekday
year month day

PHILO-HE avatar Feb 21 '24 02:02 PHILO-HE

@PHILO-HE I see support for year, month, day, last_day in Velox too. I can also give from_utc_timestamp a go.

acvictor avatar Feb 21 '24 05:02 acvictor

nullif is supported out of the box. Spark sends the converted expression as an If expression, which is supported in Gluten.

Surbhi-Vijay avatar Feb 21 '24 14:02 Surbhi-Vijay

nullif is supported out of the box. Spark sends the converted expression as an If expression, which is supported in Gluten.

Thanks so much for your feedback! Just removed it from the list.

PHILO-HE avatar Feb 22 '24 01:02 PHILO-HE
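
The nullif rewrite described above can be sketched in plain Python. This is only an illustration of the If-expression form, with None standing in for SQL NULL; the null-comparison handling is an assumption modeled on SQL equality, and the function is hypothetical rather than Gluten code:

```python
from typing import Optional, TypeVar

T = TypeVar("T")


def nullif(left: Optional[T], right: Optional[T]) -> Optional[T]:
    """Model nullif(left, right) as: if (left = right) null else left.

    Assumption: when either side is NULL the equality is not true,
    so the else branch (left) is returned.
    """
    if left is not None and right is not None and left == right:
        return None
    return left


print(nullif(1, 1))     # None
print(nullif(1, 2))     # 1
print(nullif(1, None))  # 1
```

Because the function reduces to a conditional plus an equality check, a backend that already supports both needs no dedicated nullif implementation.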

@PHILO-HE I see support for year, month, day, last_day in Velox too. I can also give from_utc_timestamp a go.

Will do minute as well.

acvictor avatar Feb 22 '24 07:02 acvictor

I'd like to work on locate and arrayintersect.

rui-mo avatar Feb 26 '24 08:02 rui-mo

I would like to work on bool_and, bool_or

mskapilks avatar Feb 27 '24 06:02 mskapilks

  • [x] collect_list (velox supported, needs Gluten to enable array for project plan node)
  • [x] collect_set

@PHILO-HE Should we uncheck these two? I ran a test and both functions fell back (in 3.3).

zhztheplayer avatar Feb 29 '24 05:02 zhztheplayer

I would like to give printf a try.

Surbhi-Vijay avatar Mar 06 '24 14:03 Surbhi-Vijay

I would like to work on bool_and, bool_or

These seem to be supported already. bool_and, bool_or, every, and some all get converted to min or max of the bool column.

mskapilks avatar Mar 22 '24 04:03 mskapilks
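
The min/max equivalence mentioned above can be checked with a small Python sketch. Since False orders before True, min over booleans behaves like a logical AND and max like a logical OR. The null handling here (skipping None, returning None for empty input) is an assumption modeled on typical SQL aggregates, and the helper names are hypothetical:

```python
from typing import Iterable, Optional


def bool_and(values: Iterable[Optional[bool]]) -> Optional[bool]:
    """bool_and / every expressed as min over the non-null booleans."""
    non_null = [v for v in values if v is not None]
    return min(non_null) if non_null else None


def bool_or(values: Iterable[Optional[bool]]) -> Optional[bool]:
    """bool_or / any / some expressed as max over the non-null booleans."""
    non_null = [v for v in values if v is not None]
    return max(non_null) if non_null else None


column = [True, None, False, True]
print(bool_and(column))  # False
print(bool_or(column))   # True
```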

@PHILO-HE I would like to take map_filter. BTW, map_entries is completed by PR.

yma11 avatar Mar 22 '24 05:03 yma11

@PHILO-HE , I'd like to pick up base64 and unbase64, please.

(FYI, looks like there was a PR above for unbase64, but it seems to have been closed without committing ~45-55 days ago, so hopefully I am not conflicting with any work).

supermem613 avatar Mar 26 '24 15:03 supermem613

@PHILO-HE , I'd like to pick up base64 and unbase64, please.

(FYI, looks like there was a PR above for unbase64, but it seems to have been closed without committing ~45-55 days ago, so hopefully I am not conflicting with any work).

Hi @supermem613, sorry for the late reply. I note Gluten PR https://github.com/apache/incubator-gluten/pull/5242 is trying to re-use Velox's existing from_base64 function (proposed for prestosql) for unbase64. Not sure whether we can map base64 to some other function. If there is no semantic difference, we can just re-use the existing Velox functions.

PHILO-HE avatar Apr 03 '24 09:04 PHILO-HE

Just removed the following supported functions from the above list. Thanks for the contribution! last_day, unhex, lead, lag, minute, second, map_keys

PHILO-HE avatar Apr 03 '24 09:04 PHILO-HE

Hi @PHILO-HE I'd like to take filter (array filter), thanks.

ivoson avatar Apr 08 '24 07:04 ivoson

I'd like to take the percentile agg function.

Yohahaha avatar Apr 11 '24 08:04 Yohahaha