[VL] Unsupported spark function list [please leave a comment if you plan to pick some]
Description
This issue lists Spark functions that are not yet supported by the Gluten Velox backend. Please leave a comment if you'd like to pick some. In the list below, [√] means someone is already working on the corresponding function. You can find the support status of all functions in this gluten doc.
To avoid duplicate work, before starting, please check whether a PR has already been submitted in the Velox community or whether the function is already implemented in Velox, which holds most SQL functions in its sparksql and prestosql folders.
Reference:
- [x] percentile_approx/approx_percentile (WIP, guangxin)
- [x] concat_ws (PR ready, https://github.com/facebookincubator/velox/pull/8854)
- [x] unix_timestamp: "Only supports string type, with session timezone considered, todo: support date type"
- [x] locate
- [x] parse_url (PR drafted, not merged)
- [x] urldecoder: "UDF, supported by spark as a built-in function since 3.4.0."
- [ ] normalizenanandzero
- [x] arrayintersects
- [ ] default.json_split (udf, no need to impl.): "external UDF"
- [ ] parsejsonarray: "external UDF"
- [x] struct
- [x] percentile (@Yohahaha)
- [x] first/first_value (@JkSelf)
- [x] last/last_value (@JkSelf)
- [x] posexplode (WIP, @marin-ma)
- [x] trunc (WIP, HannanKan)
- [x] months_between (PR ready)
- [x] date_trunc (WIP, HannanKan)
- [ ] stack
- [ ] grouping_id
- [x] printf (@Surbhi-Vijay)
- [x] space (WIP, rhh777)
- [x] inline (WIP, @marin-ma)
- [x] to_unix_timestamp: "Only supports string type, with session timezone considered. todo: support date type"
- [ ] from_csv
- [ ] from_json
- [ ] json_object_keys
- [ ] json_tuple
- [ ] schema_of_csv
- [ ] schema_of_json
- [ ] to_csv
- [x] to_json (Suppose workable with folly function used)
- [x] make_ym_interval (WIP, @marin-ma)
- [x] make_timestamp (WIP, @marin-ma)
- [ ] make_interval
- [ ] make_dt_interval
- [ ] monotonically_increasing_id
- [x] from_utc_timestamp (@acvictor)
- [ ] extract
- [ ] exists (@lyy-pineapple)
- [ ] date_part
- [ ] zip_with
- [x] transform (@Yohahaha)
- [ ] transform_keys
- [ ] transform_values
- [x] map_from_entries (WIP, MaYan)
- [x] map_filter (WIP, MaYan)
- [x] map_entries (Done, by MaYan)
- [ ] map_concat
- [x] forall (@lyy-pineapple)
- [x] flatten (@ivoson)
- [ ] filter
- [x] filter (array) (@ivoson)
- [ ] width_bucket
- [x] array_sort (@boneanxs)
- [ ] xpath
- [ ] xpath_boolean
- [ ] xpath_double
- [ ] xpath_float
- [ ] xpath_int
- [ ] xpath_long
- [ ] xpath_number
- [ ] xpath_short
- [ ] xpath_string
- [ ] unbase64 (WIP, @fyp711)
- [ ] decode (partially supported if translated to caseWhen. WIP Cody)
- [ ] initcap (WIP, velox PR: 8676)
- [x] unix_date (velox PR 8725, completed)
- [ ] count_min_sketch
- [x] bool_and/every (@mskapilks)
- [x] bool_or/any/some (@mskapilks)
- [x] shuffle (completed)
- [x] bround (@xumingming)
- [x] format_string (@gaoyangxiaozhu)
- [x] format_number (@gaoyangxiaozhu)
- [x] soundex (@zhli1142015)
- [x] levenshtein (@zhli1142015)
- [x] cot (@honeyhexin)
- [x] expm1 (@Donvi)
- [x] stack (generator function, @xumingming)
- [x] randn (@Donvi)
- [x] empty2null (internal function, @jinchengchenghh)
- [x] toprettystring (internal function, @jinchengchenghh)
- [x] AtLeastNNonNulls (internal function, @zhli1142015)
- Since Spark-3.3 (related to ML, low priority)
- [ ] regr_count
- [ ] regr_avgx
- [ ] regr_avgy
- [x] regr_r2
- [ ] regr_sxx
- [x] regr_sxy
- [ ] regr_syy
- [ ] regr_slope
- [ ] regr_intercept
- Since Spark-3.3
- Since Spark-3.4
- [ ] mode
- [x] get (@Yohahaha)
- [x] array_append (@ivoson)
- [x] array_insert (@ivoson)
- [x] mode (@zhli1142015)
I'd like to support hex and unhex.
Update: hex and unhex are already supported in Gluten.
Hi, I'd like to give the hour function a try.
Hi, I'd like to have a look into map_keys
Hi, I'd like to support find_in_set in Velox.
Hi, I'd like to support date_trunc/trunc.
Hi, I'd like to support dense_rank.
dense_rank is already supported in Velox: https://github.com/facebookincubator/velox/pull/6289.
- [ ] percentile_approx
- [ ] approx_percentile: "Third argument accuracy is different with velox, velox is double but spark is long"
The two stand for the same function I assume? I'll take these two if nobody is working on it.
Yes, they are one thing. Just unify them into one checkbox. Thanks!
I will take a look at the ntile window function.
unbase64: https://github.com/oap-project/gluten/pull/4482
Is there any plan to support the from_json function?
I'd like to take map_entries and map_from_entries. There are already Presto implementations in Velox, but I will need to check that they are consistent with Spark.
I'd like to give date_from_unix_date a shot
Just removed the below functions from the list, since they have been supported. Thanks! @acvictor, @Yohahaha, @fyp711, @zwangsheng, @JkSelf, etc.
to_date hour mod pow ifnull add_months next_day dense_rank find_in_set hex ntile
date_from_unix_date array_repeat array_position array_except array_distinct weekday
year month day
@PHILO-HE I see support for year, month, day, last_day in Velox too. I can also give from_utc_timestamp a go.
nullif is supported out of the box. Spark rewrites it into an If expression before sending it, and If is supported in Gluten.
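To illustrate the rewrite described above, here is a minimal sketch in plain Python (not Spark or Gluten code) of how nullif can be expressed through an If expression. It assumes the rewrite has the shape If(left = right, NULL, left); the helper names `sql_equals` and `if_expr` are hypothetical and exist only for this illustration, with None standing in for SQL NULL.

```python
from typing import Any, Optional


def sql_equals(left: Any, right: Any) -> Optional[bool]:
    """SQL-style equality: the result is NULL (None) if either side is NULL."""
    if left is None or right is None:
        return None
    return left == right


def if_expr(predicate: Optional[bool], true_value: Any, false_value: Any) -> Any:
    """SQL-style If: a NULL predicate is treated like false."""
    return true_value if predicate is True else false_value


def nullif(left: Any, right: Any) -> Any:
    """nullif(left, right) written as If(left = right, NULL, left) (assumed shape)."""
    return if_expr(sql_equals(left, right), None, left)
```

Under these assumptions, a backend that already handles If and equality needs no dedicated nullif function, which matches the observation above.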
Thanks so much for your feedback! Just removed it from the list.
Will do minute as well.
I'd like to work on locate and arrayintersect.
I would like to work on bool_and, bool_or
- [x] collect_list (velox supported, needs Gluten to enable array for project plan node)
- [x] collect_set
@PHILO-HE Should we uncheck these two? I ran a test and both functions fall back (in 3.3).
I would like to give printf a try.
It seems these are already supported. bool_and, bool_or, every, and some all get converted to min or max over a boolean column.
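As a rough illustration of why that conversion works, here is a small sketch in plain Python (not Spark or Velox code). It assumes false orders before true and that the aggregates skip NULLs, returning NULL when there is no non-NULL input; None stands in for SQL NULL.

```python
from typing import Iterable, Optional


def bool_and(values: Iterable[Optional[bool]]) -> Optional[bool]:
    """bool_and / every as min over the non-NULL booleans (False < True)."""
    non_null = [v for v in values if v is not None]
    return min(non_null) if non_null else None


def bool_or(values: Iterable[Optional[bool]]) -> Optional[bool]:
    """bool_or / any / some as max over the non-NULL booleans."""
    non_null = [v for v in values if v is not None]
    return max(non_null) if non_null else None
```

With this ordering, min is false as soon as any input is false, and max is true as soon as any input is true, so existing min/max support is enough to cover these aggregates.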
@PHILO-HE I would like to take map_filter. BTW, map_entries is completed by PR.
@PHILO-HE , I'd like to pick up base64 and unbase64, please.
(FYI, looks like there was a PR above for unbase64, but it seems to have been closed without committing ~45-55 days ago, so hopefully I am not conflicting with any work).
Hi @supermem613, sorry for the late reply. I note Gluten PR https://github.com/apache/incubator-gluten/pull/5242 is trying to re-use Velox's existing from_base64 function (proposed for prestosql) for unbase64. Not sure whether we can map base64 to some other function. If there is no semantic difference, we can just re-use the existing Velox functions.
Just removed the below supported functions from the above list. Thanks for the contribution!
last_day, unhex, lead, lag, minute, second, map_keys
Hi @PHILO-HE I'd like to take filter (array filter), thanks.
I'd like to take the percentile aggregate function.