velox icon indicating copy to clipboard operation
velox copied to clipboard

Support regex delimiter and limit argument for Spark split function

Open gaoyangxiaozhu opened this issue 1 year ago • 12 comments

Refactor spark split function to align with vinalia spark split function

  1. Support split using any regular expression (constant or non-constant)
  2. Support limit parameter , limit can also <=0

https://spark.apache.org/docs/latest/api/sql/#split

gaoyangxiaozhu avatar Jun 18 '24 08:06 gaoyangxiaozhu

@rui-mo / @PHILO-HE for help review.

gaoyangxiaozhu avatar Jun 18 '24 08:06 gaoyangxiaozhu

Deploy Preview for meta-velox canceled.

Name Link
Latest commit 2a399d72d9fd095efd1f464cacfb1d2d8df75152
Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/66b32e174ee1b7000855e17a

netlify[bot] avatar Jun 18 '24 08:06 netlify[bot]

@gaoyangxiaozhu Thanks for your work. There is an existing PR supporting regex in split function https://github.com/facebookincubator/velox/pull/6155, and maybe we need to confirm its status first. cc: @PHILO-HE

rui-mo avatar Jun 19 '24 02:06 rui-mo

@gaoyangxiaozhu Thanks for your work. There is an existing PR supporting regex in split function #6155, and maybe we need to confirm its status first. cc: @PHILO-HE

oh, i see, i didn't go through the PRs to check if already people have touch this work. sorry.

gaoyangxiaozhu avatar Jun 19 '24 08:06 gaoyangxiaozhu

@gaoyangxiaozhu Thanks for your work. There is an existing PR supporting regex in split function #6155, and maybe we need to confirm its status first. cc: @PHILO-HE

btw, @rui-mo / @PHILO-HE . looks https://github.com/facebookincubator/velox/pull/6155 only make support for constant pattern, not cover for non-constant pattern or limit parameter. Will confirm in the PR comments

gaoyangxiaozhu avatar Jun 19 '24 09:06 gaoyangxiaozhu

@rui-mo / @PHILO-HE / @FelixYBW since the PR https://github.com/facebookincubator/velox/pull/6155 only consider constant pattern support and has no update long time, could you start review my Pr for split improvement to let's help make this part finalize asap ?

gaoyangxiaozhu avatar Jun 25 '24 06:06 gaoyangxiaozhu

@rui-mo / @PHILO-HE / @FelixYBW since the PR #6155 only consider constant pattern support and has no update long time, could you start review my Pr for split improvement to let's help make this part finalize asap ?

@jackylee-ch, what do you think? I guess you took over the work of 6155.

PHILO-HE avatar Jun 26 '24 01:06 PHILO-HE

@rui-mo / @PHILO-HE / @FelixYBW since the PR #6155 only consider constant pattern support and has no update long time, could you start review my Pr for split improvement to let's help make this part finalize asap ?

Sorry for late response. It's ok to move on this PR since it covers #6155. Just concern about the performance impact of this PR. @gaoyangxiaozhu, maybe create a benchmark in gluten and make sure it won't cause negative performance gains?

jackylee-ch avatar Jun 26 '24 01:06 jackylee-ch

@rui-mo / @PHILO-HE / @FelixYBW since the PR #6155 only consider constant pattern support and has no update long time, could you start review my Pr for split improvement to let's help make this part finalize asap ?

Sorry for late response. It's ok to move on this PR since it covers #6155. Just concern about the performance impact of this PR. @gaoyangxiaozhu, maybe create a benchmark in gluten and make sure it won't cause negative performance gains?

sure, will do. @rui-mo / @PHILO-HE so please help review.

gaoyangxiaozhu avatar Jun 26 '24 07:06 gaoyangxiaozhu

@gaoyangxiaozhu Could you check the corner cases and corresponding tests from my PR first? Thanks. https://github.com/facebookincubator/velox/pull/8825/files#diff-ebc054da2ecc11ac3cf20d47cbc112f9c3b9cf3d5f1784b2149fb0f4e997a2d8R127-R132

updated and added ut to cover - https://github.com/gayangya/velox/blob/eddfb43831ac083a53ad17fcdca28bbd0c4b1e37/velox/functions/sparksql/tests/SplitFunctionsTest.cpp#L250

gaoyangxiaozhu avatar Jun 27 '24 08:06 gaoyangxiaozhu

Could you run the fuzzer test for 5 minute using a command like below?

./velox/expression/fuzzer/spark_expression_fuzzer_test \
                 --seed ${RANDOM} \
                 --duration_sec 300 \
                 --minloglevel=0 \
                 --stderrthreshold=2 \
                 --only split

thanks @rui-mo , update a new commit to fix one regex match issue if the delimiter is not empty but has empty size match, and also locally run fuzz test 5 minutes and it passed.

./velox/expression/fuzzer/spark_expression_fuzzer_test --seed 1 --velox_fuzzer_enable_complex_types --duration_sec 300 --minloglevel=0 --stderrthreshold=2 --only split

image

gaoyangxiaozhu avatar Jul 04 '24 08:07 gaoyangxiaozhu

I also run again locally for fuzz test 5 minutes and works fine

gaoyangxiaozhu avatar Jul 05 '24 08:07 gaoyangxiaozhu

@rui-mo any new comments

gaoyangxiaozhu avatar Jul 09 '24 02:07 gaoyangxiaozhu

kindly ping @rui-mo for any new comments, thank you!

gaoyangxiaozhu avatar Jul 11 '24 13:07 gaoyangxiaozhu

Thanks. Will mark it if no other comments from @PHILO-HE @jinchengchenghh.

cool! thank you @rui-mo, then @PHILO-HE / @jinchengchenghh please help update if you have any new comments

gaoyangxiaozhu avatar Jul 12 '24 02:07 gaoyangxiaozhu

please help review and let's make sure if can make this PR ready this week @PHILO-HE / @jinchengchenghh , thank you!

gaoyangxiaozhu avatar Jul 15 '24 01:07 gaoyangxiaozhu

Only one @PHILO-HE 's question, I think we should check string is flat for strings->data() optimization, but we also have a else branch that not check string only check delimiter and limit is constant, because both dictionary vector and lazy vector is common. We should not let it move to non-constant delimiter and non-constant limit branch.

i got your point , but it is actually optional only for another optmization. Let me do in following patch.

gaoyangxiaozhu avatar Jul 15 '24 11:07 gaoyangxiaozhu

@PHILO-HE any new comments ?

gaoyangxiaozhu avatar Jul 15 '24 11:07 gaoyangxiaozhu

but it is actually optional only for another optmization. Let me do in following patch.

@gaoyangxiaozhu It make sense to me to add current fast path. Perhaps we can take it as a follow-up to use a benchmark to determine if further optimization is needed.

Would you address this comment https://github.com/facebookincubator/velox/pull/10248#discussion_r1677375306 ?

rui-mo avatar Jul 15 '24 11:07 rui-mo

but it is actually optional only for another optmization. Let me do in following patch.

@gaoyangxiaozhu It make sense to me to add current fast path. Perhaps we can take it as a follow-up to use a benchmark to determine if further optimization is needed.

Would you address this comment #10248 (comment) ?

done

gaoyangxiaozhu avatar Jul 15 '24 13:07 gaoyangxiaozhu

@mbasmanova for any new comments ?

gaoyangxiaozhu avatar Jul 25 '24 11:07 gaoyangxiaozhu

@mbasmanova ping again

gaoyangxiaozhu avatar Jul 26 '24 08:07 gaoyangxiaozhu

could we help speed up the review process, thank you

gaoyangxiaozhu avatar Jul 29 '24 03:07 gaoyangxiaozhu

LGTM Thanks!

jinchengchenghh avatar Jul 29 '24 06:07 jinchengchenghh

ping @mbasmanova

gaoyangxiaozhu avatar Jul 31 '24 08:07 gaoyangxiaozhu

kindly ping @mbasmanova

gaoyangxiaozhu avatar Aug 07 '24 02:08 gaoyangxiaozhu

@xiaoxmeng Can you merge the PR?

FelixYBW avatar Aug 12 '24 08:08 FelixYBW

@xiaoxmeng has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Aug 12 '24 16:08 facebook-github-bot

@xiaoxmeng merged this pull request in facebookincubator/velox@b35bd6169c5f1aa8f0f3cfe75a89d5b8733be65d.

facebook-github-bot avatar Aug 12 '24 22:08 facebook-github-bot

Conbench analyzed the 1 benchmark run on commit b35bd616.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

conbench-facebook[bot] avatar Aug 12 '24 23:08 conbench-facebook[bot]